218: Balancing test coverage with test costs - Nicole Tietz-Sokolskaya

Brian:

Today on the show, I have Nicole, and we're gonna talk about testing in Python and a whole bunch of fun stuff. Welcome, Nicole.

Nicole:

Yeah. Thanks for having me on, Brian.

Brian:

Before we jump in, I'd like to have you kinda introduce who you are and what you do.

Nicole:

Yeah. Absolutely. I am Nicole Tietz-Sokolskaya. I go by ntietz almost every place on the web, and I'm a principal software engineer at a tiny startup called Remesh. So most of my focus there is performance, scaling, back end stuff.

Nicole:

And so we use Python for a lot of our back end code and our machine learning code. And then we also have a myriad of, like, TypeScript, Go, and Rust in some other places, as it calls for it. So these days, I split my time between some of the machine learning code in Python and some of the Rust code, for any of the things that just need to be wicked fast.

Brian:

Awesome. Cool. Learning Rust is something on my to do list, and Nicole has a resource that we'll link called Yet Another Rust Resource. And it looks great. What was the goal of that?

Brian:

You said it was, trying to get people started really quickly?

Nicole:

Yeah. So when we were introducing Rust at work, I knew that, like, the traditional path to learning Rust was, here's a 1,000-page Rust programming language book, go read it, and I needed people to be up to speed a little bit quicker than that. So the goal of this course is to get you to be proficient enough to pair with someone who knows Rust after just a few days, and then kind of take it from there to deepen your learning later. But, generally, just kinda reduce the intimidation factor and get you running.

Brian:

Proficient enough to pair. Does your organization do pair programming?

Nicole:

We do it kind of on an ad hoc basis, mostly when, like, situations call for it or we have an interesting problem. But as a remote first organization, we've never really settled into a very strong pattern of doing it regularly.

Brian:

Okay. Yeah. That's kind of my comfort level with pair programming anyways, on an as-needed basis. Cool. So I ran across your blog post called Too much of a good thing: the trade-off we make with tests.

Brian:

And there's a lot of stuff it talks about, but one of them is balancing risk mitigation with, basically, how much you wanna test, that sort of thing. So can you introduce the topic for us, really about testing and coverage and whatnot?

Nicole:

Yeah. I mean, I think, like, often as software engineers, we don't have a very nuanced discussion about testing because a lot of it comes down to, like, product pushes back on testing and wants a faster schedule. Engineers push back on a faster schedule and want more testing. And this is kind of calling for taking a step back and thinking about, like, what is the actual reason that we're doing these tests, and what are we trying to get out of them? And, like, is there a point of diminishing return?

Nicole:

How do we know how much testing is the right amount of testing? And, like you mentioned, I mentioned code coverage in there. Like, if we're measuring it, what's the point of that, and, like, should we track it? So, yeah, I just kinda went through a number of things. But, yeah, I don't know if you wanna dive into anything in particular from there.

Brian:

Well, let's hit the big one that's contentious sometimes: code coverage. Do you have an opinion around code coverage? Should we measure it?

Nicole:

Yeah. So the first thing I did at my internship, my first job as a software engineer of any flavor, was adding code coverage for the team that I was working on and then increasing their code coverage throughout the summer. So, like, I got started in a culture that was very pro code coverage, and then over time, I started to notice that when you have this environment where you track code coverage, there's this tendency to push it up uncritically, and you start to do really funny things. Like, if I refactor something and reduce the lines of code in it, and that code was hit by a lot of the tests, then you've now reduced the code coverage percentage because you made something smaller. So if you have something like a code coverage ratchet, then you start to have funny side effects of, like, okay,

Nicole:

I refactored this thing, but now I need to go add tests somewhere else so I don't reduce the coverage number. And that's where I think, like, it comes down to the incentives.

Brian:

That one, like, threw me, and I'm like, well, that's a really good point. So I took it to an extreme with a mental exercise. Let's say I've got, what, a hundred lines of code with two 50-line functions, and I've got a couple of tests around that. And for some reason, let's say I only have 50% coverage, so I'm covering 50 lines of code. If I refactor the covered function and make it tighter, make it, like, 40 lines of code, and then add 10 to the other one, I've suddenly, like, shifted, and I've got, like, 40% code coverage even though I haven't made things worse.

Brian:

I've actually just tightened up the code. It's a weird thing to think about. Like, if you're reducing the number of lines of code, you're gonna change the coverage percentage. Right?
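A quick sketch of the arithmetic Brian is describing, with the hypothetical line counts from his example (coverage is just covered lines divided by total lines):

```python
# Two functions, one hundred lines total, and only one of them is exercised by tests.
covered, total = 50, 100
print(f"{covered / total:.0%}")   # 50%

# Refactor: the tested function shrinks to 40 lines, the untested one picks up 10.
covered, total = 40, 100
print(f"{covered / total:.0%}")   # 40% -- a lower number, even though nothing got worse
```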

Nicole:

Yeah. Yes.

Brian:

And we want that. We want people to make tighter code. But, anyway, if you don't have 100% coverage, then the number is hard to deal with, I guess. For my personal projects, anything that I have complete control over, I am a 100% code coverage kind of person, but I use the exceptions liberally.

Brian:

Like, if I'm using third party code, or even if I've vendored in some code, I know the lines of code that I'm using, and I'm not gonna try to test everything else. So I do have specifics, like, this is the code I know I'm using, and I want 100% coverage of that. So you were on that coverage project earlier, and I guess you brought up the fact that maybe the number, and trying to reach 100%, isn't that great. But, you know, culturally, where do you stand now?

Brian:

How do you use it?

Nicole:

Yeah. So right now, I don't really use code coverage. I think that it could be a good tool to, like, see integrated alongside a code change. Like, is the code that this change is affecting covered by tests? Which I think would be really useful.

Nicole:

So I think, like, contextually, what code is covered is to me more important than the raw number. But on projects I work on, we just haven't, like, taken that leap; we look more at end to end signals, I guess, of correctness and performance and things. But, yeah, a friend was also telling me last week about a project he was on where the manager of the team wanted 100% code coverage. But it was a React app, and they had styled-components in it, which meant that getting your code coverage to 100% required every single line of CSS to be covered as well. It's like, what's the meaningful test here?

Nicole:

And that was also, like, a weird pattern where, like you were saying, exceptions have to be made for either third party things or things that might not make sense to test in this context. Or I would even argue, like, CSS things: you could write a test around that, but I don't think it's a particularly high value test, and your effort may be better spent somewhere else. Yeah.
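For anyone curious what "using the exceptions liberally" looks like in practice, coverage.py has hooks for exactly that; here's a minimal sketch (the function name is made up):

```python
# Lines or blocks tagged this way are left out of coverage.py's report entirely.
def windows_only_fallback():  # pragma: no cover
    """Only reachable on a platform the test machine isn't, so don't count it."""
    ...
```

Whole files, like vendored third-party code, can be skipped the same way with the `omit` setting under `[tool.coverage.run]` in `pyproject.toml` or in `.coveragerc`.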

Brian:

And I gotta caveat my opinions with: I don't test any front end stuff. I kinda always test from the API down. The other thing is, the only thing I use code coverage for is coverage based on behavioral end to end tests. Code coverage on unit tests isn't that interesting to me, because of some of the abuses that you've probably seen. There can be a piece of code that can't even be reached by the system, but you can write a unit test for it, and you'll never know to delete it because it's being covered.

Brian:

Right? So

Nicole:

Yeah. So in those end to end tests, how do you get coverage on, for example, the error paths when something exceptional happens, a downstream service is broken, or a really strange error happens?

Brian:

In those cases, I think those particular parts are where the concept of mocking or something sounds good. I don't use mocking on any of my professional projects, but I'll design into the architecture a system where we can simulate all the failures that we wanna be able to cover. So it's, like, in the case of an Internet service or something, an equivalent of the different error codes that we expect to be able to get, and recover from gracefully.

Brian:

Those are things that you wanna know. That's not really an error condition. That's a behavior. Right? Mhmm.

Brian:

That you want your system to handle. So, yeah, it's gotta be tested. But some error conditions are really hard to get naturally without forcing the hand.
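Brian doesn't walk through his simulation setup on the show, but a rough sketch of the general idea in pytest could look like this, with every name here invented for illustration: a fake upstream service that tests can tell how to fail, so the error-handling behavior gets exercised on purpose.

```python
import pytest

class FakePaymentService:
    """Hypothetical stand-in for an upstream dependency; tests choose how it fails."""
    def __init__(self, fail_with=None):
        self.fail_with = fail_with            # e.g. an HTTP-style status code

    def charge(self, amount):
        if self.fail_with is not None:
            return {"ok": False, "status": self.fail_with}
        return {"ok": True, "status": 200}

def checkout(service, amount):
    """Code under test: should degrade gracefully instead of blowing up."""
    result = service.charge(amount)
    return "thanks!" if result["ok"] else "please try again later"

@pytest.mark.parametrize("status", [429, 500, 503])
def test_checkout_survives_upstream_errors(status):
    service = FakePaymentService(fail_with=status)
    assert checkout(service, 10) == "please try again later"
```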

Brian:

So how do you deal with it?

Nicole:

I don't, really. Like, what we do is testing in smaller units for the things that are handling upstream services, using mocks where we can. And then that's a lot of, like, hope that your mock interface and your actual thing return similar responses. And in Python, or dynamically typed languages, this is a sore point for me, because I don't have a type system telling me that my mock is the same thing. But other than that, I mean, it comes down to that value trade off of how likely do you think this is, and if it does happen, how much do you care?

Nicole:

So, like, if the service that you're relying on historically has been up the vast majority of the time, then maybe it's okay to assume that and, like, swallow a 500 error on the rare occasion it goes down, just depending on how much effort it is to actually test that. I think that's, like, a case by case basis.

Brian:

So it depends on the service, right, or what you're building.

Nicole:

Yeah. In those cases, what I would like is alerting if that's happening. Rather than, like, a test that, okay, we recover from it, just an alert that, oh, we're getting a lot of 500s from this critical service, maybe someone should wake up and look at that.

Brian:

You brought up, like, the problem with mocks and, basically, API drift, or something where your mock doesn't match. And I can't remember the keyword, but there is a way, at least with the Python unittest.mock library, to say, make it match this API, and it should always match it or something. But I can't remember what that's called.

Nicole:

Oh, that's great. I'll definitely look into that.

Brian:

That feature that I was trying to think of during the interview, of course, is the autospec feature of mock.
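For reference, autospec makes a mock copy the real object's API, so drift between the two shows up as a test failure instead of a silently passing test. A minimal sketch with a made-up class:

```python
from unittest.mock import create_autospec
import pytest

class BillingClient:
    """Hypothetical client; the real one would talk to the network."""
    def charge(self, amount, currency="USD"):
        ...

def test_autospec_catches_api_drift():
    fake = create_autospec(BillingClient, instance=True)
    fake.charge(10)                       # fine: matches the real signature
    with pytest.raises(TypeError):
        fake.charge(10, "USD", "oops")    # wrong number of arguments is rejected
    with pytest.raises(AttributeError):
        fake.refund(10)                   # method that doesn't exist on the real class
```

The same behavior is available via `mock.patch("some.module.BillingClient", autospec=True)` when patching in place.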

Brian:

The other thing that I wanted to bring up, I guess, was the risk part: we're testing because we wanna mitigate risk, right? I think that's why. So how do you decide what to test and what not to test, then?

Nicole:

Yeah. I mean, a very clear example of this from my professional experience is our platform at work. If a particular portion of it that people are interacting with live goes down and we don't recover in, like, under a minute, then that's potentially a lot of money lost for both us and for our customers. Whereas, like, the analysis features, it's really nice if they're up, but it doesn't matter quite as much if they go down because people can wait. It's not a timely thing. And that's where we made a deliberate trade off at one point that this slice of features is the slice where we really, really need to know if anything is going to break it, and that's where we devoted most of our efforts for performance testing and just robustness testing for all of the changes in it. And then you have a lot more small bugs slip through in the other parts that aren't necessarily as impactful.

Nicole:

And, obviously, in, like, a monolith, you're gonna have interplay between the different parts. You can't isolate them perfectly, but we were able to target our efforts based on, like, if there's a major bug, which part of this is most impactful to the business? And that's where we started.

Brian:

I think that's a great way to think about it. Also, one of the things you brought up was performance testing, especially in, like, end user services.

Nicole:

Mhmm.

Brian:

Performance is important because if it's too slow, people will just assume it's broken. Right? But performance is tough because, like, it's a little wiggly. How do you deal with that?

Nicole:

Yeah. I mean, it's really wiggly, and it also depends so highly on your workload. So if your workload is not realistic, your performance test isn't really testing anything for you. And, like, it'll give you broad-strokes direction, but not a lot of useful information. So this is one I have another blog post about.

Nicole:

It's called, why is load testing so hard? But the crux of it, I think, is just that it has to match what your actual behavior is, and also you don't know what the actual workload is gonna be until you've deployed it into production. You can guess how people are gonna use it, but you can't get that real workload until it's in production. So there's always some mismatch there. But, like, ultimately, you have to try to simulate, in an ideal case end to end for the entire system, the workload that users are gonna put on it. Because if you test different services in isolation, you're not capturing the interplay and, like, nonlinear interactions between the different components of the system.
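As a sketch of what "simulate the workload end to end" can look like from Python, here's a tiny Locust scenario. Locust isn't mentioned in the episode; it's just one common load-testing tool, and the endpoints and weights below are made up. The point is that the mix of requests should mirror what monitoring says real users do.

```python
from locust import HttpUser, task, between

class ParticipantUser(HttpUser):
    """Hypothetical user model for a load test."""
    wait_time = between(1, 5)    # seconds of think time between requests

    @task(10)                    # assumed to be ~10x more common than posting
    def view_conversation(self):
        self.client.get("/api/conversations/123")

    @task(1)
    def post_response(self):
        self.client.post("/api/conversations/123/responses", json={"text": "hello"})
```

You'd run it with something like `locust -f loadtest.py --host https://staging.example.com` and compare the traffic shape against production monitoring.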

Brian:

So do you utilize monitoring to try to figure that out?

Nicole:

Yeah. So we like to look at our monitoring to make sure that the actual behaviors we're seeing match what we're doing in our testing for performance stuff. There's some really interesting research out there, I think, from telecom companies in particular, that I read when I was starting this project a few years back, where they were talking about actually generating synthetic workloads automatically from monitoring.

Brian:

Interesting.

Nicole:

As far as I know, it has not been put into practice outside telecoms. This is also really expensive. So circling back to the risk discussion, unless you have a whole lot of money riding on the line, if your system goes down, it's just not worth it. Whereas, like, if you're a telecom, you're an emergency service for the whole country, so you better stay up and throw money at it to make sure you do.

Brian:

Yeah. Also, one of the things we've kind of hinted at so far, but you brought up directly in the article, is the trade-off between the cost of some downtime, or the cost of a service breaking, versus the cost of writing tests. And then also, one of the things that I'm very cognizant of is the cost of maintaining tests. Because it always feels good to get, you know, good coverage, good behavior coverage, and a large suite of tests.

Brian:

You feel comfortable, but that large suite of tests is also kind of a beast to turn if you have to refactor or things change. You have to maintain test code just like you maintain the rest of your code.

Nicole:

Yeah. And I think that's where, when you were saying you're aiming for 100% code coverage on, like, the end to end testing, you have a little bit better time there with refactoring, because if you change internal stuff, you're not gonna break the tests as much; they're not as tightly coupled to it. Whereas if you have really high coverage on unit tests, it's so tightly coupled to the actual structure of the code that refactors do get very into the weeds on changing the tests. And another cost of those that I didn't put in the article is just the time it takes to run them.

Nicole:

Like, as you get more and more tests, you're either gonna pay for more compute to run them faster, or you're gonna wait longer, and that gets really frustrating.

Brian:

Yeah. And it's interesting with even little tiny things like, you know, Python libraries or a pytest plugin or some little extra feature. We've kind of gotten lazy. I think some of us have gotten a little bit lazy with CI and say, well, you know, it's okay if it's just a few minutes. But then, like, I'm running it on six different versions of Python on three or four different hardware platforms

Nicole:

Mhmm.

Brian:

And that multiplies it out. And even if those run in parallel, that's a lot of compute power when sometimes it doesn't matter. Like, I've seen Python libraries that are tested over, like, a ton of versions of Python, and they're not utilizing that. They don't really need to. They could pin, like, the upper and lower and probably be fine.

Brian:

There's some risk and benefit there. And also, I mean, if you really aren't hardware specific, I don't think you need to run on multiple hardware platforms all the time. There are a lot of pure Python libraries that are tested like that that I don't think need to be.

Nicole:

Yeah. And you can have different configurations for different changes. So, like, as you have changes come in, maybe you wanna test them on just the oldest and newest. But then when you cut a major release, or just periodically, test on more, so that you catch those rare cases but you don't have to do it every single time.
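One way to express that kind of tiered matrix from Python is with nox (not a tool either of them names on the show, just one option; the interpreter list is illustrative):

```python
# noxfile.py -- sketch of a tiered test matrix
import nox

@nox.session(python=["3.9", "3.13"])        # every change: oldest and newest supported
def tests(session):
    session.install("-e", ".", "pytest")
    session.run("pytest")

@nox.session(python=["3.9", "3.10", "3.11", "3.12", "3.13"])   # nightly or pre-release
def tests_full(session):
    session.install("-e", ".", "pytest")
    session.run("pytest")
```

CI can then run `nox -s tests` on every pull request and `nox -s tests_full` on a schedule or before cutting a release.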

Brian:

So, just out of curiosity, and you don't need to share if you don't want to, obviously, but what would you consider a short test suite, and what's a long test suite? What's kinda too long?

Nicole:

I think if I can get up and go make a coffee, it is probably too long. So I would say, like, 5 minutes is too long. But, like, realistically, for me, it also depends on whether my ADHD medication is in effect or not. Because if it's not in effect, then if it's not done before I look away from the terminal, I'm somewhere else. It doesn't matter how long it takes.

Nicole:

But if it is in effect, I can sit there and wait a couple minutes for it. So I think, like, single digit minutes is pretty reasonable. Double digits is, like, this has a major effect on your team and your productivity. What about you?

Brian:

Well, okay. So I don't have a lot of options there, since I'm working with a lot of hardware stuff daily. But the thing that we do is try to modularize our tests so that a particular test module or test directory or something can be worked on, and that bit runs in under just a few minutes. So that, like you said, in that development workflow, if you're working in this area, you shouldn't have to wait for 10 minutes. A few minutes is even kinda long.

Brian:

So I'd love it to be under a minute for something that I'm working on on a day to day basis. But then once I think, oh, this is good, and I push it to merge, I'm okay with, like, you know, 10, 15 minutes if necessary in CI, because I probably caught it locally anyway, so the CI is really just having my back in case I broke something that I didn't mean to, things like that. The multiple layers, I think, are good, to be able to say, hey, the development workflow needs to be fast, but we also need to test thoroughly as well. And I think anybody listening that thinks, like, even a minute is way too long because you need to be able to test with every keystroke—

Brian:

That's crazy. And I don't think people should have to worry about that. But maybe, I don't know, maybe Rust people can because Rust is so fast.

Nicole:

Yeah. I mean, in Rust, your tests can be super fast. You're just waiting for the compile time before you can run them.

Brian:

Oh, right. Okay. Yeah. I forget that it's compiled. So

Nicole:

It's compiled, and it's also, like— have you run Go?

Brian:

Yeah.

Nicole:

So Go's compiles are incredibly, incredibly fast. And have you done C++?

Brian:

Yeah.

Nicole:

Okay. So those are pretty slow, and then Rust is like, okay, we're waiting. We're waiting a while. It's not fast.

Nicole:

They're making efforts to get it faster, but that's definitely one of the sore points for Rust. And, like, one of the advantages of an interpreted language like Python is, yeah, I can just run it, and it's there.

Brian:

Yeah. One of the cool things about Python is it's compiled anyway, but nobody realizes it because we just don't see it.
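For the curious: CPython does compile source to bytecode before running it, you just never type a compile command. A quick illustration:

```python
import dis

def add(a, b):
    return a + b

# The function was compiled to bytecode when it was defined;
# dis just displays the result of that hidden compile step.
dis.dis(add)
```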

Nicole:

Yeah. I mean, that leads to a really good question: what does it mean to be compiled? Because, to me, even if you have bytecode, I think it's the motions you go through that make a meaningful difference. It's like, do I run a script directly, or do I have a separate compile step?

Brian:

Yeah. So do you have a compile step with Go?

Nicole:

It's up to you. You can certainly run scripts without it, but you can also have it pump out a binary that you can then run separately.

Brian:

I guess, well, like, the only thing I run is something already built. I run Hugo, which is built with Go, but I don't actually compile Hugo. I just run it.

Nicole:

Yeah. So the compile stuff there, like, you can do go run, I think it is, it's been a while since I ran it directly, which will compile the source and then run it. Or you can do a separate compile, get the distributable binary, and ship it to someone, and then they can run it.

Brian:

But one of the lovely things about all these things that are fast is they're helping out Python. Especially Rust is helping make Python faster, which is neat.

Nicole:

I haven't had the opportunity to use Rust across the, like, Python boundary, but it's really cool, and it warms my heart that we can do things more safely with a little bit less C in the world.

Brian:

Yeah. Well, okay. So hopefully I can agree with you at some point in the future. Half of my, you know, my paid gig is C++.

Brian:

So, I don't wanna throw out C++ altogether.

Nicole:

My condolences. I was traumatized by C++ in a previous job.

Brian:

I'm sorry. It is painful. You were asking me about compile times, and compile times are to the point now where, well, ours is pretty fast. I mean, relatively.

Nicole:

Mhmm.

Brian:

We can count it in minutes at least. So whatever.

Nicole:

Yeah.

Brian:

Anyway, well, Nicole, it's been lovely talking to you about testing. We're gonna link to, at least, let's see, Too much of a good thing. You also brought up Why is load testing so hard. We'll totally link to that.

Brian:

And then also your Rust tutorial. And, also, I can't wait to get started with this.

Nicole:

I think we can also drop in a link for Goodhart's Law, which we kinda, like, danced around with code coverage being a bad measure, but that makes it explicit why.

Brian:

Yeah. The only thing I wanted to throw in is, one of the reasons why I kind of taunt with the whole 100% code coverage is that I utilize it mostly to find out what code to delete. My favorite way to get increased coverage is to remove code that can't be reached.

Nicole:

Yeah.

Brian:

So anyway

Nicole:

Yeah. I love that.

Brian:

But people freak out. Like, when I delete code, people are like, but we need that. Like, prove to me that we need that, and I'll put it back.

Nicole:

Also, I hope it's in version control. So if you ever need it, it's still there.

Brian:

Yeah. Exactly. That's why we have it. Oh, yeah. I cannot stand to see commented out code.

Brian:

We might need this later. We'll get it later if we find we need it. Don't comment out code. I mean, for a short period of time it's fine, but it's cringey. So, okay.

Brian:

Thanks a ton, Nicole, and, it was good talking with you.

Nicole:

Yeah. You as well. Thanks so much for having me on. This was a lot of fun.

Creators and Guests

Brian Okken
Host
Software Engineer, also on Python Bytes and Python People podcasts

Nicole Tietz-Sokolskaya
Guest
Nicole is a software engineer and writer. She works as a principal software engineer at Remesh where her main responsibilities are performance, security, and backend systems using Python and Rust. She writes frequently about myriad topics on her blog. Outside of computers, she's kept busy with her family and all of life's other responsibilities.