> you should delete tests for everything that isn't a required external behavior
Wait, I'm terribly confused here.
Aren't a huge part of tests to prevent regression?
In attempting to fix a bug, that could cause another "internal" test to fail and expose a flaw in your bugfix that you wouldn't have caught otherwise. And it's not uncommon for a flawed bugfix not to cause any "external" test to fail, because it's related to a code path that never had a good enough external test in the first place -- which is why the bug existed.
I can't imagine why you would ever delete tests prematurely. I mean, running internal tests is cheap. I see zero benefit for real cost.
And not only that, when devs don't document the internal operation of a module sufficiently, keeping the tests around serves as at least a kind of minimal reference of how things work internally, to help with a future dev trying to figure it out.
If you're refactoring an implementation, then obviously at that point you'll delete the tests that no longer apply, and replace them with new ones to test your refactored code. But why would you delete tests prematurely? What's the benefit?
> In attempting to fix a bug, that could cause another "internal" test to fail and expose a flaw in your bugfix that you wouldn't have caught otherwise.
If an external test passes and an internal test fails, the external test isn't really adding any value, is it? And if the root of your issue is "What if test A doesn't test the right things", doesn't the whole conversation fall apart (because then you have to assume that about every test)?
IME this is a common path most shops take. "We have to write tests in case our other tests don't work." That's a pretty bloated and wildly inefficient answer to "our tests sometimes don't catch bugs." Write good tests, and manage and update them often. Don't write more tests to accommodate other tests being written poorly.
> I mean, running internal tests is cheap.
Depends on your definition of cheap, I guess.
My last job was a gigantic rails app. Over a decade old. There were so many tests that running the entire suite took ~3 hours. That long of a gap between "Pushed code" and "See if it builds" creates a tremendous amount of problems. Context switching is cost. Starting and unstarting work is cost.
I'm much more of the "Just Enough Testing" mindset. Test things that are mission critical and complex enough to warrant tests. Go big on system tests, go small on unit tests. If you can, have a different eng write tests than the eng that wrote the functionality. Throw away tests frequently.
I understand what you're saying, but in my experience that's not very robust.
I've often found that an internal function might have a parameter that goes unused in any of the external tests, simply because it's too difficult to devise external tests that will cover every possible internal state or code path or race condition.
So the internal tests are used to ensure complete code coverage, while external tests are used to ensure all "main use cases" or "representative usage" work, and known frequent edge cases.
That doesn't mean the external tests aren't adding value -- they are. But sometimes it's just too difficult to set up an external test to guarantee that a deep-down race condition gets triggered in a certain way, but you can test that explicitly internally.
It's not that anyone is writing tests poorly, it's just that it simply isn't practically feasible to design external tests that cover every possible edge case of internal functionality, while internal tests can capture much of that.
And if your test suite takes 3 hours to run, there are many types of organizational solutions for that... but this is the first I've ever heard of "write fewer tests" being one of them.
> I've often found that an internal function might have a parameter that goes unused in any of the external tests,
It seems that you're still thinking about "code". What if you thought about "functionality"? If an external test doesn't test internal functionality, what is it testing?
> But sometimes it's just too difficult to set up an external test to guarantee that a deep-down race condition gets triggered in a certain way, but you can test that explicitly internally.
I would argue that if you're choosing an orders of magnitude worse testing strategy because it's easier, your intent is not to actually test the validity of your system.
> while internal tests can capture much of that.
We can agree to disagree.
> And if your test suite takes 3 hours to run, there are many types of organizational solutions for that... but this is the first I've ever heard of "write fewer tests" being one of them.
I was speaking about a real scenario that features a lot of the topics that you're describing. My point was not that it was good, my point was that testing dogmatism is very real and has very real costs. To describe writing/running lots of (usually unnecessary) tests as "cheap" is a big red flag.
Not the poster you replied to, but I've been thinking of it lately in a different way. Functional tests show that a system works, but if a functional test fails, the unit test might show where/why.
Yes, you'll usually get a stack trace when a test fails, but you might still spend a lot of time tracing exactly where the logical problem actually was. If you have unit tests as well, you can see that unit X failed, which is part of function A. Therefore you can fix the problem quicker, at least for some set of cases.
Internal piece A has 5 states, piece B has 8 states.
Testing them individually requires 13 tests.
Testing them from the outside requires 5×8 = 40 tests.
Now, if you think of it that way, maybe you _do_ want to test the combinations, because that might be a source of bugs. And if you do it well, you don't actually need to hand-write 40 tests; you have some mechanism to loop through them.
But the basic argument is that the complexity of the 40 test-cases is actually _more_ than the 13 needed testing the internal parts as units.
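A minimal sketch of that counting argument, under the assumption of a hypothetical `combine` function standing in for the externally visible behavior; the 40 outside-in cases are generated by looping rather than written by hand:

```python
import itertools
import pytest

A_STATES = range(5)   # internal piece A: 5 states
B_STATES = range(8)   # internal piece B: 8 states

def combine(a, b):
    # hypothetical externally visible behavior built from A and B
    return a * 10 + b

# Outside-in: all 5 * 8 = 40 combinations, generated instead of hand-written.
@pytest.mark.parametrize("a,b", itertools.product(A_STATES, B_STATES))
def test_every_combination_round_trips(a, b):
    assert divmod(combine(a, b), 10) == (a, b)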
FWIW, my own philosophy is to write as much pure-functional, side-effect-free code that doesn't care about your business logic as possible, and have good coverage for those units. Then compose them into systems that do deal with the messy internal state and business-logic if statements that tend to clutter real systems, and ensure you have enough testing to cover all branching statements, but do so from an external-to-the-system perspective.
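A rough sketch of that philosophy, assuming a made-up discount/billing example: the pure core gets exhaustive unit coverage, while the thin shell that owns the messy state and I/O is exercised from outside the system (the `repo` and `billing` collaborators are hypothetical).

```python
# Pure, side-effect-free core: easy to cover exhaustively with unit tests.
def apply_discount(total_cents, loyalty_years):
    rate = 0.10 if loyalty_years >= 5 else 0.0
    return round(total_cents * (1 - rate))

def test_discount_kicks_in_at_five_years():
    assert apply_discount(1000, 4) == 1000
    assert apply_discount(1000, 5) == 900

# Thin imperative shell (hypothetical repo/billing collaborators): owns the
# messy state and I/O, tested from an external-to-the-system perspective.
def charge_order(repo, billing, order_id):
    order = repo.load(order_id)
    amount = apply_discount(order.total_cents, order.loyalty_years)
    billing.charge(order.customer_id, amount)
```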
I've got the impression that you are both talking slightly past each other.
At least my impression is that these "internal tests" you talk about are valid unit tests -- but not for the same unit. We build much of our logic out of building blocks, which we also want to be properly tested, but that doesn't mean we have to re-test them on the higher level of abstraction of a piece of code that composes them.
From that thought, it's maybe even a useful "design smell" you could watch out for if you encounter this scenario (in that you could maybe separate your building blocks more cleanly if you find yourself writing a lot of "internal" tests)?
Isn't the idea of unit testing forgotten here? The point is to validate the blocks you build and use to build the program. In order to make sure you've done each block right, you test them, manually or automated... Automated testing is just generally soooo much easier. If you work like that, and do not add tests after you've written large chunks of code, you should have constructed your program so that there's no overhead in the tests. An advanced test that does lots of setup and elaborate calculation generally isn't the test's fault; it's the code itself that requires that complexity to be tested.
Wanna underline here that system tests are slow, unit tests are fast.
That said, I agree that you should throw away tests in a similar fashion as you do code. When a test no longer makes sense, don't be afraid to throw it away, but keep enough left to define the function of the code, in a documenting way. Let the code/tests speak! :D
Imo the value of unit tests is partially as a record for others to see "hey look, this thing has a lot of its bases covered".
Especially if you're building a component that is intended to be reused all over the place, would anyone have confidence in reusing it if it wasn't at least tested in isolation?
If the test suite took hours, couldn't part of the problem be that a lot of those tests should have been more focused unit tests? With small unit tests and mocking, you could run millions of tests in 3 hours.
There were all kinds of problems with the test suite that could've been optimized. The problem was that there were too many to manage, and that deleting them was culturally unacceptable.
Lots of them made real DB requests. It's hard to get a product owner to justify having devs spend several months fixing tests that haven't been modified in 9 years.
If it can cause a regression, it's not internal. My rule of thumb is "test for regression directly", meaning a good test is one that only breaks if there's a real regression. I should only ever be changing my unit tests if the expected behavior of the unit changes, and in proportion to those changes.
A well-known case is the Timsort bug, discovered by a program verification tool. Also well known is the JDK binary search bug that had been present for many years. (This paper discusses the Timsort bug, and references the binary search bug: http://envisage-project.eu/proving-android-java-and-python-s...)
In both cases, you have an extremely simple API, and a test that depends on detailed knowledge of the implementation, revealing an underlying bug. Obviously, these test cases, when coded, reveal a regression. Equally obviously, the test cases do test internals. You would have no reason to come up with these test cases without an incredibly deep understanding of the implementations. And these tests would not be useful in testing other implementations of the same interfaces, (well, the binary search bug test case might be).
In general, I do not believe that you can do a good job of testing an interface without a good understanding of the implementation being tested. You don't know what corner cases to probe.
Using implementation to guide your test generation ("I think my code might fail on long strings") is fine, even expected. Testing private implementation details ("if I give it this string, does the internal state machine go through seventeen steps?") is completely different.
That's not what he's saying. He's saying the test should measure an externally visible detail. In this case that would be "is the list sorted". This way the test will still pass without maintenance if the sorting algorithm is switched again in the future. You can still consider the implementation to create antagonistic test cases.
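A sketch of that idea, with Python's built-in `sorted()` standing in for whatever sort implementation gets swapped in: the assertions only check the externally visible contract (the output is ordered and a permutation of the input), while the inputs are chosen antagonistically with the implementation in mind.

```python
import random
from collections import Counter

ANTAGONISTIC_CASES = [
    [],                                   # empty input
    [7] * 1000,                           # all elements equal
    list(range(1000)),                    # already sorted
    list(range(1000, 0, -1)),             # reverse sorted
    [random.Random(0).randrange(50) for _ in range(1000)],  # heavy duplication
]

def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

def test_sort_satisfies_external_contract():
    for case in ANTAGONISTIC_CASES:
        out = sorted(case)                    # stand-in for the sort under test
        assert is_sorted(out)                 # externally visible: output ordered
        assert Counter(out) == Counter(case)  # output is a permutation of input
```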
One of my colleagues helped find the Timsort bug and recently another such bug (might be the Java binary search, don't remember).
The edge case needed to show a straightforward version of that recent bug basically required a supercomputer. The artifact evaluation committee even complained.
So you can try to test for that only based on output. But it's gigantically more efficient to test with knowledge of internals.
this sounds like a case where no amount of unit testing ever would've found the bug. someone found the bug either through reasoning about the implementation or using formal methods and then wrote a test to demonstrate it. you could spend your entire life writing unit tests for this function and chances are you would never find out there was an issue. i'd say this is more of an argument for formal methods than it is for any approach to testing.
One doesn't need detailed knowledge of the implementation; it's enough to know that a given initial state creates invalid output, and then we can write a test for that. Though yes, having knowledge of the implementation allows you to define the state that produces the invalid result.
Fair enough. And how do you know, before causing a regression, whether your test could detect one? In other words, how can you tell beforehand whether your test checks something internal or external?
"External" functionality will be behavior visible to other code units or to users. If you have a sorting function, the sorted list is external. The sorting algorithm is internal. Regression tests are often used in the context of enhancements and refactorings. You want to test that the rest of the program still behaves correctly. Knowing what behavior to test is specific to the domain and to the technologies used. You can ask yourself, "how do I know that this thing actually works?"
Isn’t the point that internal functions often have a much smaller state space than external functions, so it’s often easier to be sure that the edge cases of the internal functions are covered than that the edge cases of the external function are covered?
So, having detailed tests of internal functions will generally improve the chances that your test will catch a regression.
> Isn’t the point that internal functions often have a much smaller state space than external functions
That's the general theory, and why people recommend unit tests instead of only the broadest possible integration tests. But things are not that simple.
Interfaces do not only add data, they add constraints too. And constraints reduce your state space. You will want to cut your software at the smallest possible interface complexity you can find and test those pieces; those pieces are what people originally called "units". You don't want to test any high-complexity interface; those tests will harm development and almost never give you any useful information.
It's not even rare that your units are composed of vertical cuts through your software, so you'll end up with only integration tests.
The good news is that this kind of partition is also optimal for understanding and writing code, so people have been practicing it for ages.
I agree that they would help in the regression testing process, especially in diagnosing the cause. However, I think those are usually just called "unit" tests, not "regression" tests. For instance, the internal implementation of a feature might change, requiring a new, internal unit test. The regression test would be used to compare the output of the new implementation of the feature versus the old implementation of the feature.
Worth noting that performance is an externally visible feature. You shouldn't be testing for little performance variations, but you probably should check for pathological cases (e.g. takes a full minute to sort this particular list of only 1000 elements).
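A hedged sketch of what such a pathological-case check might look like, with `sorted()` standing in for the code under test; the input shape is a made-up example, not a known adversarial case, and the wall-clock bound is deliberately generous so the test checks for catastrophe rather than benchmarking.

```python
import time

def test_pathological_input_does_not_take_a_minute():
    data = list(range(1000, 0, -1)) * 10      # made-up "nasty" shape, 10k elements
    start = time.perf_counter()
    sorted(data)                              # stand-in for the code under test
    elapsed = time.perf_counter() - start
    assert elapsed < 1.0                      # generous bound, far below a minute
```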
For features, I need to take the time to think of required behavior. If I just focus on the implementation, the tests add no documentation and I'm not forced through the exercise of thinking about what matters.
> Aren't a huge part of tests to prevent regression?
Just a quibble: I would argue that a huge benefit of tests is preventing regression, but that's a very small part of the value of tests.
The main value I get out of tests is informing the design of the software under test.
* Tests are your straight-edge as you try to draw a line.
* They're your checklist to make sure you've implemented all the functionality you want.
* They're your double-entry bookkeeping to surface trivial mistakes.
But I think I mostly agree with your point. I delete tests that should no longer pass (because some business logic or implementation details are intentionally changing). I will also delete tests that I made along the way when they're duplicating part of a better test. If a test was extremely expensive to run, I suppose I might delete it. But in that case I would look for a way to cover the same logic in tests of smaller units.
All legitimate tests are[0] regression tests. TDD, to the extent that it's actually useful, is the notion that sometimes the bug being regression-tested is a feature request.
Edit: 0: I guess "can be viewed as" if you want to be pedantic.
> Aren't a huge part of tests to prevent regression?
Depends on the kind of tests. Old school "purist" unit tests are meant to help you verify the correctness of the code as you're writing it. Preventing regressions is better left to integration tests and E2E tests, or smoke tests. Alternatively to "unit tests" if your definition of "unit" is big enough (in which case it only works within the unit).
It's totally fine and common to write unit tests that are not meant to catch bugs of significant refactors. If you do it right, they should be so easy to author that throwing them away shouldn't matter.
Integration, E2E, and smoke tests are generally slow, flaky, and hard to write. They should not cover/duplicate all the cases your unit tests cover.
They are good at letting you know all your units are wired up and functioning together. In all the codebases I've ever worked in, I would feel way more comfortable deleting them vs deleting the unit tests.
Why would you want to, when the same unit test coverage will run in under a minute, the tests are smaller and easier to understand and change, and it can all be done on your laptop?
it all depends on your definition of unit/integration, what I am talking about as unit tests you may very well be talking about as integration tests...
one of the main points I was making is you shouldn't have significant duplication in test coverage and if you do, I'd much rather stick with the unit tests and delete the others.
> Unit tests are generally much harder to understand and need to be changed much more frequently.
Changed more frequently, yes.
Harder to understand is usually because they're not-quite-unit-tests-claiming-to-be.
Eg: a test for a function that mocks some of its dependencies but also does shenanigans to deal with some global state without isolating it. So you get a test that only tests the unit (if that), but has a ton of exotic techniques to deal with the globals. Worst of all worlds.
Proper unit tests are usually just a few lines long, with little to no abstraction, and they test code you can see in the associated file, without dealing with code you can't see without digging deeper.
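For illustration, the kind of test that description points at, with a made-up `slugify` helper: a few lines, no mocks, no hidden globals, testing code that sits right next to it.

```python
def slugify(title):
    return "-".join(title.lower().split())

def test_slugify_lowercases_and_collapses_whitespace():
    assert slugify("  Hello   World ") == "hello-world"
```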
If you can refactor (make a commit changing only implementation code, not touching any test code) and the tests still pass then you’re probably fine.
If you’re changing tests as you change the code you’re not refactoring. You have zero confidence that your changed behaviour and changed test didn’t introduce an unintended behaviour or regression.
if you can refactor without touching your tests and your tests still compile afterwards, either the refactor was extremely trivial and didn't change any interfaces, or you only had end-to-end tests.
I think the point is that if you have to change a test to make it pass or run after refactoring, it is not useful as a regression test. By changing it you might have broken the test itself so you have less confidence.
There is also the question of what a unit is. If you test (for example) the public interface of a class as a black box unit, you can refactor your class internals as much as you want and your tests don't need to change. You have high confidence you've done it correctly. At this point adding more fine-grained tests inside the class seems like more of a compliance activity than one that actually increases confidence, since you probably would've had to change a bunch of them to make them work again anyway.
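A toy sketch of that black-box idea, using a made-up `Stack` class: the tests only touch the public interface, so swapping the internal list for some other structure would not require changing them.

```python
class Stack:
    def __init__(self):
        self._items = []          # internal detail, free to change

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()

    def __len__(self):
        return len(self._items)

def test_pop_returns_the_most_recent_push():
    s = Stack()
    s.push("a")
    s.push("b")
    assert s.pop() == "b"
    assert len(s) == 1
```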
Personally the way I'd phrase it is you need to refactor your tests just like you'd refactor the app code, but even looking at doing that independent of any app code refactoring.
Agreed. I would take an even stronger position, and say that a high degree of mocking actually implies two things: First, yes, you're testing at too fine-grained a level. Second, it's a code smell that suggests you may be working with a fundamentally untestable design that relies overmuch on opaque, stateful behavior.
Mocks are worthwhile though. Otherwise you end up not being able to unit test anything that accesses an external API such as databases, REST services, etc.
IMO, the database is often an integral part of the program and should be part of the test (a real database in a Docker image).
For instance, if you are not relying on a unique constraint in the DB to implement idempotency you are probably doing something wrong, and if you are not testing idempotent behaviour you are probably doing something wrong.
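A minimal sketch of such an idempotency test, with an in-memory sqlite3 database standing in for the "real database in a Docker image" the comment describes; the payments table and idempotency key are made-up examples.

```python
import sqlite3

def record_payment(conn, key, amount_cents):
    # Idempotency comes from the DB's unique constraint: a retried insert
    # with the same key is swallowed instead of creating a duplicate row.
    try:
        conn.execute("INSERT INTO payments VALUES (?, ?)", (key, amount_cents))
    except sqlite3.IntegrityError:
        pass

def test_recording_the_same_payment_twice_is_idempotent():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments (idempotency_key TEXT PRIMARY KEY, amount_cents INTEGER)"
    )
    record_payment(conn, "order-42", 1000)
    record_payment(conn, "order-42", 1000)   # simulated retry
    (count,) = conn.execute("SELECT COUNT(*) FROM payments").fetchone()
    assert count == 1
```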
It really depends on your definition of unit. In the London school of TDD, no, a unit cannot extend across an I/O boundary. The classicist school takes a more flexible, pragmatic approach.
You mean fakes/stubs, right? Unless you're testing whether you're correctly implementing the protocol exchange with an external party, you don't need to record the API calls.
How do you test the tests that are testing your mocks? That said, verifying mocks are a great help -- they won't let you mock methods that don't exist on the real object.
Some mocking libraries, like the VCR library in Ruby, can be turned off every now and then so your tests hit real endpoints. It is worth doing from time to time.
Bertrand Meyer had the right of it, but I had to figure this out myself before I ever saw him quoted on the subject.
Me:
Code that makes decisions has branches. Branches require combinatoric tests.
Code with external actions requires mocks.
Therefore:
Code that makes decisions and calls external systems requires combinatorics for mocks.
Bertrand, more (too?) concisely:
Separate code that makes decisions from code that acts on them.
Follow this pattern to its logical conclusions, and most of your mocks become fixtures instead. You are passing in a blob of text as an argument instead of mocking the code that reads it from the file system. You are looking at a request body instead of mocking the PUT function in the HTTP library.
The tests of external systems are much fewer, and tend to be testing the plumbing and transportation of data. If I give you a response body do you actually propagate it to the http library? And even here, spies and stubs are simpler than full mocks.
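A sketch of what that separation can look like, with a made-up config format: the decision code takes a blob of text, so its test uses a string fixture instead of mocking the filesystem, and only the thin action code ever touches I/O.

```python
# Decision code: pure, no I/O.
def parse_retry_limit(config_text):
    for line in config_text.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "retries":
            return int(value.strip())
    return 0

def test_parse_retry_limit_from_a_string_fixture():
    fixture = "timeout=30\nretries=3\n"   # blob of text, no mocked open()
    assert parse_retry_limit(fixture) == 3

# Action code: thin plumbing that reads the file and hands it to the
# decision code; only this sliver needs a stub or an integration test.
def load_retry_limit(path):
    with open(path) as f:
        return parse_retry_limit(f.read())
```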
I used this strategy when developing a client library for a web socket API. It was hugely helpful. I could just include the string of the response in my tests, instead of needing a live server or even a mock server for testing. Tests were much simpler to write and faster to execute.
One would argue that you should change your string fixtures to match and verify that the new API response doesn't break anything with your existing API client. Then you change the API client and verify that all the old tests still work as expected.
Better yet is if you keep the old fixtures and the new fixtures and ensure that your API client doesn't suddenly throw errors if the API server downgrades to before the new field was added.
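A small sketch of keeping both fixture generations around, with made-up response bodies: the client's parsing code is run against the old shape and the new one, so a server downgrade can't surprise it.

```python
import json
import pytest

OLD_RESPONSE = '{"status": "ok"}'                          # before the new field
NEW_RESPONSE = '{"status": "ok", "region": "eu-west-1"}'   # after the new field

def parse_status(body):        # stand-in for the API client's parsing code
    return json.loads(body).get("status")

@pytest.mark.parametrize("body", [OLD_RESPONSE, NEW_RESPONSE])
def test_client_handles_old_and_new_response_shapes(body):
    assert parse_status(body) == "ok"
```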
Yes, you should delete tests for everything that isn't a required external behavior, or a bugfix IMO.
For the edification of junior programmers who may end up reading this thread, I’m just going to come right out and say it: this is awful advice in general.
For situations where this appears to be good advice, it’s almost certainly indicative of poor testing infrastructure or poorly written tests. For instance, consider the following context from the parent comment:
Otherwise you're implicitly testing the implementation, which makes refactoring impossible.
A big smell here is if the large majority of your tests are mocked. This might mean you're testing at too fine-grained a level.
These two points are in conflict and help clarify why someone might just give up and delete their tests.
The argument for deleting tests appears to be that changing a unit’s implementation will cause you to have to rewrite a bunch of old unrelated tests anyway, making refactoring “impossible.” But indeed that’s (almost) the whole point of mocking! Mocking is one tool used for writing tests that do not vary with unrelated implementations and thus pose no problem when it comes time to refactor.
Now there is a kernel of truth about an inordinate amount of mocking being a code smell, but it’s not about unit tests that are too fine-grained but rather unit tests that aren’t fine-grained enough (trying to test across units) or just a badly designed API. I usually find that if testing my code is annoying, I should revisit how I’ve designed it.
Testing is a surprisingly subtle topic and it takes some time to develop good taste and intuition about how much mocking/stubbing is natural and how much is actually a code smell.
In conclusion, as je42 said below:
Make sure your tests run (very) fast and are stable. Then there is little cost to pay to keep them around.
The key, of course, is learning how to do that. :)
Did you ever actually refactor code with a significant test suite written under heavy mocking?
The mocking assumptions generally end up re-creating the very behavior that causes the ossification. Lots of tests simply mock 3 systems to test that the method calls the 3 mocked systems with the proper API -- in effect testing nothing, while baking lower-level assumptions into tests for the people refactoring what actually matters.
You might personally be a wizard at designing code to be beautifully mocked, but I've come across a lot of it and most has a higher cost (in hampering refactoring, reducing readability) than benefit.
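For concreteness, a made-up example of the anti-pattern being described: three mocked collaborators, and assertions that merely restate the implementation line by line.

```python
from unittest.mock import Mock

def sync_user(directory, mailer, audit_log, user_id):
    user = directory.fetch(user_id)
    mailer.send_welcome(user)
    audit_log.record("synced", user_id)

def test_sync_user_calls_its_three_collaborators():
    directory, mailer, audit_log = Mock(), Mock(), Mock()
    sync_user(directory, mailer, audit_log, user_id=7)
    # Each assertion restates a line of the implementation, so the test only
    # fails when the code is edited, never when the behavior is actually wrong.
    directory.fetch.assert_called_once_with(7)
    mailer.send_welcome.assert_called_once_with(directory.fetch.return_value)
    audit_log.record.assert_called_once_with("synced", 7)
```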
Did you ever actually refactor code with a significant test suite written under heavy mocking?
I have. The assumptions you make in your code are there whether you test them or not. Better to make them explicit. This is why TDD can be useful as a design tool. Bad designs are incredibly annoying to test. :)
For example if you have to mock 3 other things every time you test a unit, it may be a good sign that you should reconsider your design not delete all your tests.
It sounds like your argument is “software that was designed to be testable is easy to test and refactor”.
I think a lot of the gripes in the thread are coming from folks who are in the situation where it’s too late to (practically) add that feature to the codebase.
You seem to think the rationale is testing performance; but from GP it seems that the rationale is avoiding the tests ossifying implementation details against refactoring rather than protecting external behavior to support refactoring.
> Mocking is one tool used for writing tests that do not vary with unrelated implementations
What if I chose the wrong abstractions (coupling things that shouldn't be coupled and splitting things in the wrong places) and have to refactor the implementation to use different interfaces and different parts?
All the tests will be testing the old parts using the old interfaces and will all break.
The issue that takes experience here is how to determine what's a unit. "The whole program" is obviously too big. "every public method or function" is obviously too small.
Even if your code never graduates to being used by multiple teams in your project or on others, “You” can turn into “you and your mentee” anyway, if you’re playing your cards right.
Every feature of the lexer should be testable through test cases written in the syntax of the language. That includes handling of bad lexical syntax also. For instance, a malformed floating-point constant or a string literal that is not closed are testable without having to treat the lexer as a unit. It should be easy to come up with valid syntax that exercises every possible token kind, in all of its varieties.
For any token kind, it should be easy to come up with a minimal piece of syntax which includes that token.
If there is a lexical analysis case (whether a successful token extraction or an error) that is somehow not testable through the parser, then that is dead code.
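As a concrete stand-in, Python's own parser illustrates the idea: both bad-lexical-syntax cases mentioned above are exercised purely through the public parse entry point, with no access to lexer internals.

```python
import ast
import pytest

@pytest.mark.parametrize("source", [
    "x = 1.2.3",       # malformed floating-point constant
    's = "unclosed',   # string literal that is never closed
])
def test_bad_lexical_syntax_is_rejected_via_the_parser(source):
    with pytest.raises(SyntaxError):
        ast.parse(source)
```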
The division of the processing of a language into "parser" and "lexer" is arbitrary; it's an implementation detail which has to do with the fact that lexing requires lookahead and backtracking over multiple characters (and that is easily done with buffering techniques), whereas the simplest and fastest parsing algorithms like LALR(1) have only one symbol of lookahead.
Parsers and lexers sometimes end up integrated, in that the lexer may not know what to do without information from the parser. For instance a lex-generated lexer can have states in the form of start conditions. The parser may trigger these. That means that to get into certain states of the lexer, either the parser is required, or you need a mock up of that situation: some test-only method that gets into that state.
Basically, treating the lexer part of a lexer/parser combo as public interface is rarely going to be a good idea.
For any token kind, it should be easy to come up with a minimal piece of syntax which includes that token.
There is the problem: any test of the lexer now has to reach down through the parser to get to it. The test is too far away from the point of failure. I'll now spend my time trying to understand a problem that would have been obvious if the lexer were being tested directly.
>Basically, treating the lexer part of a lexer/parser combo as public interface is rarely going to be a good idea.
This is part of the original point, the parser is the public interface which is why the OP was suggesting it should be the only contact point for the tests.
Lexer/Parsers are one of the few software engineering tasks I do routinely where it's self evident that TDD is useful and the tests will remain useful afterwards.
Indeed! I recall a lexer and parser built via TDD with a test suite that specified every detail of a DSL. A few years later, both were rewritten completely from scratch while all the tests stayed the same. When we got to passing all tests, it was working exactly as before, only much more efficiently.
From that experience, I would say that in some contexts, tests shouldn't be removed unless what it's testing is no longer being used.
If you have a good answer to that, then the lexer is separate (as others said). If you don't, then write parser tests for the lexer so that you can more easily refactor the interface between them.
There is no one right answer, only trade-offs. You need to make the right decision for you. (Though I will note that there is probably a good reason parsing and lexing are generally separated, and that probably means the best tradeoff for you is to keep them separate. But if you decide differently you are not necessarily wrong.)
I’ve watched this play out a few times with different teams and different code bases (eg, one team two projects).
Part of the reason existing tests lock in behavior and prevent rework/new features is that the tests are too complicated. Complicated tests were expensive to write. Expense leads to sunk cost fallacy.
I’ve watched a bunch of people pair up for a day and a half trying to rescue a bunch of big ugly tests that they could have rewritten solo and in hours if they understood them, learn nothing, and do the same thing a month later. The same people had no problem deleting simple tests and replacing them with new ones when the requirements changed.
Conclusions:
- the long term consequences of ignoring the advice of writing tests with one action and one assertion are outsized and underreported.
- change your code so it doesn't need elaborate mocks
- choose a test framework that supports setup methods
- choose a framework that supports custom/third-party assertions, sometimes called matchers. You won't use this often, but when you do, you really do. (A sketch of the last two points follows below.)
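A sketch of those last two points using pytest as one framework that fits (the cart example is made up): a fixture provides the shared setup, and a tiny custom assertion plays the role of a matcher, so each test keeps to one action and one assert.

```python
import pytest

@pytest.fixture
def cart():
    return {"apple": 2, "pear": 1}   # setup shared by the tests below

def assert_contains_exactly(mapping, **expected):
    # custom matcher: one readable assertion instead of a pile of asserts
    assert mapping == expected, f"expected {expected}, got {mapping}"

def test_adding_an_item(cart):
    cart["fig"] = 3
    assert_contains_exactly(cart, apple=2, pear=1, fig=3)

def test_removing_an_item(cart):
    del cart["pear"]
    assert_contains_exactly(cart, apple=2)
```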
> Otherwise you're implicitly testing the implementation, which makes refactoring impossible.
Red-green refactoring isn't, and shouldn't be, a goal of unit testing. Integration and E2E tests provide that. Unit tests are mostly about making sure the individual pieces work as you author them, as well as implicitly documenting the intent of those individual pieces.
If done properly, they're always quick/easy/cheap to author, and thus are throwaway. When you refactor significantly (more than the unit), you just throw them away and write new ones (at which point their only goal is for you to understand the intent of the code you were shuffling around, and making sure you're breaking what you expected to break). Delete, rewrite.
People are resistant to getting rid of unit tests when they did complex integration tests that took forever to write instead. So the tests feel like they were wasted effort. Those tests are totally valuable, in this case for things such as red green refactoring, but then yes, you have to carefully pick and choose what you're testing to avoid churn.
I would also test implementation details that are legitimately complicated and might fail in subtle ways, or where the intended behavior isn't obvious.
If I've implemented my own B+ tree, for example, you better bet your butt I'll be keeping some property tests to document and verify that it conforms to all the necessary invariants.
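In that spirit, a sketch of the shape such property tests take, with a bisect-maintained sorted list standing in for the hand-rolled B+ tree: random operations followed by explicit invariant checks.

```python
import bisect
import random

def check_invariants(keys):
    assert all(a <= b for a, b in zip(keys, keys[1:])), "keys must stay ordered"
    assert len(keys) == len(set(keys)), "keys must be unique"

def test_invariants_hold_under_random_insertions():
    random.seed(0)                       # keep the property test reproducible
    keys = []
    for _ in range(1000):
        k = random.randrange(10_000)
        if k not in keys:
            bisect.insort(keys, k)
        check_invariants(keys)
```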