Integration tests are the best

This is a follow-up post to the code entropy post. It's going to focus on test granularity and on organizing code to scale teams. A few counterarguments came up to the idea of promoting hierarchically organized, tightly scoped code.

First, people seem nervous about letting go of the idea of splitting things up so that everything can be built and tested and reviewed in isolation with "well defined contracts" between components.

The second argument I heard was that yeah, maybe reducing entropy and dependency lengths makes the code more optimal in a theoretical sense, but that slight theoretical gain is not worth sacrificing other, more practical concerns, such as having an architecture that works for larger teams of engineers. I'll also address this.

There are not necessarily super clear black and white answers to these concerns. All patterns being discussed are valid in some contexts. The problem is when they are applied blindly to the wrong contexts. As usual, one important goal is to avoid overfitting on a particular pattern.

Letting go of breaking code into smaller units

The first counterpoint directly ties into debates about unit tests versus integration tests. Inlined logic basically can't be tested in small units and is forced to be tested as part of larger integrations.

I'm not saying testing parts in isolation is bad. It's often good. But it's important to notice that there are real tradeoffs involved.

This image has been making the social media rounds lately:

It's a bit of a Rorschach test of engineering philosophy, but in this context you can see, on the left, a design made of separate components, with the benefits of being able to replace and test each part in isolation. On the right there's a 3D-printed design where components are nested inside each other, removing the need for interconnections and simplifying and streamlining the design to its essential parts. It's 21% more powerful and likely more reliable, but it sacrifices things like repairability and the ability to test components in isolation or design them independently of each other.

It's my understanding that, although it may look like it, components (except for connections) were not removed in Raptor 3; they were simply nested and embedded in each other. The tradeoffs are quite visible.

Now, this is mechanical design; maybe, you could say, it doesn't apply as a metaphor for software. And it's true that things are a bit more subtle with software.

But you could also argue that nesting things is even more important in software, because otherwise there tends to be more premature reuse of components. Imagine Raptor 1, but where the outer component designs are automatically reused in other rocket engines, so that when a design is changed to fit the needs of one engine, other engines are immediately affected in unpredictable ways.

One commonly cited reason for testing smaller units in isolation, with well-defined contracts and values as test boundaries, is that as you combine components, the states multiply. There is a combinatorial explosion of possible states, and it's very difficult to test every state, or even a significant portion of them.

Testing smaller units reduces the number of potential states, so you can cover a higher proportion of each unit's states. It also removes some of the redundancy you would otherwise get from re-testing a unit under many different configurations of upstream component states. Test performance is also often better.

This is true and is one of the reasons why it's sometimes worth testing units in isolation.
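
As a rough illustration of the counting argument (the component names and numbers below are made up), exhaustive coverage of combined behavior grows with the product of each component's states, while isolated unit coverage grows only with their sum:

```python
# Rough illustration of the state-explosion argument, with made-up numbers.
from math import prod

# Hypothetical counts of meaningful input conditions for three components.
component_states = {"parser": 4, "pricing": 6, "notifier": 3}

isolated_cases = sum(component_states.values())   # 4 + 6 + 3 = 13 unit-level cases
combined_cases = prod(component_states.values())  # 4 * 6 * 3 = 72 combinations

print(isolated_cases, combined_cases)  # 13 72
# Unit tests keep the case count near the sum; exhaustive integration coverage
# grows with the product, which is why integration tests have to sample
# representative user scenarios instead of enumerating every state.
```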

However, this has to be weighed against significant downsides.

In the last post, I mentioned briefly the challenge of keeping tests representative of the users' concerns and not overfitted to irrelevant details.

In complex domains there is, regardless of the approach, a vast number of states to test, and it's difficult to cover more than a small fraction of them. Even high line-of-code coverage is just the beginning: have you tested every line with enough different values?

Unless you have truly massive engineering resources, you'll be forced to test a representative sample of users' concerns. As mentioned in my last post, you can maximize the benefits of these tests by building good abstractions such that those tests interpolate and generalize well to other untested or unknown values.

Breaking code out into small units can sometimes prevent building those well fitted abstractions.

Integration tests let you improve internals and more easily refactor how smaller units fit together, without having to get rid of or rewrite tests, and with the safety of having those representative tests in place on the outside. This promotes evolution of the internals.

With representative integration tests, if one sub-component changes in a way that affects what it sends to another sub-component, and that second sub-component didn't have test coverage for those new values, the new values will automatically flow through the second sub-component. The integration test can catch cross-component regressions automatically. With unit tests, you might have to manually add coverage to the second unit, the one you are not working on, to reflect the new values the first component will be sending it. I find it's very rare that an engineer bothers to do that, so regressions in cross-feature interactions are often missed.

That is, when you test smaller pieces, you have to define the inputs and outputs of each small piece, which is potentially many more things to define. It's sometimes (not always) easier to define the input and output values of larger units and let the values flow through, automatically exercising multiple smaller units. Then, when the tests are adjusted to reflect new realities, all the inner pieces' inputs are adjusted automatically.
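
Here's a minimal sketch of that flow, with hypothetical components (the names and numbers are made up, not from any particular code base). The integration test pins only the user-visible total, so whatever the first component produces flows straight into the second:

```python
# Minimal sketch (hypothetical components) of values flowing through an
# integration test instead of being pinned at each unit boundary.

def apply_discount(subtotal: float, loyalty_years: int) -> float:
    """First sub-component: compute the discounted subtotal."""
    rate = 0.10 if loyalty_years >= 2 else 0.0
    return subtotal * (1 - rate)

def add_tax(amount: float, tax_rate: float = 0.08) -> float:
    """Second sub-component: consumes whatever the first one produces."""
    return round(amount * (1 + tax_rate), 2)

def checkout_total(subtotal: float, loyalty_years: int) -> float:
    """The integration point users actually care about."""
    return add_tax(apply_discount(subtotal, loyalty_years))

def test_checkout_total_for_loyal_customer():
    # Asserts only on the user-visible total. If apply_discount later starts
    # producing new values (say a new loyalty tier), those values flow straight
    # into add_tax here; no separate unit-test file needs to be updated.
    assert checkout_total(100.0, loyalty_years=3) == 97.2
```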

Another thing to consider about tests is that asserting on something that users don't care about is a mistake. If that test fails later, it will be a false positive failure.

When there are multiple ways of implementing something and it doesn't matter to the user which one you pick, it's a good idea to avoid tests that assert on a specific way. Tests should represent what users care about. Otherwise the overfitted test suite gets noisy and code gets difficult to refactor and improve.

An overfitted test suite also means that it's simply a poor description of the requirements, a poor embodiment of our state of knowledge about the domain.

Smaller unit tests have more of a tendency to overfit to implementation details instead of user requirements.
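
To make the distinction concrete, here is a hedged sketch (all names are hypothetical) of a unit test overfitted to an internal helper versus a test that asserts on the behavior users actually see:

```python
# Hypothetical example of an overfitted test versus a behavioral test.
from unittest.mock import patch

class ReportBuilder:
    def _format_rows(self, rows):  # internal detail, free to change
        return [f"{name}: {total:.2f}" for name, total in rows]

    def build(self, orders: dict) -> str:
        rows = sorted(orders.items())
        return "\n".join(self._format_rows(rows))

def test_report_overfitted():
    # Pins the internal helper and its intermediate values. Renaming
    # _format_rows or inlining it breaks this test with no user-visible change.
    builder = ReportBuilder()
    with patch.object(ReportBuilder, "_format_rows", return_value=["a: 1.00"]) as spy:
        builder.build({"a": 1.0})
        spy.assert_called_once_with([("a", 1.0)])

def test_report_behavior():
    # Asserts only on the output users actually read; internals can evolve.
    assert ReportBuilder().build({"b": 2.0, "a": 1.0}) == "a: 1.00\nb: 2.00"
```

The first test keeps passing right up until someone improves the internals, at which point it fails without telling you anything a user would care about.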

Some say unit tests are beneficial for refactoring because the old unit tests can simply be discarded and rewritten. But isn't it better to have tests that can keep doing their part through a refactoring?

If you've been reading my other posts you've seen I like to judge code similarly to how we might evaluate ML models.

You can compare how code evolves to how neural networks are trained.

Neural nets are trained by presenting inputs and outputs to their outermost layers and gradually evolving all the internal weights using backpropagation to fit that outer data.

If, at some point during training, instead of letting the knowledge backpropagate from the outside, the weights of the inner layers were asserted to hold specific values, it might prevent the inner weights from evolving to their optimal values.
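
Here's a toy illustration of that metaphor (not a claim about real training setups): a two-weight linear "network" fit by gradient descent, where pinning the inner weight to an asserted value keeps the model from ever fitting the data.

```python
# Toy sketch: fit y = 4*x with y = w2 * (w1 * x), trained by gradient descent
# on squared error. Pinning the inner weight w1 prevents reaching a good fit.
DATA = [(1.0, 4.0), (2.0, 8.0), (3.0, 12.0)]

def train(freeze_inner: bool, steps: int = 2000, lr: float = 0.01) -> float:
    w1 = 0.0 if freeze_inner else 0.5   # "asserted" inner weight vs a free one
    w2 = 0.5
    for _ in range(steps):
        for x, y in DATA:
            h = w1 * x
            err = w2 * h - y
            if not freeze_inner:
                w1 -= lr * 2 * err * w2 * x   # gradient of err**2 w.r.t. w1
            w2 -= lr * 2 * err * h            # gradient of err**2 w.r.t. w2
    return sum((w2 * w1 * x - y) ** 2 for x, y in DATA)

print(train(freeze_inner=False))  # loss near zero: inner weight free to evolve
print(train(freeze_inner=True))   # loss stays large: inner weight pinned
```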

Neural networks nowadays are often used to learn abstract language models that are not unlike computer programs. This metaphor is rooted in something that is becoming reality.

Asserting on internals can lead to a kind of abstraction sclerosis.

Metaphorically, the code might get stuck in Raptor 1 design and be difficult to evolve into the streamlined Raptor 3.

Lack of flexibility to refactor and improve the code is also often blamed for ending up with a "big ball of mud" architecture.

To summarize the ML metaphor: test cases can be seen as a training set. They should be a good representation of the domain's requirements from the users' perspective, and by default it should be easy and safe to improve the internals with minimal changes to the tests, while still having those tests in place to catch regressions. Integration tests are best at this.

Another argument against smaller unit tests is that intermediate unit states are sometimes not even well defined. I've seen accounting software requirements where accounting firms don't agree on the best methodologies for intermediate calculations. As long as the final results are approximately correct, they will pass audits. Asserting on intermediate results here is a clear case of overfitting.
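
In that situation a test can assert on the audited, final number with a tolerance and leave the intermediate methodology free to change. A hedged sketch (the names, rates and rounding scheme are hypothetical, not taken from any real requirement):

```python
# Hypothetical sketch: assert on the final total with a tolerance instead of
# pinning intermediate allocations that firms may compute differently.
import pytest

def allocate_vat(line_items: list, vat_rate: float = 0.20) -> list:
    # One of several defensible intermediate methodologies (per-line rounding).
    return [round(x * vat_rate, 2) for x in line_items]

def invoice_total(line_items: list, vat_rate: float = 0.20) -> float:
    return round(sum(line_items) + sum(allocate_vat(line_items, vat_rate)), 2)

def test_invoice_total_is_approximately_correct():
    # The audit-relevant outcome; the per-line VAT split stays unasserted.
    assert invoice_total([19.99, 5.01, 12.49]) == pytest.approx(44.99, abs=0.02)
```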

You might notice that the arguments above could also apply to end-to-end tests. Why not emphasize those? Sometimes you'll want them, but while the goal here was to point out the downsides of too fine a granularity, there are also practical downsides to units that are too large, especially around test run times. As always, the goal is to achieve the best fit to needs, and integration tests tend to lie in that happy middle.

Weighing concerns of tightly nested components vs architecting for larger teams

The second counterpoint I heard was that having information-theoretically optimal code is not as important as other concerns, such as allowing for larger team sizes.

Now assessing the importance of each concern is not easy and a bit of a judgment call.

I think the benefits of reducing code entropy are significant. One reason it might be so powerful is that it's a concept that applies recursively. The effect really adds up when it's applied from the large trunks to the tiniest branches of the software architecture.

There are also important software historical examples we can learn from.

This is the story of Unix (and Linux by extension).

Unix was an effort to take Multics, an operating system that had gotten too modular, and integrate the good parts into a more unified whole (book recommendation).

Even though there were some benefits to the modularity of Multics (apparently you could unload and replace hardware in Multics servers without reboot, which was unheard of at the time), it was also its downfall. Multics was eventually deemed over-engineered and too difficult to work with. It couldn't evolve fast enough with the changing technological landscape. Bell Labs' conclusion after the project was shelved was that OSs were too costly and too difficult to design. They told engineers that no one should work on OSs.

Ken Thompson wanted a modern OS so he disregarded these instructions. He used some of the expertise he gained while working on Multics and wrote Unix for himself (in three weeks, in assembly). People started looking over Thompson's shoulder being like "Hey what OS are you using there, can I get a copy?" and the rest is history.

Brian Kernighan described Unix as "one of" whatever Multics was "multiple of". Linux eventually adopted a similar architecture.

The debate didn't end there. The GNU Hurd project was dreamed up as an attempt to create something like Linux with a more modular architecture, with better-defined interfaces between sub-components of the kernel (GNU Hurd's logo is even a microservices-like block diagram).

However, it is Unix and Linux that everyone carries in their pockets nowadays, not Multics and Hurd. The more integrated approach allowed the OS to evolve and improve much faster. Almost all alternative OSs have been filtered out by the markets.

This story is an entertaining arc of tech history. Apple's adoption of Unix was part of quite dramatic, almost messianic events. During the second coming of Steve Jobs, he came down and got rid of classic Mac OS, throwing out a huge chunk of Apple's intellectual property and history, and painfully migrated everyone using Macs to Unix as part of his effort to right a sinking ship.

Just before that, according to Jobs, Microsoft was also rushing to adopt Unix patterns in Windows with the Windows NT project, so there was a bit of a race over who would get the best Unix architecture first. In the video, we get a typical Jobsism: not mincing words, he says Microsoft would fail to complete its migration to Unix and would remain "the worst development environment that has ever been invented".

These threads of computing history provide clues to the importance of streamlined architecture. Systems that had matured into a more componentized architecture had to walk back and re-monolithize to be successful.

The Linux example also provides evidence that the more integrated approach actually helps enable a high number of contributors. There is likely little tradeoff here. Linux might be the code base that has had the most hands involved in the history of computing.

It is likely that the entropy reduction, the simplification and the streamlining allow more developers to jump in and understand what is going on, and let engineers more easily come in and take ownership of an improvement project without having to coordinate too many disparate parts and teams. The tighter, hierarchical scoping makes it easier to see what is affected when something is changed, so improvements can be made with confidence. The tighter scopes also limit any breakage to smaller areas of the project. More is achievable with less code in a single PR that can be submitted and reviewed more holistically for inclusion in the mainline kernel, so that all the implications are visible in one patch and there's less need to search other diffs and code areas to see how the changes might affect other pieces.

Acknowledgements: I want to thank Aaron Lefkowitz, Chase Wackerfuss and Marcos Lehmkuhl for giving me feedback on this post.