Integration tests are the best

This is a follow-up to the code entropy post. It focuses on test granularity and on organizing code to scale teams. A few counterarguments came up to the idea of promoting hierarchically organized, tightly scoped code.

First, people seem nervous about letting go of the idea of splitting things up so that everything can be built, tested and reviewed in isolation, with "well-defined contracts" between components.

The second argument I heard was that yeah, maybe reducing entropy and dependency lengths makes the code more optimal in a theoretical sense, but that slight theoretical optimization is not worth sacrificing more practical concerns, such as having an architecture that works for larger teams of engineers. I'll address this too.

There are not necessarily super clear, black-and-white answers to these concerns. All the patterns being discussed are valid in some contexts; the problem is when they are applied to the wrong ones. As always, as a Bayesian, one important goal is to avoid overfitting on a particular pattern.

Letting go of breaking code into smaller units

The first counterpoint directly ties into debates about unit tests versus integration tests. Inlined logic basically can't be designed and tested in small units and is forced to be tested as part of larger integrations.

I'm not saying testing parts in isolation is bad. It's often good. But it's important to notice that there are real tradeoffs involved when choosing the granularity of components.

This image has been making the social media rounds lately:

It's a bit of a Rorschach test of engineering philosophy, but in this context you can read it like this: on the left is a design made of separate components, with the benefit that each part can be replaced and tested in isolation; on the right is a 3D printed design where components are nested inside each other, removing the need for interconnections and streamlining the design to its essential parts. It's 21% more powerful and likely more reliable, but it sacrifices things like repairability and the ability to design and test components in isolation, independently of each other.

It's my understanding that, although it may look like it, components were not removed in Raptor 3 (except for connections); they were nested and embedded in each other. The tradeoffs are quite visible.

Now, this is mechanical design; maybe, you could say, it doesn't work as a metaphor for software. And it's true that things are a bit more subtle with software.

But you could also argue that nesting things is even more important in software, because otherwise there tends to be more premature reuse of components. Imagine Raptor 1, but where the interconnected component designs are automatically reused in other rocket engines, so that when a design is changed to fit the needs of one engine, the other engines are immediately affected in unpredictable ways.

One commonly cited reason for having smaller, isolated units, with well-defined contracts and values as the boundaries of tests, is that as you combine components the states multiply: there is an explosion of possible states, and it becomes very difficult to test every one of them.

Testing in smaller units reduces the number of states in play, so you can cover a higher fraction of each unit's states. It also removes some of the redundancy you would get from re-testing a unit under many different configurations of upstream component states, and test performance is usually better.
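To make the state multiplication concrete, here is a toy sketch with made-up component states: composing two components multiplies their state counts, while testing them in isolation only adds them.

```python
from itertools import product

# Made-up example: meaningful input states of two hypothetical components.
states_a = ["empty_cart", "one_item", "bulk_order",
            "expired_session", "guest_user", "admin_user"]
states_b = ["card", "gift_card", "invoice", "card_declined",
            "partial_refund", "full_refund", "chargeback", "retry"]

print(len(states_a) + len(states_b))            # 14 cases when tested in isolation
print(len(list(product(states_a, states_b))))   # 48 combinations when composed
```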

This is all true, and these are some of the reasons why it's sometimes worth testing units in isolation.

However, these benefits have to be weighed against significant downsides.

In the last post, I mentioned briefly the challenge of keeping tests representative of the users' concerns and not overfitted to irrelevant details.

In complex domains there is, regardless of the approach, a vast number of states to test, and it's difficult to cover a high percentage of them. Even high line-of-code coverage is just the beginning: have you tested every line with enough different values?

Unless you have truly massive engineering resources, you'll be forced to test a representative sample of users' concerns. You can maximize the benefits of these tests by building good abstractions such that those tests interpolate and generalize well to other untested or unknown values.

Breaking code out into small units can sometimes prevent building those well fitted abstractions.

Good integration tests let you improve internals and refactor how smaller units fit together without having to discard or rewrite tests, all while those representative tests keep protecting you. This promotes improvement of the internals and of how the parts fit together.

With representative integration tests, if one sub-component changes in a way that affects what it sends to another sub-component, the new values automatically flow through that second sub-component, even if it had no test coverage for them. The integration test can catch cross-component regressions automatically, whereas with unit tests you might have to manually add coverage to the second unit, the one you are not even working on, to reflect the new values the first component will be sending it. I find it's very rare that an engineer bothers to do that, so regressions in cross-feature interactions are often missed.

That is, when you test smaller pieces you have to define the inputs and outputs of each small piece, which is potentially many more things to define. It's sometimes (not always) easier to define the inputs and outputs of larger units and let the values flow through, automatically exercising multiple smaller units. Then, when the tests are adjusted to reflect new realities, the inputs reaching all the inner pieces are adjusted automatically.
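As a toy illustration (the pipeline and function names here are hypothetical), a single integration test exercises two sub-components together, so whatever the first one emits automatically flows into the second:

```python
import pytest

# Hypothetical pricing pipeline made of two sub-components.

def compute_subtotal(items):
    # First sub-component: sums up (quantity, price) line items.
    return sum(qty * price for qty, price in items)

def apply_discount(subtotal):
    # Second sub-component: 10% off orders of 100 or more.
    return subtotal * 0.9 if subtotal >= 100 else subtotal

def price_order(items):
    # The integration point under test.
    return apply_discount(compute_subtotal(items))

def test_price_order():
    # Asserts only on what the user sees: the final price.
    # If compute_subtotal changes what it emits (say, it starts including tax),
    # the new values flow straight into apply_discount, so cross-component
    # regressions are caught without adding unit-level fixtures by hand.
    assert price_order([(10, 12.0)]) == pytest.approx(108.0)
    assert price_order([(1, 12.0)]) == pytest.approx(12.0)
```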

When there are multiple ways of implementing something and it doesn't matter to the user which one you pick, it's a good idea to avoid tests that assert on one specific way. Tests should represent what users care about; otherwise the test suite overfits, gets noisy, and the code becomes difficult to refactor and improve.
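As a small contrast (again with hypothetical names), the first test below pins the implementation to an internal helper and breaks under a harmless refactor, while the second only asserts the result users care about:

```python
from unittest.mock import patch

class PriceCalculator:
    def total(self, items):
        return self._sum(items)   # current implementation detail

    def _sum(self, items):        # internal helper that could be inlined away
        return sum(items)

def test_total_overfitted():
    # Overfitted: asserts that the _sum helper was used.
    # Inlining _sum, a harmless refactor, breaks this test.
    with patch.object(PriceCalculator, "_sum", return_value=6) as spy:
        assert PriceCalculator().total([1, 2, 3]) == 6
        spy.assert_called_once_with([1, 2, 3])

def test_total_behavioral():
    # Behavioral: asserts only the result users see.
    assert PriceCalculator().total([1, 2, 3]) == 6
```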

An overfitted test suite is also simply a poor description of the requirements, a poor embodiment of our state of knowledge about the domain.

Smaller unit tests have more of a tendency to overfit to implementation details instead of user requirements.

Some say unit tests don't hinder refactoring because you can simply discard the old unit tests along with the old implementation. But it's often better to have tests that can keep doing their part through a refactoring.

If you've been reading my other posts, you've seen that I like to judge code similarly to how we might evaluate ML models.

You can compare how code evolves to how neural networks are trained.

Neural nets are trained by presenting inputs and outputs to their top and bottom layers and gradually evolving all the internal weights, using backpropagation, to fit the outer data.

If, at some point during training, instead of letting knowledge backpropagate from the outside, the weights of the inner layers were asserted to specific values, it could prevent those inner weights from evolving to their optimal values.
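Here is a toy sketch of that effect (not a real network, just two scalar "layers" fit by gradient descent): freezing the inner weight at an asserted value keeps the composition from ever fitting the data, because no error can propagate past it.

```python
import numpy as np

# Toy "network": y_hat = w2 * (w1 * x), trained on data from y = 2 * x.
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x

def train(freeze_inner):
    w1, w2 = 0.0, 0.1   # w1 starts at the "asserted" value of 0.0
    lr = 0.1
    for _ in range(500):
        err = w2 * w1 * x - y
        grad_w2 = np.mean(2 * err * w1 * x)
        grad_w1 = np.mean(2 * err * w2 * x)
        w2 -= lr * grad_w2
        if not freeze_inner:
            w1 -= lr * grad_w1   # let the inner weight evolve too
    return np.mean((w2 * w1 * x - y) ** 2)

print("loss with inner weight frozen:", train(freeze_inner=True))   # stays high (~1.4)
print("loss with inner weight free:  ", train(freeze_inner=False))  # drops to ~0
```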

Neural networks nowadays are used to learn abstract models of language, not unlike computer programs, so this metaphor is rooted in something real.

Asserting on internals can lead to a kind of abstraction sclerosis.

Metaphorically, the code might get stuck in its Raptor 1 design and be difficult to evolve into the streamlined Raptor 3.

There is another important concept from ML called "differentiability". Neural nets are built on differentiable matrices of weights, and that differentiability is what allows them to gradually converge toward the best solution by sliding the weights along a slope.

Analogously, for code, differentiability is how easy it is to refactor, split, or recombine pieces, to generally kind of slide the program into its optimal abstractions through iterative improvements. This is easier when scopes are narrow, when logic is kept local, and when tests are not too granular.

Most programs can't achieve true differentiability; you can't slide into the optimal program without making discrete jumps. But you can minimize the size of the required jumps and reap some of the benefits of differentiability.

Some programs can get close to true differentiability, for example when they're pure floating-point equations, like neural net weights. There it's possible to leverage autodiff to adjust internals based on derivatives of the output or error with respect to those internal parameters; that's why LLMs can be trained. The lessons that keep LLMs from getting stuck during training are applicable to manually written code bases too.

Ultimately, even sliding floating-point parameters by following the derivative is not "true" differentiation. Computers have the bit as their smallest quantum, so it's all small jumps; it's the best we can do in the digital world.

Another piece of evidence for the benefits of enabling small local jumps is that Monte Carlo methods, which don't rely on sliding things into place, also benefit from a space that is explorable via small jumps. It's easier for these algorithms to avoid getting stuck in local optima, and it helps them avoid poor mixing.

These are some of the theoretical reasons why I like an architecture that enables easy, incremental refactoring of the different pieces. Keeping the code base agile in this way avoids getting stuck in a Big Ball of Mud.

To summarize the ML metaphor: test cases can be seen as a training set. They should be a good representation of the domain's requirements from the users' perspective, and by default it should be easy and safe to improve the internals with minimal changes to the tests, while the tests keep catching regressions. Integration tests are best at this.

Another argument against smaller unit tests is that intermediate unit states are sometimes not even well defined. I've seen accounting software requirements where accounting firms don't agree on the best methodology for intermediate calculations; as long as the final results are approximately correct, they will pass audits. Asserting on intermediate results here is a clear case of overfitting.
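In test form, that might look something like the sketch below (the numbers, names and tolerance are made up): assert on the audited final figure within an agreed tolerance rather than on any one firm's intermediate methodology.

```python
import pytest

# Hypothetical year-end close: firms may compute intermediate allocations
# differently, but the final net position must come out the same.
def close_books(transactions):
    # ...whatever internal methodology; the test deliberately doesn't pin it down.
    return round(sum(transactions), 2)

def test_year_end_net_position():
    txns = [1200.00, -349.99, 87.50, -420.25]
    # Assert only the final figure, within a rounding tolerance.
    assert close_books(txns) == pytest.approx(517.26, abs=0.01)
```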

You might notice that the arguments above also apply to end-to-end tests. Why not emphasize those instead? Sometimes you'll want them, but while the goal here was to point out the downsides of too fine a granularity, there are also practical downsides to units that are too large, especially around test run times. As always, the goal is to achieve the best fit to your needs, and integration tests tend to lie in that happy middle.

Weighing the concerns of tightly nested components vs architecting for larger teams

The second counterpoint I heard was that having information-theoretically optimal code is not as important as other concerns, such as allowing for greater team sizes.

Now assessing the importance of each concern is not easy and a bit of a judgment call.

There are also important software historical examples we can learn from.

This is the story of Unix (and Linux by extension).

Unix was an effort to take Multics, an operating system that had gotten too modular, and integrate the good parts into a more unified whole (book recommendation).

Even though the modularity of Multics had some benefits (apparently you could unload and replace hardware in Multics servers without a reboot, which was unheard of at the time), it was also its downfall. Multics was eventually deemed over-engineered and too difficult to work with; it couldn't evolve fast enough with the changing technological landscape. Bell Labs' conclusion after the project was shelved was that OSs were too costly and too difficult to design, and engineers were told that no one should work on OSs.

Ken Thompson wanted a modern OS so he disregarded these instructions. He used some of the expertise he gained while working on Multics and wrote Unix for himself (in three weeks, in assembly). People started looking over Thompson's shoulder being like "Hey what OS are you using there, can I get a copy?" and the rest is history.

Brian Kernighan described Unix as "one of" whatever Multics was "multiple of". Linux eventually adopted a similar architecture.

The debate didn't end there. The GNU Hurd project was dreamed up as an attempt to create something like Linux but with a more modular architecture, with better-defined interfaces between sub-components of the kernel (GNU Hurd's logo is even a microservices-like block diagram).

However, it is Unix and Linux that everyone carries in their pockets nowadays, not Multics and Hurd. The more integrated approach allowed the OS to evolve and improve much faster. Almost all alternative OSs have been filtered out by the markets.

This story is an entertaining arc of tech history. Apple's adoption of Unix was part of quite dramatic, almost messianic events. During the second coming of Steve Jobs, he came down and got rid of classic Mac OS, throwing out a huge chunk of Apple's intellectual property and history, and painfully migrated everyone using Macs to Unix as part of his efforts to right a sinking ship.

Just before that, according to Jobs, Microsoft was also rushing to adopt Unix patterns in Windows with the Windows NT project, so there was a bit of a race over who would have the best Unix architecture first. In the video, we get a typical Jobsism: not mincing words, he says Microsoft would fail to complete the migration to Unix and would remain "the worst development environment that has ever been invented".

Nowadays, Jobs' chosen abstractions endure. If you develop for iPhone you see the NS namespace everywhere in component libraries, a reference to NeXTSTEP, Jobs' variant of Unix.

These threads of computing history provide clues to the importance of streamlined architecture. Systems that had matured into a more componentized architecture had to walk back and re-monolithize to be successful.

The Linux example also provides evidence that the more integrated approach actually helps enable a high number of contributors; there is likely little tradeoff here. Linux might be the code base that has had the most hands involved in the history of computing.

It is likely that the entropy reduction, the simplifications and the streamlining allow more developers to jump in and understand what is going on, and let engineers more easily come in and take ownership of an improvement project without having to coordinate too many disparate parts and teams. The tighter, hierarchical scoping makes it more readable what is affected when something is changed, so improvements can be made with confidence, and it limits any breakage to smaller areas of the project. More is achievable with less code, in a single PR that can be submitted and reviewed more holistically for inclusion in the mainline kernel, so that all the implications are visible in one patch and there's less searching through other diffs and code areas to see how the changes might affect other pieces.

Optimizing abstractions for independent, isolated components often means deoptimizing the design as a whole. Advanced engineering, whether it's rocket engines, refined sports cars or software like Unix, often means tailoring parts to fit perfectly into each other instead of connecting generic reusable parts. This is also what the Bayesian, information-theoretic principles point to.

Acknowledgements: I want to thank Aaron Lefkowitz, Chase Wackerfuss and Marcos Lehmkuhl for giving me feedback on this post.