Bits and needles in haystacks

Why is Stripe the most valuable YC company when payment processing is such a common, everyday thing that it should be commoditized by now?

One reason is that when you look below the surface, there's a massive amount of complexity: regional rules and regulations, tons of payment standards, taxes and duties, fraud, and so on. This complexity is the defining characteristic of the payment landscape and what makes solving it so valuable.

Solving finance has that same flavor. There's a massive number of edge cases: different ways to issue equity, different contracts, new ones written every day. There are accounting, financial modeling, rules, regulations, reporting and compliance requirements to sort out.

This is what makes solving it so valuable: all these things are painful for the people who have to manage them, and the solutions are complex enough that they can't be commoditized. Costly experts often have to get involved to navigate all this.

But building in complex domains is not easy.

For a tech company operating in a complex domain, success depends on how quickly vast knowledge can be absorbed into its codebase, turning it into a comprehensive encyclopedia of the domain's problems and solutions (and nowadays, on how proprietary AIs can be trained to hold this knowledge).

It's scientific, iterative work to discover, survey and encode this knowledge, and to find the right abstractions to represent it.

I was discussing KPIs recently and it got me thinking about how progress might be measured in a product engineering process.

Measuring engineering productivity has always been a challenge. A widely despised starting point was lines-of-code (LOC) based measures. But why are LOC measures so bad?

The original intuition behind using LOC was not totally misguided, especially in a complex domain where you want to measure how much expertise has been absorbed. LOC, especially LOC of automated tests, once you've removed cruft and optimized abstractions, is related to the amount of knowledge the codebase has accumulated and the number of problems solved. It's a bit like measuring the size of your encyclopedia. But it's still a bad measure.

I think a good KPI for software engineering might look like this:

The number of bits of universal, proprietary, non-overfitted, non-underfitted, non-biased, explanatory knowledge about users' problems and their valuable solutions that has been encoded into the codebase.

Maximizing the accumulation of this type of knowledge is key.

All this is unfortunately difficult to measure. But familiarity with the concepts involved can help perform subjective assessments.

So what do we mean by a "bit" here? What does universal mean? How do the concepts of bias and "fit" play a role?

Let's start with the bit. What are bits in this context? It is said that Claude Shannon invented the bit; he at least defined what it means in the context of knowledge transmission and communication. But let's begin with an example.

How do you find a needle in a haystack?

If you're a software engineer, you might use some kind of binary process: split the haystack in two, split the resulting haystacks in two again, and repeat recursively until the piles are small enough that you can easily spot the needle in one of them.

The number of times you have to split the haystack to find the needle is the number of bits needed to solve this problem.

Bits are a measure of how much you have to narrow things down to find a solution.

A haystack is a great intuitive illustration of a maximum-entropy or "ignorance prior" situation, a state of not knowing. The number of times you have to split the search space to find the solution is the best quantification we have for the knowledge created by solving the problem. If your problem space has many dimensions that interact with each other, finding the solutions can require a lot of divide and conquer.
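To make that concrete, here's a tiny sketch (the haystack size is made up purely for illustration): counting how many halvings it takes to isolate one straw recovers the base-2 logarithm of the haystack size.

// Toy illustration: how many times must we halve a haystack of N straws
// before only the needle's straw-sized pile remains?
function splitsToIsolate(straws: number): number {
  let splits = 0;
  while (straws > 1) {
    straws = Math.ceil(straws / 2); // keep the half that contains the needle
    splits++;
  }
  return splits;
}

console.log(splitsToIsolate(1_000_000)); // 20, since 2^20 ≈ 1,000,000 -- about 20 bits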

In some sense, the more problems feel like finding needles in haystacks, the more value you provide by solving them.

How does this apply more specifically to software development?

Imagine arithmetic had not been invented yet, or at least not implemented in computers, and you were trying to solve the problem: what is a + b for non-negative integers? You want to write a program that can solve this.

How do we quantify the bits of knowledge that can be created by solving this problem?

Bayesian probability theory can essentially be summarized by this: You can't reason properly about information without taking into account the prior, the original state of knowledge (the ignorance state, if that's where you're starting from). Knowledge creation is always relative to a prior state. [1]

So if you want to quantify how much information solving a problem creates, it's useful to write down the ignorance state you're starting from.

Maybe you know that the result of a + b for non-negative integers is a number between 0 and infinity. But maybe you only care about solutions that fit in an integer on your computer, so you can narrow your prior down to a + b being between 0 and the maximum integer value.

An implementation that represents your initial ignorance state might look like this:

func addition(a, b){
  // a uniform random guess over every representable value
  return floor(math.random() * MAX_INTEGER)
}

One thing to notice here is that this implementation has a very low program entropy, or description length: the code is short and could be represented very compactly, e.g. in a kind of scientific notation, just return a value with zero digits of precision. But it has a high entropy of return value. There are MAX_INTEGER potential solutions to an addition, and this program has a p = 1/MAX_INTEGER chance of getting it right. [2]

With 32-bit integers, the number of bits of uncertainty is 32: the binary log of the probability is -32, so the entropy is 32 bits. [3]
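A quick sanity check of that number, assuming MAX_INTEGER is 2^32:

const MAX_INTEGER = 2 ** 32;     // assumed 32-bit range
const p = 1 / MAX_INTEGER;       // chance the uniform random guess is correct
console.log(-Math.log2(p));      // 32 -- bits of uncertainty per answer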

Then the question becomes, how can we improve on this program?

Well, in order to solve a + b you started doing some research and gathering evidence. Maybe you mixed bags of marbles with known counts and counted the total in the mixed bag. You then built a test suite that represents the values you have knowledge about and want to solve for.

func testAddition(){
  assert(addition(2, 2), 4)
  assert(addition(20, 10), 30)
  assert(addition(9999, 1), 10000)
  ...
}

This gathered knowledge can be considered training data for the implementation. Ideal tests are a good sample of what's important for users. If assertions are also non-redundant, non-biased, independent samples of the problem-solution space, this makes them a good quantification of that knowledge (think of the i.i.d. assumption in statistics). [4]

And so with the implementation above, each assertion would have on average a 32-bit log error. For the entire body of knowledge represented in these tests, the test failures would add 32 bits times the number of assertions to the entropy of the program. [5]
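In other words, with a hypothetical test suite of, say, 1000 assertions:

const bitsPerMissedAssertion = 32;  // the random guesser is ~32 bits off per answer
const assertionCount = 1000;        // hypothetical test suite size
console.log(bitsPerMissedAssertion * assertionCount); // 32000 bits of error entropy to account for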

Based on these tests, how can we improve the addition function?

Well you could have an implementation that looks like this:

func addition(a, b){
  if(a==2 && b==2){
    return 4
  }
  if(a==20 && b==10){
    return 30
  }
  if(a==9999 && b==1){
    return 10000
  }
  ...
  // fall back to a random guess for anything not memorized
  return floor(math.random() * MAX_INTEGER)
}

This passes your tests. You've reduced the entropy of the error, the difference between the output and the correct solutions. If your KPI is LOC, you're doing great here.

There are, however, obvious downsides to this approach. It's lengthy code. There's a lot of surface area for potential mistakes. Essentially, it has a high description length (the program is far from the Kolmogorov complexity of the solution). This code also does not generalize to values beyond the exact ones encoded in the function and present in the tests.

The entropy of the error has been lowered but the entropy of the logic itself has been raised since the code is long. These are both measured in units of bits, and can be added up. [6]

With this logic, the entropy of the implementation has grown by just about as much as the entropy of the error has been reduced in the tests. [7] This version can be considered another benchmark: it's better than the first, random-answer implementation, but it overfits the tests about as much as possible. It doesn't generalize at all. It has no universality.

This 1:1 ratio between the reduction in one type of entropy and the increase in the other is sometimes used as a minimum threshold for algorithms that attempt to reduce overfitting. Anything worse than this, anything lengthier than this implementation, would be rejected and culled away immediately.
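One way to picture that break-even point is as a two-part score: bits to describe the program plus bits needed to correct its remaining errors. The numbers below are made-up orders of magnitude, purely for illustration.

// Two-part description length: program bits plus leftover error bits on the test suite.
function totalDescriptionBits(codeBits: number, residualErrorBits: number): number {
  return codeBits + residualErrorBits;
}

// Illustrative, made-up numbers for a suite of 1000 assertions:
const randomGuess = totalDescriptionBits(50, 32 * 1000);     // tiny program, 32 bits of error per assertion
const lookupTable = totalDescriptionBits(32 * 1000 + 50, 0); // program grew by ~as much as the error shrank
console.log(randomGuess, lookupTable); // roughly equal totals -- the break-even point for overfitting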

And it's possible to do much better. For example, define single-bit addition with carry.

func addBit(a, b, carry){ // returns (sum bit, carry out)
  if(a==0 && b==0 && carry==0){ return 0, 0 }
  if(a==0 && b==0 && carry==1){ return 1, 0 }
  if(a==0 && b==1 && carry==0){ return 1, 0 }
  if(a==0 && b==1 && carry==1){ return 0, 1 }
  if(a==1 && b==0 && carry==0){ return 1, 0 }
  if(a==1 && b==0 && carry==1){ return 0, 1 }
  if(a==1 && b==1 && carry==0){ return 0, 1 }
  if(a==1 && b==1 && carry==1){ return 1, 1 }
}

And then use this function to iterate over the bits and add them one by one:

func addition(a, b){
  let carry = 0
  let result = 0
  for i = 0 to 31 { // least significant bit first; the brackets read or write a specific bit
    result[i], carry = addBit(a[i], b[i], carry)
  }
  return result
}
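For the curious, here's a minimal runnable sketch of the same idea, my own translation into TypeScript-style code, assuming unsigned 32-bit values; the bitwise expressions simply compress the addBit truth table above into boolean logic.

// Full adder for one bit position: returns [sum bit, carry out].
function addBit(a: number, b: number, carry: number): [number, number] {
  const sum = a ^ b ^ carry;
  const carryOut = (a & b) | (a & carry) | (b & carry);
  return [sum, carryOut];
}

// Ripple-carry addition over 32 bit positions, least significant bit first.
function addition(a: number, b: number): number {
  let carry = 0;
  let result = 0;
  for (let i = 0; i < 32; i++) {
    const [bit, nextCarry] = addBit((a >>> i) & 1, (b >>> i) & 1, carry);
    result += bit * 2 ** i; // set bit i of the result
    carry = nextCarry;
  }
  return result; // a final leftover carry is simply dropped (integer overflow)
}

console.log(addition(2, 2), addition(20, 10), addition(9999, 1)); // 4 30 10000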

This bit-by-bit adder is a much smaller implementation, so small that it is pretty much always implemented directly in silicon. And on top of passing all the tests in our sample, it is likely to generalize to many other values of a and b (at least those that don't result in an integer overflow). For each of the assertions, you've reduced the entropy of the error from 32 bits down to zero compared to the first implementation. But the length in bits of the implementation remains small compared to the number of bits that can be solved in the test suite. You've reduced the entropy of the error while minimizing the increase in entropy from the implementation itself (relative to the math.random() * MAX_INTEGER implementation). Your net entropy is much better.

You could have millions of values in your tests and the addition implementation would not grow. The entropy of the test errors with the math.random() implementation would have been 32 multiplied by those millions of assertions; the error reduction with the new implementation would have been just as large, but the entropy of the code itself, the number of bits added to the implementation, would have stayed small.

That ratio of corrected bits in tests to bits it took to implement the solution would be very large. This is the universality we are looking for.
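In rough numbers (both figures below are made-up orders of magnitude, not measurements):

const correctedBits = 32 * 1_000_000;  // hypothetical: a million assertions, each corrected from 32 bits of error to zero
const implementationBits = 300;        // rough guess at the adder's compiled, compressed size
console.log(correctedBits / implementationBits); // on the order of 100,000:1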

Minimum description length (MDL) principles say that finding the shortest program that explains your data is equivalent to finding the one that generalizes best to untested and unknown cases. It gets you the best abstractions.

A team recently leveraged this to solve ARC-AGI problems without any training data.

Minimum description length is a very useful property, and it's why LOC measures that reward writing long programs are so detrimental.

The fundamentals behind this are also known as Solomonoff induction.

If you are dealing with complex knowledge, knowledge where you often can't comprehensively test every edge case, what you can do is build a good sample of problems and solutions and then try to find abstractions, program structures, that reduce implementation size (and the size of tests). The implementations you find after minimizing the code length are then more likely to generalize beyond your known good values and solve the problems more broadly.

These ideas about knowledge are very general. UI design, for example, is also about finding the simplest, shortest flows and abstractions that can explain to users how to get to their solutions.

Since this is all based on information theory and Bayesian principles, it's very broadly applicable. The introduction to Claude Shannon's book states:

The word communication will be used here in a very broad sense to include all of the procedures by which one mind may affect another. This of course, involves not only written and oral speech, but also music, the pictorial arts, the theater, the ballet and in fact all human behavior. In some connections it may be desirable to use a still broader definition of communication, namely, one which would include the procedures by means of which one mechanism affects another mechanism.

I also added the "proprietary" qualifier above. This is based on Schumpeterian economics, which says that widely available knowledge is insufficient to make a profit, since margins will get competed away by businesses that also have it.

Once a product has absorbed all the knowledge that is available online (or nowadays available via LLMs and such), you have to absorb further expertise that will differentiate your offering from competitors (ideally expertise that is difficult for them to get their hands on).

This post expounds on some concepts from the code entropy post.

Acknowledgements: I'd like to thank Vlad Iovanov for feedback on this post.

[1] Measures of entropy in bits such as KL-divergence follow this rule. The prior state is sometimes somewhat subjective, which means it's not always easy to calculate an exact entropy, but practically this doesn't matter. With different weak priors, you'll only be off by a small constant that you can ignore anyway when comparing different implementations.

[2] Or, if you interpret it as a Monte Carlo simulation of an underlying distribution, it assigns 1/MAX_INTEGER probability to the right value (or any other integer value). In another programming language such as R, you might specify the value as uniformly distributed from 0 to MAX_INTEGER. With scientific notation you might write something along the lines of a value with zero digits of precision. These are different ways to represent the same thing. I find the Monte Carlo version more intuitively appealing for programmers.

[3] Technically, negative log probability and entropy are different things, but they coincide for a uniform distribution, and log probs are a great starting point for thinking about entropy.

[4] If you want a correct quantification of knowledge, and the possibility of correct entropy calculations, the number of assertions in tests theoretically matters. They should match the evidence you have about the domain. If you have redundant assertions, for example, they will count when you are tallying log error reductions and weighing them against code entropy, and may bias your abstractions toward solving for the redundant assertions instead of abstractions that balance different concerns and knowledge more correctly. This might not matter much practically, because humans often rely on knowledge in their heads instead of in tests. It matters more for the training of systems that generate code automatically, such as LLMs, but there are benefits even for human software devs. A representative test suite leads to better abstractions and code that generalizes better to future or unknown use cases. Redundant assertions, or assertions that test for specific implementation details irrelevant to users, may ossify the code base and prevent it from evolving towards the optimal abstractions. Technically, the number of assertions can be seen as Bayesian hyperparameters that best represent the knowledge. For example, if you are trying to estimate how biased a coin is by counting the number of flips that come up heads and tails, having seen 2 and 2 of each or 100 and 100 of each gives you the same maximum likelihood estimate of a 50% unbiased coin, but the 200-flip experiment gives you a lot more certainty that this estimate is not going to change as you get more samples. This type of information can be taken into account when choosing how best to represent a coin. As you get more samples, if you start seeing a clear bias in the data, maybe you no longer represent the coin with a hardcoded .5 probability but add a configurable parameter for the bias. Software developers make these types of decisions all the time without even noticing as they build code abstractions. It is legitimate to deliberately abuse statistics here if you have reasons to want to oversample some values because you know it's very valuable to get those particular values right. It's a legitimate manipulation of the hyperparameter space.

[5] Entropy of error is how many bits of data, on average, you would have to add to correct the error of your program: how far, in bits, you are from the correct answer at the resolution you want.

[6] Technically, the true minimum description of the implementation logic in bits would be closer to the compiled and compressed logic, not the length in a high-level programming language.

[7] I'm hoping I got the MDL calculations right here. Technically, it's about calculating the length of the compiled and compressed code that generates the inputs and outputs in the tests. It doesn't include the code to verify those outputs, but it does include any code needed to correct errors: for a magnitude-8 error, you have to add 3 bits for the 3-bit number that needs to be added.