Leveraging the lightcone around the source of truth with postgres

So we discussed software abstraction, tying the concept to probabilistic Bayesian models. We argued that through simplification and reduction of description length, by shrinking program sizes, abstraction can allow programs to better generalize and interpolate into the unknown. We dove into code entropy with a focus on the downsides of broad scopes. I then followed up with a post about software componentization, maybe wading into controversy by pointing out the downsides of breaking software into smaller testable units.

For this post, I'm going to start by acknowledging that, of course, there's code that benefits from being split into independent components, abstracted away, tested in isolation, and mocked in other tests. Some say that the database should be treated this way. However, I'm going to court further controversy and advocate for the opposite: for postgres-centrism, for keeping the relational db a tightly coupled core component of software.

While databases are often perceived primarily as persistent storage, the true role of a relational database is managing the source of truth amid concurrent access. It's a coordination layer. The fact that persistent storage has to happen in the same place to be reliable is incidental, not a defining attribute of a relational database.

While discussing the concepts of code entropy, minimum description length, and dependency distances in previous posts, I briefly mentioned that there are implications from physics for the physical manifestation of computation.

These concepts might apply not only to how logic is organized in code but also to how information is laid out physically. This is in line with the insight of Maxwell's demon, the thought experiment that highlighted the equivalence between thermodynamic entropy and information entropy. The equivalence is not surprising: information ultimately has to be embodied in physical particles to exist.

The physical layout of information is particularly relevant for problems that involve data locality, which comes up in source-of-truth coordination and transactional requirements.

Information and computation are best laid out hierarchically in tightly scoped trees of nested code and data. This lowers entropy and optimizes abstractions for generalization.

It might also be the optimal layout for the physical bits being processed, packed in concentric perimeters of distance-minimized caches around the physical CPU. Having the source of truth close by makes possible the logic necessary for transactions and coordination.

Thermodynamically, reducing code length also means reducing energy spent executing the code.

The divine coincidence, the equivalence between good abstraction and low entropy, might have been key for the evolution of intelligence and might be responsible for LLMs being smarter than people can explain.

Humans seem to have evolved some of the most advanced common-sense mechanisms for reasoning about uncertainty. Computers, at least pre-LLM computers, often lacked this. They were often awkward or nonsensical when uncertainty was involved.

Going beyond processing point estimates and discrete answers, to combining and weighting the uncertainty around the various aspects of a problem, is crucial for implementing correct reasoning.

Statistics software like the R programming language can do this, but it takes a lot of computation. Probability distributions, their parameters, and maybe hyperparameters have to be defined and often simulated with Monte Carlo algorithms. The structure of the models has to be manually laid out.

You might think that properly tracking uncertainty would always require more computational resources than simple point estimates, but surprisingly, this doesn't seem to be the case. Minimum description principles, on top of optimizing abstractions, allow for tracking uncertainty by carefully culling parts of programs instead of adding more parameters and more math.

This is how scientific notation works. Uncertainty is not tracked by keeping a separate confidence interval for each value, but by simply omitting digits: writing a measurement as 1.23 rather than 1.2300 implicitly says that nothing is known beyond the second decimal place.

It's fairly straightforward when it's about plain numbers. When it comes to defining the structures of models and programs, tracking inter-dependencies, memory address references, "attention" in LLMs, and taking into account the distances of these references, the benefits of culling bits play out in more complex ways, but they still do seem to apply.

It's possible that both LLMs and human brains were able to evolve to track and operate over uncertainty relatively easily because this task is aligned with conserving energy, a tendency that was important anyway, to save electricity or calories. It's wasteful to spend bits in GPUs or use up neurons to track information (including information about the structure of models) more precisely than warranted by the data (for example, the bit depth of LLM weights has gone down over time, from 32 or 16 bits sometimes down to less than 2 bits).

Optimizing by culling bits seems to make intelligence almost magically emerge out of some systems.

And so here's a hypothesis, and I can't say I'm 100% on this, it's somewhat speculative: when code and data are structured along these lines, both for pure abstract code and for the physical configuration of data, keeping things hierarchically organized close to the coordinated source of truth, more intelligence will also kind of emerge out of software.

This means the software will be more predictive in the face of uncertainty, in the face of the unknown; work more along the lines of causal reasoning; be more in line with a kind of Occam's-razor, Bayesian-style reasoning that looks like common sense to users. Not to mention the code itself will be easier for engineers to read and reason about, more predictable, more robust in the face of changes and improvements, and more representative of the state of knowledge about the domain.

You may have heard this joke based on the "This meeting could have been an email" meme: "this microservice could have been a join". When you unnecessarily split logic out and away from the db, you add entropy, ways for things to go wrong. It's possible to add tests to mitigate this entropy, but what does that really accomplish? It's digging a hole and filling it back up.
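To make the joke concrete, here's a minimal sketch (the customers and orders tables are hypothetical stand-ins): the cross-service lookup that would otherwise live behind an HTTP endpoint collapses, inside the db, into a single declarative join that the planner optimizes and a single snapshot keeps consistent.

```sql
-- Hypothetical schema: customers and orders could each have been their own
-- microservice with its own store. Kept as two tables in one Postgres
-- instance, the "service call" is just a join.
SELECT c.name,
       o.total_cents,
       o.placed_at
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.placed_at >= now() - interval '30 days';
```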

DB-side joins (and other kinds of db logic) tend to have less possibility for error. It's never necessary to test each join independently; you test the whole query instead. This simplicity and reduction of error cases is often only possible with locality to the source of truth. This reduction in degrees of freedom is important for good abstraction. In Bayesian terms, it raises the prior probability of correct models.

Because data locality is exploited, logic performed inside the db instance avoids important classes of errors. If, instead, data were joined away from the db after querying tables separately, you would no longer get an atomic, consistent view of the data (even if done within a transaction, unless the isolation level was raised). A commit could have happened on the second table between the two queries. It gets quite tricky to guarantee consistency outside the db. You hit the limits of the CAP theorem and a hundred other impossible-to-solve problems.
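A sketch of what that looks like in Postgres, using the same hypothetical tables as above: under the default READ COMMITTED level each statement gets its own snapshot, so two separate reads can straddle another session's commit, while raising the isolation level, or keeping the logic in a single statement, pins everything to one snapshot.

```sql
-- Default READ COMMITTED: each statement sees a fresh snapshot,
-- so these two reads can observe different committed states.
BEGIN;
SELECT * FROM customers WHERE id = 42;
-- another session commits new orders for customer 42 here
SELECT * FROM orders WHERE customer_id = 42;  -- sees the newer commit
COMMIT;

-- Raising the isolation level pins the whole transaction to one snapshot.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM customers WHERE id = 42;
SELECT * FROM orders WHERE customer_id = 42;  -- same snapshot as the first read
COMMIT;

-- Or keep the logic in the db: a single statement is always evaluated
-- against one consistent snapshot.
SELECT c.name, o.total_cents, o.placed_at
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.id = 42;
```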

Even when breaking out computation, often not all data should be moved out for processing. The db will often need to keep lock states in its local caches before sending data out for outside processing. It should remain the center of gravity for the source of truth.
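Here's a minimal sketch of that pattern (the jobs table and its columns are hypothetical): rows are claimed and locked inside the db with FOR UPDATE SKIP LOCKED, the payloads are crunched outside, and the results are written back before the locks are released.

```sql
-- Claim work inside the db before shipping it out for external processing.
-- The row locks stay with the source of truth; only the payload travels.
BEGIN;

SELECT id, payload
FROM jobs
WHERE status = 'pending'
ORDER BY id
LIMIT 10
FOR UPDATE SKIP LOCKED;  -- lock the claimed rows, skip rows other workers hold

-- ...application code processes the payloads outside the db...

UPDATE jobs
SET status = 'done',
    result = $1          -- result computed externally
WHERE id = ANY($2);      -- ids of the claimed rows

COMMIT;  -- locks released; the db stayed the arbiter of who owned what
```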

The more logic happens near the db, the more the surface for errors and edge cases shrinks and the more reliability issues are avoided. The reduction in distances may also result in improved abstractions and in making the computations efficient in a fundamental thermodynamic way.

This is especially important for high-complexity domains where there are lots of pieces of information to coordinate in complex ways.

This, in my mind, often makes it beneficial to have a tight integration with the db, to promote more logic happening in SQL, and to structure tests to cover this db logic. As always, though, it's a balancing act. Breaking out things like number crunching makes sense, since it might be too resource-intensive to perform in the db, which can lead to an early need for sharding (increasing physical and logical distances in other ways).
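As a sketch of what testing db logic can look like, assuming the pgTAP extension and the hypothetical tables from earlier: instead of mocking the database and testing fragments in isolation, run the real query against fixture rows and roll everything back.

```sql
-- db-level test, assuming pgTAP is installed (CREATE EXTENSION pgtap).
BEGIN;
SELECT plan(1);

INSERT INTO customers (id, name) VALUES (1, 'Ada');
INSERT INTO orders (customer_id, total_cents, placed_at) VALUES (1, 1200, now());

SELECT results_eq(
  $$ SELECT c.name::text, sum(o.total_cents)::bigint
     FROM customers c
     JOIN orders o ON o.customer_id = c.id
     GROUP BY c.name $$,
  $$ VALUES ('Ada'::text, 1200::bigint) $$,
  'order totals join as expected'
);

SELECT * FROM finish();
ROLLBACK;  -- fixtures never outlive the test
```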

The fact that some logic, especially coordination logic, can't be made functionally pure and edge-case-free without physical proximity to the source of truth might explain why SQL is the most popular functional language. It's the one that is able to fulfill the promise most completely.
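You can see the functional flavor in how queries compose (again on the hypothetical customers and orders tables): each CTE is a pure transformation of the one before it, and the whole pipeline is evaluated against a single consistent snapshot.

```sql
WITH recent AS (          -- orders from the last 30 days
  SELECT customer_id, total_cents
  FROM orders
  WHERE placed_at >= now() - interval '30 days'
),
totals AS (               -- aggregate per customer
  SELECT customer_id, sum(total_cents) AS total_cents
  FROM recent
  GROUP BY customer_id
)
SELECT c.name, t.total_cents
FROM totals t
JOIN customers c ON c.id = t.customer_id
ORDER BY t.total_cents DESC;
```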

To summarize, the postgres-centrism hypothesis says that integrated, entropy-minimizing systems that reduce dependency distances and anchor logic near the source of truth, leveraging the efficiencies of proximity, often align better with the fundamentals of computing, physics, information theory, thermodynamics, and intelligence. The pieces fit surprisingly well together. It might be the best way to leverage the lightcone around the source of truth.

Acknowledgements: I want to thank Aaron Lefkowitz for feedback on this post.