
334 points gjvc | 2 comments
throwaway892238 ◴[] No.31849720[source]
This is the future of databases, but nobody seems to realize it yet.

One of the biggest problems with databases (particularly SQL ones) is that they're a giant pile of mutable state. The whole idea of "migrations" exists because it is impossible to "just" revert an arbitrary change to a database, diff changes automatically, or merge changes automatically. You need some kind of intelligent tool or framework to generate DDL, DML, and DCL; the statements have to be applied in order; something has to check whether they've already been applied; and so on. And of course you can't roll back a change once it's been applied, unless you write even more program logic to figure out how to do that. It's all a big hack.

By treating a database as version-controlled, you can treat any operation as immutable. Make any change you want and don't worry about conflicts. You can always just go back to the last working version, revert a specific change, merge in one or more changes from different working databases. Make a thousand changes a day, and when one breaks, revert it. No snapshotting and slowly restoring the whole database due to a non-reversible change. Somebody dropped the main table in prod? Just revert the drop. Need to make a change to the prod database but the staging database is different? Branch the prod database, make a change, test it, merge back into prod.
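As a toy sketch of what "treating the database as version-controlled" buys you (illustrative Python only, not any real database's API): commits are immutable content-addressed snapshots, a branch is just a named pointer, and reverting a bad change is a pointer move rather than a restore.

```python
import hashlib
import json

# Toy sketch: commits are immutable snapshots keyed by a content hash,
# a branch is a named pointer into the commit graph, and "revert" means
# moving the pointer -- history itself is never mutated.
class Repo:
    def __init__(self):
        self.commits = {}              # commit id -> (parent id, table snapshot)
        self.branches = {"main": None}

    def commit(self, branch: str, tables: dict) -> str:
        parent = self.branches[branch]
        payload = json.dumps([parent, tables], sort_keys=True).encode()
        cid = hashlib.sha256(payload).hexdigest()[:12]
        self.commits[cid] = (parent, tables)
        self.branches[branch] = cid
        return cid

    def tables(self, branch: str) -> dict:
        return self.commits[self.branches[branch]][1]

    def revert_to(self, branch: str, cid: str) -> None:
        # History is append-only; "undo" is just moving the branch pointer.
        self.branches[branch] = cid

    def branch(self, name: str, from_branch: str) -> None:
        # Branching is O(1): copy a pointer, not the data.
        self.branches[name] = self.branches[from_branch]

repo = Repo()
good = repo.commit("main", {"users": ["alice", "bob"]})
repo.commit("main", {})            # someone dropped the main table in prod
repo.revert_to("main", good)       # ...so we just point back at the good commit
repo.branch("staging", "main")     # and branch it to test the next change
assert repo.tables("main") == {"users": ["alice", "bob"]}
```

A real system has to make merge and diff efficient on large tables, which is where the storage-engine arguments later in this thread come in.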

The effect is going to be as radical as the popularization of containers. Whether you like them or not, they are revolutionizing an industry and are a productivity force multiplier.

replies(11): >>31849825 #>>31849875 #>>31849951 #>>31850566 #>>31850778 #>>31851109 #>>31851356 #>>31852067 #>>31853553 #>>31858826 #>>31865675 #
jandrewrogers ◴[] No.31852067[source]
This is how relational databases have commonly worked since at least the 1990s and is called multi-version concurrency control (MVCC). Welcome to the future, it is called PostgreSQL. There are at least two reasons no sensible database designer would allow users to operate a database in this way even though they are technically capable of it.

First, keeping every version of every piece of data forever is an excellent way to consume non-intuitively vast amounts of storage even if your data model is tiny. Every time this feature has been offered by a database, it has immediately caused a rash of "out of storage" errors that force the user to manually and permanently delete large numbers of old versions. This is extremely user-unfriendly, so the feature is almost always removed in subsequent versions because the pain it causes far outweighs the benefits even when used carefully. In typical MVCC systems, old versions are aggressively garbage collected automatically to limit out-of-storage errors.

Second, finding or reconstructing an arbitrary number of old versions of data is unavoidably expensive. Much of the architectural variation between MVCC implementations comes from trying to manage the rather severe performance tradeoffs of maintaining multiple versions of data and navigating to the version you need, with the understanding that all of these versions live on storage and rarely in a single place. There is no optimal way to do this, and keeping version chains short is critical for good performance.
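Both points can be sketched in a few lines (a minimal illustration of a timestamp-ordered MVCC version chain, not any particular engine's implementation): writers prepend versions, snapshot readers walk the chain, and a garbage collector truncates everything below the oldest active snapshot.

```python
# Minimal MVCC version chain for one row. Writers prepend; readers walk
# the chain to the newest version visible at their snapshot timestamp;
# GC drops versions no active transaction can still see.
class VersionedRow:
    def __init__(self):
        self.versions = []               # newest first: (commit_ts, value)

    def write(self, ts: int, value) -> None:
        self.versions.insert(0, (ts, value))

    def read(self, snapshot_ts: int):
        # Read cost grows with chain length: an old snapshot may have to
        # walk the whole chain -- the reason short chains matter.
        for ts, value in self.versions:
            if ts <= snapshot_ts:
                return value
        return None

    def gc(self, oldest_active_snapshot: int) -> None:
        # Keep everything above the watermark, plus the single newest
        # version at or below it; older versions are invisible to every
        # live transaction and can be reclaimed.
        keep, kept_visible = [], False
        for ts, value in self.versions:
            if ts > oldest_active_snapshot:
                keep.append((ts, value))
            elif not kept_visible:
                keep.append((ts, value))
                kept_visible = True
        self.versions = keep

row = VersionedRow()
for ts in (1, 5, 9):
    row.write(ts, f"v{ts}")
assert row.read(6) == "v5"               # snapshot at 6 sees the write at 5
row.gc(oldest_active_snapshot=6)         # reclaims v1; v5 stays visible
assert row.versions == [(9, "v9"), (5, "v5")]
```

Disable the `gc` step and the chain (and storage) grows without bound, which is the first objection above; keep every version and `read` degrades linearly, which is the second.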

There is a very deep literature around MVCC-style databases. The challenges of generalizing and maximally exploiting MVCC as a user-facing feature while keeping performance above the threshold of unusability are thoroughly documented.

replies(1): >>31852336 #
zachmu ◴[] No.31852336[source]
MVCC is not version control, and time travel / historical querying is not version control.

Dolt's unique functionality isn't time travel, although it has that. It's version control, i.e. branch and merge, push and pull, fork and clone. A bunch of database products give you some of this for schema migrations, but Dolt is the only one that does it for table data as well.

replies(1): >>31854987 #
jandrewrogers ◴[] No.31854987[source]
The conceit here is the assumption that this has not been built many times by very clever software engineers. It is not a new idea. True git-like version control systems for managing large volumes of data have been built on MVCC kernels for decades -- branch and merge, push and pull, fork and clone.

There are fundamental computer science and technical issues that make scaling these systems for arbitrary data models extremely difficult. The platforms always had the ambition to be general, but the design tradeoffs required to make them scale require narrowly overfitting to a particular type of data model, such that they can only be used for the original use case. And even then, the performance ends up being not good.

I've never designed one from scratch but I've worked on a few at large companies. All of them started with the vision you are proposing, and all of them failed to achieve it because of the technical tradeoffs required to enable something resembling scalability. Unless you are proposing some novel computer science that renders these issues moot, you aren't presenting a credible defense that this hasn't been done before.

replies(1): >>31855232 #
zachmu ◴[] No.31855232[source]
Git-like version control requires a Merkle DAG. Unless you know something I don't, there are no RDBMS products that incorporate a Merkle DAG for storage. Dolt is the first.

Table data is stored in a cross between a Merkle DAG and a B-tree (a prolly tree), which is what makes diff / merge performant and scalable. We didn't invent these data structures, but we believe we are the first to build a SQL database on them.
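The key trick a prolly tree borrows is content-defined chunking: boundaries are chosen by the content of a small sliding window, not by absolute offsets, so a local edit only changes nearby chunks. A rough Python sketch of just that idea (illustrative only -- the window size and boundary mask here are arbitrary, not Dolt's actual chunking parameters):

```python
import hashlib

def chunk(data: bytes, window: int = 4, mask: int = 0x0F) -> list[bytes]:
    # Declare a chunk boundary wherever the hash of the last `window`
    # bytes matches a pattern, so boundaries depend only on local
    # content, not on position within the file.
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        h = int.from_bytes(hashlib.sha256(data[i - window:i]).digest()[:4], "big")
        if h & mask == 0:                # ~1 in 16 windows ends a chunk
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks

a = bytes(range(256)) * 8
b = b"\xff" + a[1:]                      # edit a single leading byte

ca, cb = chunk(a), chunk(b)
assert b"".join(ca) == a                 # chunking is lossless
# Chunks away from the edit are byte-identical, so a Merkle tree built
# over their hashes shares almost all nodes between the two versions --
# which is what makes structural diff proportional to the size of the
# change rather than the size of the table.
assert sum(1 for c in ca if c not in cb) <= 2
```

With offset-based (fixed-size) chunking, that one-byte insertion would instead shift every subsequent boundary and invalidate every downstream chunk hash.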

https://docs.dolthub.com/architecture/storage-engine/prolly-...

replies(1): >>31855955 #
jandrewrogers ◴[] No.31855955[source]
> Git-like version control requires a Merkle DAG.

This is false; you are conflating the abstract algorithm with one narrow implementation of it. That's like saying the only possible sorting algorithm is quicksort.

With all due respect, you seem to be only loosely familiar with database architecture, both theory and historical practice. Nothing you've described is actually novel. That you are unfamiliar with why no one builds things this way, despite many attempts, does not lend confidence.

I am actually a big fan of people trying unorthodox approaches to databases that have never been tried before, this just isn't such an example. Which doesn't make your approach wrong per se, but it leaves you exposed to learning why other people tried and abandoned it.

Tangentially, the "prolly tree" is intrinsically not a scalable data structure. That may satisfy your design requirements but I can't tell.

replies(3): >>31856467 #>>31871902 #>>31885817 #
1. throwaway892238 ◴[] No.31856467[source]
> you are conflating the abstract algorithm with a narrow implementation

They're literally telling you Git uses a Merkle DAG and they wanted to recreate Git so they used a Merkle DAG. That's not conflating, it's copying.

> you seem to be only loosely familiar with database architecture

Based on what? The only comments GP has made about DBA in this entire HN thread is "this is not MVCC", which is correct.

> I am actually a big fan of people trying unorthodox approaches to databases that have never been tried before

So stop discouraging the neophyte? Sometimes innovation requires the ignorant to figure things out without knowing better, because that way they won't quit before they've begun. Let them figure it out. And if it's not novel, who cares? It doesn't have to be some perfect architectural masterpiece to solve people's problems. If it works well enough just for WordPress blogs, that's pretty great already.

replies(1): >>31861146 #
2. discreteevent ◴[] No.31861146[source]
> They're literally telling you Git uses a Merkle DAG

They literally told them: "Git-like version control requires a Merkle DAG."