334 points gjvc | 7 comments

throwaway892238 No.31849720
This is the future of databases, but nobody seems to realize it yet.

One of the biggest problems with databases (particularly SQL ones) is that they're a giant pile of mutable state. The whole idea of "migrations" exists because it is impossible to "just" revert an arbitrary change to a database, or to diff and merge changes automatically. You need some kind of intelligent tool or framework to generate DDL, DML, and DCL; the scripts have to be applied in order; something has to check whether they've already been applied; and so on. And of course you can't roll back a change once it's been applied, unless you write even more program logic to figure out how to undo it. It's all a big hack.

By treating a database as version-controlled, you can treat any operation as immutable. Make any change you want and don't worry about conflicts. You can always just go back to the last working version, revert a specific change, merge in one or more changes from different working databases. Make a thousand changes a day, and when one breaks, revert it. No snapshotting and slowly restoring the whole database due to a non-reversible change. Somebody dropped the main table in prod? Just revert the drop. Need to make a change to the prod database but the staging database is different? Branch the prod database, make a change, test it, merge back into prod.
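
To make the model concrete, here's a minimal sketch in plain Python (a toy, not any real engine's API; all names here are invented) of a store where every commit is an immutable snapshot, so branching and reverting are just pointer moves:

    import copy

    class VersionedStore:
        """Toy immutable store: every commit appends a new snapshot."""
        def __init__(self):
            self.versions = [{}]          # version 0 is the empty database
            self.branches = {"main": 0}   # branch name -> version index

        def commit(self, branch, updates):
            """Apply updates on top of the branch head; return the new version id."""
            snapshot = copy.deepcopy(self.versions[self.branches[branch]])
            snapshot.update(updates)
            self.versions.append(snapshot)
            self.branches[branch] = len(self.versions) - 1
            return self.branches[branch]

        def branch(self, name, from_branch):
            """Branching copies a pointer, not the data."""
            self.branches[name] = self.branches[from_branch]

        def revert(self, branch, version_id):
            """Reverting never destroys anything: old versions stay in history."""
            self.branches[branch] = version_id

    db = VersionedStore()
    db.commit("main", {"users": ["alice", "bob"]})
    bad = db.commit("main", {"users": []})  # oops: "dropped" the main table
    db.revert("main", bad - 1)              # recover by moving the pointer back

A real engine would share structure between snapshots (deltas, persistent trees) instead of deep-copying, but the user-facing model is the same.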

The effect is going to be as radical as the popularization of containers. Whether you like them or not, they are revolutionizing an industry and are a productivity force multiplier.

411111111111111 No.31849875
> This is the future of databases, but nobody seems to realize it yet

It's a pipe dream, not the future.

Either your database is too big and has too much throughput for this to be practical, or it's small enough that migrations just don't matter. And it's not like you wouldn't need migrations with a versioned schema: otherwise a rollback would mean data loss.

1. throwaway892238 No.31850350
You're thinking in terms of the present, but I'm saying it's the future. At present it doesn't make sense, because nobody has yet made a database which does version control on very big datasets with a lot of throughput. But when somebody does make it, it will be completely obvious that this was something we always needed.

It's all just software. There is essentially no limit to what we can make software do as long as the hardware supports it. And there's no hardware limit I know of that says version-controlled databases can't work. We just need to figure out how they will work, and then make 'em (or try to make 'em and in the process figure it out).

> And it's not like you wouldn't need migrations with a versioned schema, as otherwise a rollback would mean data loss.

When you roll back a code change, you don't lose the code; it's still in history. If you need to revert but keep some code, you branch, copy the needed code into some other part of the app, revert the old change, and merge everything back. If the merge produces a conflict (let's presume losing data is a conflict), the database can prompt you to issue a set of commands to resolve it before merging. You could do all of that in a branch, test it, and merge into prod, where the same operations are replayed. The database does all the heavy lifting, and the user just uses a console the way they use Git today.
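
A sketch of that merge step, again as toy Python: a three-way merge over key-value states where, per the presumption above, any change that would lose data surfaces as a conflict instead of silently winning:

    def three_way_merge(base, ours, theirs):
        """Merge two descendants of base; treat any data loss as a conflict."""
        merged, conflicts = {}, []
        for key in set(base) | set(ours) | set(theirs):
            b, o, t = base.get(key), ours.get(key), theirs.get(key)
            if o == t:                      # both sides agree (or both deleted)
                if o is not None:
                    merged[key] = o
            elif o == b or t == b:          # only one side changed the key
                winner = t if o == b else o
                if winner is None:          # that change is a deletion:
                    conflicts.append((key, o, t))   # losing data is a conflict
                else:
                    merged[key] = winner
            else:                           # both sides changed it differently
                conflicts.append((key, o, t))
        return merged, conflicts

    base   = {"users": ["alice"], "orders": [1, 2]}
    ours   = {"users": ["alice"], "orders": [1, 2, 3]}  # branch added an order
    theirs = {"orders": [1, 2]}                         # prod dropped "users"
    merged, conflicts = three_way_merge(base, ours, theirs)
    print(conflicts)  # [('users', ['alice'], None)] -- prompt the user to resolve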

It's probably going to be necessary to lock the version of the software and the version of the database together, so that both are changed or reverted at the same time. But because this is version control, we could actually serve multiple versions of the same database at once. The database could present two different versions of itself, with the same data COW-overlaid for each version, to two different versions of an application. You could then blue/green deploy both the application and the database, each modifying only its own version. If you need to revert, you can diff and merge changes from one version to the other.
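
The "two versions, shared data" part is roughly what copy-on-write already gives you. A minimal sketch, with Python's ChainMap standing in for a real COW storage layer:

    from collections import ChainMap

    shared = {"users": ["alice", "bob"], "orders": [1, 2]}  # committed data

    v1 = ChainMap({}, shared)  # version served to the blue deployment
    v2 = ChainMap({}, shared)  # version served to the green deployment

    v2["users"] = ["alice", "bob", "carol"]  # write lands in v2's overlay only

    print(v1["users"])  # ['alice', 'bob']          -- v1 is untouched
    print(v2["users"])  # ['alice', 'bob', 'carol'] -- v2 sees its own write
    print(v2.maps[0])   # diffing the versions is just reading the overlay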

2. iamnafets No.31850459
I think the problem is that the tradeoffs already exist: most users would prefer more usable space, or a smaller bill, to a full history of their data.

You might be making the argument that the usability of immutable data just isn't there yet, but there are well-established theoretical costs to maintaining full history, and I don't think they fit within the bounds of many real-world use cases.

3. throwaway892238 No.31850643
If the user doesn't want full history, they could configure the database to expunge it with a lifecycle policy. Though I think keeping deltas of the changes would make any excess storage negligible, since most people never seem to get rid of data anyway.
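
For illustration, such a lifecycle policy could be as simple as folding expired deltas into the base snapshot, so only the last N commits stay revertible (a toy sketch, not any real database's retention API):

    def squash_history(base, deltas, keep_last):
        """Fold deltas older than the retention window into the base snapshot.

        Each delta maps key -> new value, with None meaning "deleted".
        keep_last must be >= 1.
        """
        expired, kept = deltas[:-keep_last], deltas[-keep_last:]
        for delta in expired:
            for key, value in delta.items():
                if value is None:
                    base.pop(key, None)
                else:
                    base[key] = value
        return base, kept

    history = [
        {"users": ["alice"]},          # commit 1
        {"users": ["alice", "bob"]},   # commit 2
        {"orders": [1]},               # commit 3
    ]
    base, history = squash_history({}, history, keep_last=1)
    # base == {'users': ['alice', 'bob']}; only commit 3 is still revertible
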
4. hinkley No.31851388
Do we make databases smarter, or start asking for database-like behavior from version control systems?

I can't help thinking about how much time the git team, JetBrains, Chrome, or the Confluence team have spent on backing-store implementations/migrations and file formats instead of using sqlite (like, why aren't V8 heap and perf snapshots just a sqlite database?), but then many of these tools operate in exactly that gap. So do we keep improving change-over-time functionality in databases, or make VCS backends with more formal database-like behavior?
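
For a taste of the second option, here's a sketch of a VCS-ish backend leaning on sqlite as its storage engine (the two-table schema is invented here for illustration):

    import hashlib, sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE objects (hash TEXT PRIMARY KEY, data BLOB);
        CREATE TABLE commits (id INTEGER PRIMARY KEY, parent INTEGER,
                              root_hash TEXT REFERENCES objects(hash));
    """)

    def put_object(data):
        """Content-addressed storage: identical data is stored exactly once."""
        h = hashlib.sha256(data).hexdigest()
        db.execute("INSERT OR IGNORE INTO objects VALUES (?, ?)", (h, data))
        return h

    def commit(parent, data):
        cur = db.execute("INSERT INTO commits (parent, root_hash) VALUES (?, ?)",
                         (parent, put_object(data)))
        return cur.lastrowid

    c1 = commit(None, b"v1 of the wiki page")
    c2 = commit(c1, b"v2 of the wiki page")
    # History walks, diffs, and garbage collection are now plain SQL queries.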

IIRC, Trac stores its wiki history in a subversion repository. Since it already had to understand commit histories and show diffs, that was a sensible choice. Of course it is easier to live with such a decision if the API is good, but I haven't heard anyone say that about any version control system yet.

5. CPLX No.31851860
As a guy who's been doing technical stuff of one kind or another since the mid-'90s, I would say that any analysis insisting a specific use case has tradeoffs due to lack of memory or processing speed comes with an inevitable expiration date.
6. throwaway892238 No.31853994
Well, they're discrete problems. Version control of source code, packaged applications, container images, and databases are all quite different.

Git is a distributed file manager that operates on files, where every change is a commit, and a commit is a set of operations on files and/or changes to blocks of newline-terminated text. Versions are commits in a merkle DAG, each pointing to a merkle tree of the file contents.

RPM/Deb/etc is a semi-centralized file manager that operates on files assuming each change is a collection of files with executable stages before and after copying/linking/unlinking. Versions are arbitrary key=value pairs which optionally depend on other versions, with extra logic to resolve relative versions.

Docker/OCI is a distributed file manager that operates on layers, assuming every layer is a collection of files overlaid on other layers, with extra logic applied to the layers at runtime. Versions are content-addressed manifests that reference each layer by digest.
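
The common thread in all three is content addressing. Git's object IDs, for example, are just a hash over a typed header plus the bytes, which you can reproduce in a few lines of Python (this matches what git hash-object prints):

    import hashlib

    def git_blob_id(content):
        """Compute the ID git assigns to a file's contents (a 'blob' object)."""
        header = b"blob %d\x00" % len(content)
        return hashlib.sha1(header + content).hexdigest()

    print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
    # Tree and commit objects hash the same way over their children's IDs,
    # which is what makes the whole history a merkle structure.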

The database is going to need a helluva lot of custom heuristics and operations to do version control, because how you use it is so different from the above. Databases are much more complex beasts: they require higher performance, higher reliability, and tons more functionality.

7. hinkley No.31866787
> Well, they're discrete problems.

I'm not convinced they are. Invention is often a case of adopting a solution well known in another discipline, of discovering that two things share a category. We keep discovering things that are isomorphic to each other.

In the end, git is keeping a database of code changes as write-once data entries and presenting the snapshots for human consumption. It does a very bad job of pretending that's not the case.
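
You can treat a repository as exactly that kind of database using nothing but git's own plumbing; a small sketch (assumes it runs inside some repository):

    import subprocess

    def git(*args):
        return subprocess.run(["git", *args], capture_output=True,
                              text=True, check=True).stdout

    # Every commit is a write-once row; the "table scan" is rev-list.
    for commit_id in git("rev-list", "HEAD").split():
        record = git("cat-file", "-p", commit_id)      # the raw, immutable entry
        print(commit_id[:8], record.splitlines()[-1])  # id + last message line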