334 points gjvc | 16 comments
throwaway892238 ◴[] No.31849720[source]
This is the future of databases, but nobody seems to realize it yet.

One of the biggest problems with databases (particularly SQL ones) is that they're a giant pile of mutable state. The whole idea of "migrations" exists because it is impossible to "just" revert an arbitrary change to a database, diff changes automatically, or merge changes automatically. You need some kind of intelligent tool or framework to generate the DDL, DML, and DCL; the statements have to be applied in order; something has to check whether they've already been applied; and so on. And of course you can't roll back a change once it's been applied, unless you write even more program logic to figure out how to do that. It's all a big hack.
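
To make the hack concrete, a typical migration pair plus the bookkeeping table a framework maintains looks roughly like this (a generic sketch; the file, table, and column names are made up):

    -- 001_add_email.up.sql
    ALTER TABLE users ADD COLUMN email VARCHAR(255);

    -- 001_add_email.down.sql  (hand-written "undo"; lossy if the column already holds data)
    ALTER TABLE users DROP COLUMN email;

    -- bookkeeping so the framework knows what has already run
    CREATE TABLE schema_migrations (
        version    VARCHAR(64) PRIMARY KEY,
        applied_at TIMESTAMP NOT NULL
    );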

By treating a database as version-controlled, you can treat any operation as immutable. Make any change you want and don't worry about conflicts. You can always just go back to the last working version, revert a specific change, merge in one or more changes from different working databases. Make a thousand changes a day, and when one breaks, revert it. No snapshotting and slowly restoring the whole database due to a non-reversible change. Somebody dropped the main table in prod? Just revert the drop. Need to make a change to the prod database but the staging database is different? Branch the prod database, make a change, test it, merge back into prod.
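
With a versioned database along the lines of Dolt, that workflow is just SQL (a sketch assuming Dolt's git-style stored procedures; exact flags and signatures may differ, and the table and branch names are invented):

    CALL DOLT_CHECKOUT('-b', 'pricing-experiment');      -- branch prod
    UPDATE products SET price = price * 0.9;              -- make the change
    CALL DOLT_COMMIT('-a', '-m', 'discount experiment');  -- commit it
    CALL DOLT_CHECKOUT('main');
    CALL DOLT_MERGE('pricing-experiment');                 -- merge once it tests clean

    -- somebody dropped a table in prod?
    CALL DOLT_REVERT('HEAD');                               -- revert the offending commit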

The effect is going to be as radical as the popularization of containers. Whether you like them or not, they are revolutionizing an industry and are a productivity force multiplier.

replies(11): >>31849825 #>>31849875 #>>31849951 #>>31850566 #>>31850778 #>>31851109 #>>31851356 #>>31852067 #>>31853553 #>>31858826 #>>31865675 #
1. 411111111111111 ◴[] No.31849875[source]
> This is the future of databases, but nobody seems to realize it yet

It's a pipe dream, not the future.

Your database is either too big or too high-throughput for this to be practical, or it's small enough that migrations just don't matter. And it's not like a versioned schema gets rid of migrations, either: otherwise a rollback would mean data loss.

replies(4): >>31850130 #>>31850350 #>>31850869 #>>31854369 #
2. packetlost ◴[] No.31850130[source]
I think the perceived problem is that relational datasets don't fit well into that model, but in reality it's very possible: look at Datomic. I agree that storage is a concern for high-throughput systems, but the general trend seems to be towards streaming data and retention policies anyway.
3. throwaway892238 ◴[] No.31850350[source]
You're thinking in terms of the present, but I'm saying it's the future. At present it doesn't make sense, because nobody has yet made a database which does version control on very big datasets with a lot of throughput. But when somebody does make it, it will be completely obvious that this was something we always needed.

It's all just software. There is essentially no limit to what we can make software do as long as the hardware supports it. And there's no hardware limit I know of that says version-controlled databases can't work. We just need to figure out how they will work, and then make 'em (or try to make 'em and in the process figure it out).

> And it's not like you wouldn't need migrations with a versioned schema, as otherwise a rollback would mean data loss.

When you roll back a code change, you don't lose code, as it's still in history. If you need to revert but keep some code, you branch the code, copy the needed code into some other part of the app, revert the old change, merge everything. If on merge there is a conflict (let's presume losing data is a conflict), it can prompt you to issue a set of commands to resolve the conflict before merge. You could do all of that in a branch, test it, merge into prod, and on merge it could perform the same operations. The database does all the heavy lifting and the user just uses a console the way they use Git today.

You'd probably have to lock the version of the software and the version of the database together, so that both are changed/reverted at the same time. But because this is version control, we could actually serve multiple versions of the same database at once. The database could present two different versions of itself, with the same underlying data copy-on-write overlaid for each version, alongside two different versions of the application. You could then blue/green deploy both the application and the database, each modifying only its own version. If you need to revert, you diff and merge changes from one version to the other.
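
Dolt-style engines already hint at how that could look: different branches are readable side by side with AS OF (a sketch; the syntax is Dolt's, and the table and branch names are made up):

    -- blue deployment reads main, green reads the candidate branch
    SELECT * FROM orders AS OF 'main';
    SELECT * FROM orders AS OF 'v2-schema';

    -- when green checks out, merge the candidate back
    CALL DOLT_CHECKOUT('main');
    CALL DOLT_MERGE('v2-schema');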

replies(2): >>31850459 #>>31851388 #
4. iamnafets ◴[] No.31850459[source]
I think the problem is that the tradeoffs already exist: most users would prefer more usable space, or a smaller bill, to a full history of their data.

You might be making the argument that the usability of immutable data just isn't there yet, but there are well-established theoretical costs to maintaining full history, and I don't think they fit within the budgets of many real-world use cases.

replies(2): >>31850643 #>>31851860 #
5. throwaway892238 ◴[] No.31850643{3}[source]
If the user doesn't want full history they could configure the database to expunge it with a lifecycle policy, though I think keeping deltas of the changes would make any excess file storage negligible, as most people don't seem to ever get rid of data anyway.
6. zachmu ◴[] No.31850869[source]
You're suffering from a failure of imagination.

Consider a CMS, one of the most common forms of database-backed application. What if you could give your customer a "dev" branch of all their data to make their changes on and test out new content, which you could then merge back to prod after somebody reviews it in a standard PR workflow?

This is the workflow one of our earliest customers built. They run network configuration software, and they use Dolt to implement a PR workflow for all changes their customers make.

More details here:

https://www.dolthub.com/blog/2021-11-19-dolt-nautobot/
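
The shape of that workflow in SQL is roughly the following (a sketch using Dolt's branch/merge procedures and its diff table function; the table, column, and branch names here are invented, not taken from the blog post):

    CALL DOLT_CHECKOUT('-b', 'customer-edit');
    UPDATE devices SET vlan = 120 WHERE site = 'nyc';
    CALL DOLT_COMMIT('-a', '-m', 'move nyc devices to vlan 120');

    -- reviewer inspects the change as data, not as a script
    SELECT * FROM DOLT_DIFF('main', 'customer-edit', 'devices');

    -- approve and merge back to prod
    CALL DOLT_CHECKOUT('main');
    CALL DOLT_MERGE('customer-edit');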

replies(3): >>31851146 #>>31852808 #>>31863592 #
7. Johannesbourg ◴[] No.31851146[source]
Personally, working with timeseries data, my experience is that clients typically underestimate how much storage they need for a single state, let alone historic versions. The decision-makers want more data, not more snapshots, for a given storage spend. But that's timeseries.
replies(1): >>31851629 #
8. hinkley ◴[] No.31851388[source]
Do we make databases smarter, or start asking for database-like behavior from version control systems?

I can't help thinking about how much time the git team, JetBrains, Chrome, or the Confluence team have spent on backing-store implementations, migrations, and file formats instead of just using SQLite (why aren't V8 heap and perf snapshots just a SQLite database?), but then many of these tools operate in exactly that gap. So do we keep improving change-over-time functionality in databases, or build VCS backends with more formal database-like behavior?

IIRC, Trac stores its wiki history in a subversion repository. Since it already had to understand commit histories and show diffs, that was a sensible choice. Of course it is easier to live with such a decision if the API is good, but I haven't heard anyone say that about any version control system yet.
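
For the database side of that question, Dolt-style systems already let you query the history itself in SQL (a sketch; the system table names below are Dolt's, quoted from memory, and the wiki_pages table is hypothetical):

    -- the commit log as a relation
    SELECT commit_hash, committer, message FROM dolt_log;

    -- every historical version of each row, queryable like any other table
    SELECT * FROM dolt_history_wiki_pages WHERE page_id = 42;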

replies(1): >>31853994 #
9. hinkley ◴[] No.31851629{3}[source]
They want more data, but they don't want to pay for it. People want lots of things; that doesn't mean they get it or deserve it.

I can't recall which it was, but one of the timeseries databases was bragging about the fact that in certain situations scanning a block of data is as cheap as trying to add finer-grained indexes to it, especially with ad hoc queries. They did a bunch of benchmarks showing that block scanning with compression and parallelism was workable.

And while compression typically leads to write amplification (or very poor compression ratios), in a timeseries database, or a regular database architected in a timeseries-like fashion, modifying old data is deeply frowned upon (in fact I've heard people argue for quasi-timeseries behavior precisely because modifying old records is so punishing, especially as the application scales). So as long as you can decide not to compress some pages, namely the new ones, this is not a problem.

10. CPLX ◴[] No.31851860{3}[source]
As a guy who's been doing technical stuff of one kind or another since the mid-'90s, I would say that any analysis insisting a specific use case is ruled out by tradeoffs in memory or processing speed has an inevitable expiration date.
11. password4321 ◴[] No.31852808[source]
See also: https://versionpress.com (not being actively developed)
replies(1): >>31854202 #
12. throwaway892238 ◴[] No.31853994{3}[source]
Well, they're discrete problems. Version control of source code, packaged applications, container images, and databases are all quite different.

Git is a distributed file manager where every change is a commit; a commit is a snapshot of a tree of files (diffs between newline-terminated text blocks are derived from the snapshots, not stored). Versions form a Merkle DAG of commits, each pointing at a content-addressed tree of files.

RPM/Deb/etc. is a semi-centralized file manager that operates on files, assuming each change is a collection of files with executable stages before and after copying/linking/unlinking. Versions are arbitrary name=version pairs which optionally depend on other versions, with extra logic to resolve relative versions.

Docker/OCI is a distributed file manager that operates on layers assuming every layer is a collection of files overlaid on other layers, with extra logic to do extra things with the layers at runtime. Versions are (I think?) merkle trees of layers.

The database is going to need a helluva lot of custom heuristics and operations to do version control, because how you use it is so different from the above. Databases are much more complex beasts: they require higher performance, higher reliability, and tons more functionality.

replies(1): >>31866787 #
13. zachmu ◴[] No.31854202{3}[source]
If we were less busy we would adopt versionpress and port it to dolt, such a cool product. Some day.
14. whazor ◴[] No.31854369[source]
With big data, such a model is even more promising. One of the big problems is that people keep copying data sets, which does not scale. Just syncing newer versions is much more efficient; look at delta.io.
15. lazzlazzlazz ◴[] No.31863592[source]
This is Hacker News. The norm is a deep, thoughtful understanding of the state of the art and a total inability to apply the imagination to the future.
16. hinkley ◴[] No.31866787{4}[source]
> Well, they're discrete problems.

I'm not convinced they are. Invention is often a case of adopting a solution well known in another discipline, of discovering that two things share a category. We keep finding things that are isomorphic to each other.

In the end, git is keeping a database of code changes as write-once data entries and presenting the snapshots for human consumption. It does a very bad job of pretending that's not the case.