
334 points by gjvc | 2 comments
throwaway892238:
This is the future of databases, but nobody seems to realize it yet.

One of the biggest problems with databases (particularly SQL ones) is that they're a giant pile of mutable state. The whole idea of "migrations" exists because it's impossible to "just" revert an arbitrary change to a database, or to diff and merge changes automatically. You need some kind of intelligent tool or framework to generate the DDL, DML, and DCL; the statements have to be applied in order; something has to check whether they've already been applied; and so on. And of course you can't roll back a change once it's been applied, unless you write even more program logic to figure out how to do that. It's all a big hack.
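To see the hack in miniature, here's a toy sketch (Python plus sqlite3; table and migration names made up) of the machinery every migration framework ends up reimplementing:

    import sqlite3

    MIGRATIONS = [
        # (id, up, down) -- every "down" is hand-written, and some changes
        # have no lossless down at all.
        ("001_create_users",
         "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
         "DROP TABLE users"),
        ("002_add_email",
         "ALTER TABLE users ADD COLUMN email TEXT",
         None),  # irreversible without copying the table's data elsewhere
    ]

    def migrate(db: sqlite3.Connection) -> None:
        # The tracking table exists only to answer "has this already run?"
        db.execute("CREATE TABLE IF NOT EXISTS schema_migrations (id TEXT PRIMARY KEY)")
        applied = {row[0] for row in db.execute("SELECT id FROM schema_migrations")}
        for mig_id, up, _down in MIGRATIONS:
            if mig_id not in applied:
                db.execute(up)
                db.execute("INSERT INTO schema_migrations VALUES (?)", (mig_id,))
        db.commit()

    migrate(sqlite3.connect(":memory:"))

All of that bookkeeping exists only because the database itself has no notion of history.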

By treating a database as version-controlled, you can treat every operation as immutable. Make any change you want and don't worry about conflicts. You can always go back to the last working version, revert a specific change, or merge in changes from different working databases. Make a thousand changes a day, and when one breaks, revert it. No snapshotting and slowly restoring the whole database because of a non-reversible change. Somebody dropped the main table in prod? Just revert the drop. Need to make a change to the prod database but the staging database is different? Branch the prod database, make the change, test it, merge back into prod.
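With a versioned database like Dolt, the "just revert the drop" scenario really is about one statement. A rough sketch, assuming a Dolt SQL server on localhost and the mysql-connector-python driver; the DOLT_* stored procedures are from Dolt's docs, but treat the exact flags as illustrative:

    import mysql.connector

    conn = mysql.connector.connect(host="127.0.0.1", user="root", database="prod")
    cur = conn.cursor()

    def run(sql: str) -> None:
        cur.execute(sql)
        if cur.with_rows:
            cur.fetchall()  # Dolt procedures return a status row; consume it

    run("CALL DOLT_COMMIT('-A', '-m', 'known-good state')")  # version the current state
    run("DROP TABLE orders")                                 # the disaster
    run("CALL DOLT_COMMIT('-A', '-m', 'oops')")
    run("CALL DOLT_REVERT('HEAD')")                          # un-drops the table, history intact

The revert is itself a new commit, so even the mistake stays in the history.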

The effect is going to be as radical as the popularization of containers. Whether you like them or not, they are revolutionizing an industry and are a productivity force multiplier.

411111111111111:
> This is the future of databases, but nobody seems to realize it yet

It's a pipe dream, not the future.

Either your database is too big or has too much throughput for this to work, or it's small enough that migrations just don't matter anyway. And it's not as if a versioned schema would free you from migrations: without them, a rollback would mean data loss.
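To make the data-loss point concrete, a toy sketch (Python plus sqlite3; DROP COLUMN needs SQLite 3.35+): rolling the schema back to v1 takes everything written under v2 with it.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")      # schema v1

    db.execute("ALTER TABLE users ADD COLUMN email TEXT")                     # schema v2
    db.execute("INSERT INTO users VALUES (1, 'alice', 'alice@example.com')")  # written under v2

    db.execute("ALTER TABLE users DROP COLUMN email")                         # "revert to v1"
    print(db.execute("SELECT * FROM users").fetchall())                       # [(1, 'alice')] -- email gone

So you'd still be writing migrations to carry the data across, version control or not.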

zachmu:
You're suffering from a failure of imagination.

Consider a CMS, one of the most common kinds of database-backed applications. What if you could give your customer a "dev" branch of all their data to make their changes and test new content on, and then merge it back to prod after somebody reviews it in a standard PR workflow?
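Concretely, the flow looks something like this. A sketch against a Dolt SQL server from Python (mysql-connector-python; the DOLT_* procedures and the DOLT_DIFF table function are per Dolt's docs, table names made up):

    import mysql.connector

    conn = mysql.connector.connect(host="127.0.0.1", user="root", database="cms")
    cur = conn.cursor()

    def run(sql: str) -> None:
        cur.execute(sql)
        if cur.with_rows:
            cur.fetchall()  # consume procedure status rows

    run("CALL DOLT_CHECKOUT('-b', 'dev')")                      # customer works on a branch
    run("UPDATE articles SET body = 'new copy' WHERE id = 7")
    run("CALL DOLT_COMMIT('-a', '-m', 'draft content for review')")

    # The "PR": a reviewer reads the row-level diff between branches...
    cur.execute("SELECT * FROM DOLT_DIFF('main', 'dev', 'articles')")
    for row in cur.fetchall():
        print(row)

    # ...and on approval it merges back to prod.
    run("CALL DOLT_CHECKOUT('main')")
    run("CALL DOLT_MERGE('dev')")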

This is the workflow one of our earliest customers built. They run network configuration software, and they use Dolt to implement a PR workflow for all changes their customers make.

More details here:

https://www.dolthub.com/blog/2021-11-19-dolt-nautobot/

Johannesbourg:
Personally, working with timeseries data, my experience is that clients typically underestimate how much storage they need for a single state, let alone for historic versions. The decision-makers want more data, not more snapshots, for a given storage spend. But that's timeseries.
hinkley:
They want more data, but they don't want to pay for it. People want lots of things; that doesn't mean they get them or deserve them.

I can't recall which one it was, but one of the timeseries databases bragged about the fact that in certain situations scanning a block of data is as cheap as adding finer-grained indexes to it, especially for ad hoc queries. They ran a bunch of benchmarks showing that block scanning with compression and parallelism was workable.
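For flavor, a back-of-envelope sketch of that approach (Python stdlib only; block sizes and the aggregate are made up): answer an ad hoc query by decompressing and scanning blocks in parallel, with no index at all. zlib releases the GIL while it works, so plain threads are enough here.

    import array
    import zlib
    from concurrent.futures import ThreadPoolExecutor

    def make_block(n: int) -> bytes:
        # A sealed block: n float64 values, compressed once.
        return zlib.compress(array.array("d", range(n)).tobytes())

    blocks = [make_block(100_000) for _ in range(64)]

    def scan_max(block: bytes) -> float:
        vals = array.array("d")
        vals.frombytes(zlib.decompress(block))
        return max(vals)  # stand-in for an arbitrary ad hoc predicate/aggregate

    with ThreadPoolExecutor() as pool:
        print(max(pool.map(scan_max, blocks)))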

And while compression typically leads to write amplification (or very poor compression ratios), modifying old data is deeply frowned upon in a timeseries database, or in a regular database architected in a timeseries-like fashion. (In fact, I've heard people argue for quasi-timeseries behavior precisely because modifying old records is so punishing, especially as an application scales.) So as long as you can choose not to compress some pages, namely the new ones, this is not a problem.
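A minimal sketch of that "only compress closed pages" idea (Python stdlib; all names hypothetical): appends go to a raw tail block, and a block is compressed exactly once when it seals, so old data is never rewritten and compression adds nothing to the write path.

    import array
    import zlib

    class ChunkStore:
        def __init__(self, block_size: int = 4096):
            self.block_size = block_size
            self.sealed: list[bytes] = []    # immutable, compressed history
            self.active = array.array("d")   # uncompressed tail, cheap to append to

        def append(self, value: float) -> None:
            self.active.append(value)
            if len(self.active) >= self.block_size:
                self.sealed.append(zlib.compress(self.active.tobytes()))  # seal once
                self.active = array.array("d")  # fresh page; old pages never touched

        def scan(self):
            # Queries decompress on the fly; only the tail is scanned raw.
            for block in self.sealed:
                vals = array.array("d")
                vals.frombytes(zlib.decompress(block))
                yield from vals
            yield from self.active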