
752 points crazypython | 18 comments
1. crazygringo ◴[] No.26371552[source]
This is absolutely fascinating, conceptually.

However, I'm struggling to figure out a real-world use case for this. I'd love if anyone here can enlighten me.

I don't see how it can be for production databases involving lots of users, because while it seems appealing as a way to upgrade and then roll back, you'd lose all the new data inserted in the meantime. When you roll back, you generally want to roll back changes to the schema (e.g. delete the added column) but not remove all the rows that were inserted/deleted/updated in the meantime.

So does it handle use cases that are more like SQLite? E.g. where application preferences, or even a saved file, winds up containing its entire history, so you can rewind? Although that's really more of a temporal database -- you don't need git operations like branching. And you really just need to track row-level changes, not table schema modifications etc. The git model seems like way overkill.

Git is built for the use case of lots of different people working on different parts of a codebase and then integrating their changes, and saving the history of it. But I'm not sure I've ever come across a use case for lots of different people working on the data and schema in different parts of a database and then integrating their data and schema changes. In any kind of shared-dataset scenario I've seen, the schema is tightly locked down, and there's strict business logic around who can update what and how -- otherwise it would be chaos.

So I feel like I'm missing something. What is this actually intended for?

I wish the site explained why they built it -- was it just "because we can", or did projects or teams actually have a need for git for data?

replies(6): >>26371614 #>>26371700 #>>26371748 #>>26371803 #>>26371969 #>>26372126 #
2. jorgemf ◴[] No.26371614[source]
Machine learning. I don't think it has many more use cases.
replies(1): >>26371663 #
3. sixdimensional ◴[] No.26371663[source]
Or more simply put, how about table-driven logic in general? It doesn't have to be as complex as machine learning. There are more use cases than just machine learning, IMHO.
replies(1): >>26371675 #
4. jedberg ◴[] No.26371675{3}[source]
Such as? I'm having difficulty coming up with any myself.
replies(2): >>26371768 #>>26371775 #
5. zachmu ◴[] No.26371700[source]
The application backing use case is best suited for when you have parts of your database that get updated periodically and need human review. So you have a production database that you serve to your customers. Then you have a branch / fork of that (dev) that your development team adds batches of products to. Once a week you do a data release: submit a PR from dev -> prod, have somebody review all the new copy, and merge it once you're happy. If there's a big mistake, just back it out again. We have several paying customers building products around this workflow.
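
Concretely, the weekly release looks something like this from the dolt command line (branch and table names here are just illustrative):

    # dev branch: load the week's batch of products
    dolt checkout dev
    dolt sql -q "INSERT INTO products (sku, name, price) VALUES ('A-100', 'Widget', 9.99)"
    dolt add products
    dolt commit -m "Add this week's product batch"
    dolt push origin dev

    # after review, merge the PR into prod; if something is wrong later,
    # the merge commit can be backed out like any other commit
    dolt checkout prod
    dolt merge dev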

As for lots of people collaborating on data together, we have started a data bounties program where we pay volunteers to assemble large datasets. Two have completed so far, and a third is in progress. For the first one, we paid $25k to assemble precinct-level voting data for the 2016 and 2020 presidential elections. For the second, we paid $10k to get procedure prices for US hospitals. You can read about them here:

https://www.dolthub.com/blog/2021-02-15-election-bounty-revi...

https://www.dolthub.com/blog/2021-03-03-hpt-bounty-review/

What's cool is that novices can make a really good income from data entry as a side gig, and it's two orders of magnitude cheaper than hiring a firm to build data sets for you.

You're right that the site is kind of vague about what dolt is "for." It's a really general, very multi-purpose tool that we think will get used a lot of places. Here's a blog we wrote a while back about some of the use cases we envision.

https://www.dolthub.com/blog/2020-03-30-dolt-use-cases/

replies(1): >>26377156 #
6. sixdimensional ◴[] No.26371748[source]
I am not associated with Dolt, but I really like the idea of Dolt personally. I do see use cases, but not without challenges.

One of the main use cases you can see them targeting, and one that I think makes a ton of sense, is providing tools for collecting, maintaining, and publishing reference data sets using crowdsourcing.

For example, they are doing this with hospital charge codes (a.k.a. chargemaster data). Hospitals in the US are required to publish this data for transparency. However, I have never seen a single aggregated national (or international) data set of all these charges. In fact, such a data set could be worth a lot of money to a lot of organizations for many reasons. I used to work in health insurance, gathering data from all kinds of sources (government rules/regs, etc.), and it was a lot of hard work: scraping, structuring, maintaining, etc.

This reference data can be used for analytics, to power table-driven business logic, and for machine learning - to help identify cost inequalities, efficiencies, maybe even illicit price gouging, etc. There are so many reference data sets with similar characteristics... and "data marketplaces" in a way are targeted at making "private" reference data sets available for sale - so then where is the "open" data marketplace? Well, here you go: Dolt.

I have often realized that the more ways we can make things collaborative, the better off we will be.

Data is one of those things where coming up with common, public reference datasets is difficult and there are lots of different perspectives ("branches"); sometimes your data set is missing something and it would be cool if someone could propose it ("pull request"); sometimes you want to compare the old and new versions of a dataset ("diff") to see what is different.

One difficult thing about Dolt is that it will only be successful if people are actually willing to work together to cook up and maintain common data sets collaboratively, or if those doing so have an incentive to manage an "open data" project on Dolt as benevolent maintainers, for example. But then it has the same challenges as "open source" in general, so it is not really that different.

Dolt could even be used as a foundation for a master data management registry - in the sense that you could pop it in as a "communal data fountain", if you will, where anybody in your org, or on the web, etc. could contribute - and you can have benevolent maintainers look over the general quality of the data. Dolt would be missing the data quality/fuzzy matching aspect that master data tools offer, but this is a start for sure.

For example, I work in a giant corporation right now. Each department prefers to maintain its own domain data and in some cases duplicates common data in that domain. Imagine using Dolt to let all these different domains collaborate on a single Dolt data set of common data - now people share one copy and use pull requests, etc. to have an orderly debate about what that common schema and data set should look like.

I think it's an idea that is very timely.

P.S. Dolt maintainers, if you read this and want to talk, I'm game! Awesome work :)

replies(3): >>26371790 #>>26372652 #>>26386687 #
7. zachmu ◴[] No.26371768{4}[source]
Say, network configuration.
8. sixdimensional ◴[] No.26371775{4}[source]
See my post earlier in this thread [1].

Yes, you need reference data for machine learning, but the world isn't only about machine learning. You might want reference data for human-interpreted analytics, table-driven logic (business rule engines, for example), etc.

[1] https://news.ycombinator.com/item?id=26371748

9. zachmu ◴[] No.26371790[source]
We totally agree on all points! Come chat with us on our discord about how we can help your org solve its problems:

https://discord.com/invite/RFwfYpu

10. curryst ◴[] No.26371803[source]
I might look at it for work. Compliance requires us to keep track of who made what change when, and who approved it, in case regulators need it.

Right now, this often means making an MR in Git with your snippet of SQL, getting it approved, and then manually executing it. This would let us bring the apply stage in as well, and avoid "that SQL query didn't do exactly what I expected" issues.

It's possible to do this within the SQL engine, but then I have to maintain that, which I would prefer not to do, and deal with the performance implications.
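
For the audit trail, my understanding is that Dolt exposes commit history as regular SQL tables, so "who changed what, and when" becomes a query (the table and column names below are from my reading of the docs, so double-check them):

    # commit log: who committed what and when
    dolt sql -q "SELECT commit_hash, committer, date, message FROM dolt_log"

    # row-level history for a table named 'rates' (the dolt_history_<table>
    # pattern is per-table; 'rates' is just an example name)
    dolt sql -q "SELECT commit_hash, committer, commit_date, rate_id, amount
                 FROM dolt_history_rates ORDER BY commit_date DESC"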

replies(2): >>26371892 #>>26372927 #
11. zachmu ◴[] No.26371892[source]
That's a great use case, let us know how we can help. We have a discord if you want to chat about it:

https://discord.com/invite/RFwfYpu

12. saulrh ◴[] No.26371969[source]
Here's one thing I'd have used it for: Video game assets.

Say you have a tabletop game engine for designing starships. Different settings have different lists of parts. Some settings are run by a game's DM, some are collaborative efforts. I ended up saving the lists of parts in huge JSON files and dumping those into git. However, for much the same reason that data science is often done in a REPL or notebook type interface, it turned out that by far the most efficient way for people to iterate on assets was to boot up the game, fiddle with the parts in-engine until things looked right, then replicate their changes back into the JSON. With this, we could just save the asset database directly.

The same reasoning should hold for effectively any dataset that a) can be factored into encapsulated parts, b) isn't natively linear, and c) needs multiple developers. Game assets are one example, as described above. Other datasets where this holds: ML training/testing sets, dictionaries, spreadsheets, catalogs, datasets for bio papers.
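
To make that concrete, the parts list could just be a table, and the "replicate changes back into JSON" step disappears: the engine writes to the database, and you commit and diff the rows themselves (the schema here is invented for illustration):

    dolt sql -q "CREATE TABLE parts (id INT PRIMARY KEY, name VARCHAR(64), mass FLOAT, thrust FLOAT)"
    # ...boot the game, fiddle with parts in-engine until things look right...
    dolt diff parts                       # see exactly which rows changed
    dolt add parts
    dolt commit -m "Rebalance engine thrust values"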

13. fiedzia ◴[] No.26372126[source]
This won't work for the usual database use cases. It is meant for interactive work with data, the same way you work with code. Who needs that?

Data scientists working with large datasets. You want to be able to update data without redownloading everything, and also to make your local changes (some data cleaning) and propose your updates upstream the same way you would with git. Having many people working interactively with data is common here.

One of the companies I work with provided a set of data distributed to their partners on a daily basis. Once it grew larger, downloading everything daily became an issue, so this would be desirable.

I have a large data model that I need to deploy to production and update once in a while. For code, network usage is kept to a minimum because we have git. For data, the options are limited.
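
The mechanics would presumably mirror git, e.g. (remote, dataset, and table names here are made up):

    # initial copy, then fetch only what changed instead of redownloading everything
    dolt clone some-org/reference-data
    cd reference-data
    dolt pull

    # local cleaning, proposed back upstream on a branch
    dolt checkout -b fix-country-codes
    dolt sql -q "UPDATE places SET country = 'DE' WHERE country = 'Germany'"
    dolt add places
    dolt commit -m "Normalize country codes"
    dolt push origin fix-country-codes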

As with git, it is something that, once you have it, you will find a lot of use cases that make life easier and open many new doors.

14. a-dub ◴[] No.26372652[source]
agree 100% this looks awesome for reference datasets.

not so sure how well it would work for live data sources that update over time, as it could encourage people to apply more ad-hoc edits instead of getting their version-controlled jobs to work 100%, but who knows, maybe that would be a net win in some cases?

15. martincolorado ◴[] No.26372927[source]
I see a use case for public resource data sets: for example, right of way, land use variances, permits, and land records. These are fundamentally public data sets that are often maintained by entities with limited budgets, and the potential (even incentive) for fraud is substantial. Also, there is significant value in being able to access the data set for analytical purposes such as real estate analysis.
16. ◴[] No.26377156[source]
17. LukeEF ◴[] No.26386687[source]
Your final example is very much like the data mesh ideas coming out of ThoughtWorks and elsewhere (https://martinfowler.com/articles/data-mesh-principles.html): data products owned by data producers, and common data models ONLY when required. It is as much an organizational model as a technology, but I don't really think this maps to SQL tables. You probably want to look at a versioned knowledge graph of some kind. The downside is no SQL; the upside is flexibility and speed. (Disclaimer: I work over at TerminusDB, the graph version of Dolt.)
replies(1): >>26389553 #
18. sixdimensional ◴[] No.26389553{3}[source]
You're absolutely right that what I'm saying is like data mesh.

The data mesh ideas aren't necessarily new - the idea that data is managed/owned by what some would call "federated teams" (optimistic view) and others would call "silos" (pessimistic view) has been around for a long time, especially in certain industries.

Certainly, the current energy and focus on building collaborative organizational service models, processes, standards, etc., and standardizing this around the concept of data mesh, is "fresh" and very much top of mind for many people.

I worked at a company that had built a data federation engine, and we dabbled with the concepts of "data fabric", "mesh", "network", etc., but we were talking more about tech/implementation than about architecture and organizational structure. Still, we thought the right tech could enable this architecture/organizational structure, and that it wasn't something totally figured out.

Thanks for sharing and mentioning it, and also Terminus. Certainly, I agree that a knowledge graph is another approach, and one that can also fit very well with the idea of linking data that is both common and specific to domains.

However, I've rarely seen graph databases that have the kind of versioning/time travel that a temporal database (such as a relational one) has built in natively, or anything like "diff".

I don't agree that this can't map to SQL tables: tables and relations (tuples) are just a form of data. People already do collaborative data management using relational master data management systems, and have been for many, many years.

That is not to say that graphs aren't a good match for storing common/reference datasets or linking data together. Given that graphs can be considered a superset of the relational model, I don't see any reason you can't do the same thing in graph databases - just that doing diffs and enabling the same kind of workflow might be more difficult in a graph data model, but definitely not impossible.

Lest we forget, things like "Linked Data" [1], triple stores, etc. have also been trying to get at linking data across the entire Internet. However, I never saw collaborative data management as part of that vision; it was mostly for querying.

[1] https://en.wikipedia.org/wiki/Linked_data