Dolt is Git for Data: a SQL database that you can fork, clone, branch, merge

(github.com)

752 points crazypython | 1 comments | 06 Mar 21 21:15 UTC

crazygringo ◴[06 Mar 21 23:35 UTC] No.26371552[source]▶

This is absolutely fascinating, conceptually.

However, I'm struggling to figure out a real-world use case for this. I'd love it if anyone here could enlighten me.

I don't see how it can be for production databases involving lots of users, because while it seems appealing as a way to upgrade and then roll back, you'd lose all the new data inserted in the meantime. When you roll back, you generally want to roll back changes to the schema (e.g. delete the added column) but not remove all the rows that were inserted/deleted/updated in the meantime.

So does it handle use cases that are more like SQLite? E.g. where application preferences, or even a saved file, winds up containing its entire history, so you can rewind? Although that's really more of a temporal database -- you don't need git operations like branching. And you really just need to track row-level changes, not table schema modifications etc. The git model seems like way overkill.

Git is built for the use case of lots of different people working on different parts of a codebase and then integrating their changes, and saving the history of it. But I'm not sure I've ever come across a use case for lots of different people working on the data and schema in different parts of a database and then integrating their data and schema changes. In any kind of shared-dataset scenario I've seen, the schema is tightly locked down, and there's strict business logic around who can update what and how -- otherwise it would be chaos.

So I feel like I'm missing something. What is this actually intended for?

I wish the site explained why they built it -- if it was just "because we can" or if projects or teams actually had the need for git for data?

replies(6): >>26371614 #>>26371700 #>>26371748 #>>26371803 #>>26371969 #>>26372126 #

sixdimensional ◴[07 Mar 21 00:02 UTC] No.26371748[source]▶

>>26371552 #

I am not associated with Dolt, but I really like the idea of Dolt personally. I do see use cases, but not without challenges.

One of the main use cases you can see them targeting, and that I think makes a ton of sense, is providing tools for collecting, maintaining and publishing reference data sets using crowd sourcing.

For example, they are doing this with hospital charge codes (a.k.a. chargemaster data). Hospitals in the US are required to publish this data for transparency. However, I have never seen a single aggregated national (or international) data set of all these charges. In fact, such a data set could be worth a lot of money to a lot of organizations for so many reasons. I used to work in health insurance, gathering data from all kinds of sources (government rules/regs, etc.) and it was a lot of hard work: scraping, structuring, maintaining, etc.

This reference data can be used for analytics, to power table-driven business logic, machine learning - to help identify cost inequalities, efficiencies, maybe even illicit price gouging, etc. There are so many reference data sets that have similar characteristics... and "data marketplaces" in a way are targeted at making "private" reference data sets available for sale - so then where is the "open" data marketplace? Well, here you go.. Dolt.

I have often realized that the more ways we can make things collaborative, the better off we will be.

Data is one of those things where coming up with common, public reference datasets is difficult, and there are lots of different perspectives ("branches"). Sometimes your data set is missing something and it would be cool if someone could propose it ("pull request"); sometimes you want to compare the old and new versions of a dataset ("diff") to see what is different.
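To make the "diff" idea concrete, here is a minimal sketch of a row-level diff between two versions of a table, keyed by primary key. This is not how Dolt implements it; the function name `diff_tables` and the sample charge codes and prices are made up purely for illustration.

```python
from typing import Any, Dict, Hashable

Row = Dict[str, Any]
Table = Dict[Hashable, Row]  # primary key -> row


def diff_tables(old: Table, new: Table) -> Dict[str, Dict[Hashable, Any]]:
    """Compare two versions of a table and report row-level changes."""
    # Rows whose key exists only in the new version.
    added = {k: new[k] for k in new.keys() - old.keys()}
    # Rows whose key exists only in the old version.
    removed = {k: old[k] for k in old.keys() - new.keys()}
    # Rows present in both versions but with different contents.
    modified = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "modified": modified}


# Hypothetical sample data: two versions of a small charge-code table.
old_version = {
    "A100": {"description": "X-ray", "price": 120},
    "A200": {"description": "MRI", "price": 900},
}
new_version = {
    "A100": {"description": "X-ray", "price": 150},
    "A300": {"description": "CT scan", "price": 600},
}

changes = diff_tables(old_version, new_version)
```

Here `changes["modified"]` holds the old and new rows for `A100`, `changes["added"]` holds `A300`, and `changes["removed"]` holds `A200`. A "pull request" on data would then amount to reviewing a change set like this before accepting it.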

One difficult thing about Dolt is, it will only be successful if people are actually willing to work together to cook up and maintain common data sets collaboratively, or if those doing so have an incentive to manage an "open data" project on Dolt as benevolent maintainers, for example. But, I could say then it has the same challenges as "open source" in general, so therefore it is not really that different.

Dolt could even be used as a foundation for a master data management registry - in the sense of you could pop it in as a "communal data fountain" if you will where anybody in your org, or on the Web, etc. could contribute - and you can have benevolent maintainers look over the general quality of the data. Dolt would be missing the data quality/fuzzy matching aspect that master data tools offer, but this is a start for sure.

For example, I work in a giant corporation right now. Each department prefers to maintain its own domain data and in some cases duplicates common data in that domain. Imagine using Dolt to make it possible for all these different domains to collaborate on a single copy of common data in a "Dolt" data set - now people can share data on a single copy and use pull requests, etc. to have an orderly debate on what that common data schema and data set should look like.

I think it's an idea that is very timely.

P.S. Dolt maintainers, if you read this and want to talk, I'm game! Awesome work :)

replies(3): >>26371790 #>>26372652 #>>26386687 #

LukeEF ◴[08 Mar 21 15:25 UTC] No.26386687[source]▶

>>26371748 #

Your final example is very much like the data mesh ideas coming out of ThoughtWorks and elsewhere (https://martinfowler.com/articles/data-mesh-principles.html): data products being owned by data producers, and common data models ONLY when required. It is as much an organizational model as a technology, but I don't really think this maps to SQL tables. You probably want to look at a versioned knowledge graph of some kind. Downside is no SQL, upside is flexibility and speed. (disclaimer - I work over at TerminusDB, the graph version of Dolt)

replies(1): >>26389553 #

1. sixdimensional ◴[08 Mar 21 18:31 UTC] No.26389553[source]▶

>>26386687 #

You're absolutely right that what I'm saying is like data mesh.

The data mesh ideas aren't necessarily new. Data being managed/owned by what some would call "federated teams" (optimistic view) and others might call "silos" (pessimistic view) isn't really new, especially in certain industries.

Certainly, the current energy and focus on building collaborative organizational service models, processes, standards, etc. and standardizing this around the concept of data mesh is "fresh" and very much at the top of many people's minds.

I worked at a company that had built a tool which was a data federation engine, and we dabbled with the concepts of "data fabric", "mesh", "network" etc. but we were more talking about tech/implementation vs. architecture and organizational structure. Still, we thought the right tech could enable this architecture/organizational structure and that it wasn't something totally figured out.

Thanks for sharing and mentioning it, and also Terminus. Certainly, I agree that knowledge graph is another approach and one that can also fit very well with the idea of linking data that is both common and specific to domains.

However, I've rarely seen graph databases with the kind of versioning/time travel that temporal databases (such as relational ones) have built in natively (or anything like "diff").

I don't agree that this can't map to SQL tables. Tables and relations (tuples) are just a form of data. People already do collaborative data management using relational-based master data management systems today, and have been for many, many years.

That is not to say that graphs aren't a good match for storing common/reference datasets or linking data together. Given that graphs can be considered a superset of the relational model, I don't see any reason that you can't do the same thing in graph databases, just that doing diffs and enabling the same kind of workflow might be more difficult in a graph data model - but definitely not impossible.
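As a rough illustration of why it is not impossible: if a graph is represented as a set of (subject, predicate, object) triples, a naive diff is just two set differences. This is only a sketch under that assumption; the name `diff_graph` and the sample triples are hypothetical, and it ignores the genuinely harder parts such as node identity and schema changes.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def diff_graph(old: Set[Triple], new: Set[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return (added, removed) triples between two versions of a graph."""
    added = new - old      # triples only in the new version
    removed = old - new    # triples only in the old version
    return added, removed


# Hypothetical sample data: two versions of a tiny graph.
old_graph = {
    ("hospital:1", "offers", "procedure:xray"),
    ("procedure:xray", "price", "120"),
}
new_graph = {
    ("hospital:1", "offers", "procedure:xray"),
    ("procedure:xray", "price", "150"),
}

added, removed = diff_graph(old_graph, new_graph)
```

A changed value shows up as one removed triple plus one added triple, which is where the workflow starts to get less convenient than a keyed row diff.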

Let's not forget things like "Linked Data" [1], triple stores, etc., which have also been trying to get at linked data across the entire Internet. However, I never saw collaborative data management as part of that vision; it was mostly for querying.

[1] https://en.wikipedia.org/wiki/Linked_data
