Dolt is Git for Data: a SQL database that you can fork, clone, branch, merge

(github.com)

752 points crazypython | 1 comments | 06 Mar 21 21:15 UTC | HN request time: 0.193s | source

Show context

crazygringo ◴[06 Mar 21 23:35 UTC] No.26371552[source]▶

This is absolutely fascinating, conceptually.

However, I'm struggling to figure out a real-world use case for this. I'd love if anyone here can enlighten me.

I don't see how it can be for production databases involving lots of users, because while it seems appealing as a way to upgrade and then roll back, you'd lose all the new data inserted in the meantime. When you roll back, you generally want to roll back changes to the schema (e.g. delete the added column) but not remove all the rows that were inserted/deleted/updated in the meantime.

So does it handle use cases that are more like SQLite? E.g. where application preferences, or even a saved file, winds up containing its entire history, so you can rewind? Although that's really more of a temporal database -- you don't need git operations like branching. And you really just need to track row-level changes, not table schema modifications etc. The git model seems like way overkill.

Git is built for the use case of lots of different people working on different parts of a codebase and then integrating their changes, and saving the history of it. But I'm not sure I've ever come across a use case for lots of different people working on the data and schema in different parts of a database and then integrating their data and schema changes. In any kind of shared-dataset scenario I've seen, the schema is tightly locked down, and there's strict business logic around who can update what and how -- otherwise it would be chaos.

So I feel like I'm missing something. What is this actually intended for?

I wish the site explained why they built it -- if it was just "because we can" or if projects or teams actually had the need for git for data?

replies(6): >>26371614 #>>26371700 #>>26371748 #>>26371803 #>>26371969 #>>26372126 #

sixdimensional ◴[07 Mar 21 00:02 UTC] No.26371748[source]▶

>>26371552 #

I am not associated to Dolt, but I really like the idea of Dolt personally. I do see use cases, but not without challenges.

One of the main use cases you can see them targeting, and that I think makes a ton of sense, is providing tools for collecting, maintaining and publishing reference data sets using crowd sourcing.

For example, they are doing this with hospital charge codes (a.k.a. chargemaster data). Hospitals in the US are required to publish this data for transparency.. however, I have never seen a single aggregated national (or international) data set of all these charges. In fact, such a data set could be worth a lot of money to a lot of organizations for so many reasons. I used to work in health insurance, gathering data from all kinds of sources (government rules/regs, etc.) and it was a lot of hard work, scraping, structuring, maintaining, etc.

This reference data can be used for analytics, to power table-driven business logic, machine learning - to help identify cost inequalities, efficiencies, maybe even illicit price gouging, etc. There are so many reference data sets that have similar characteristics... and "data marketplaces" in a way are targeted at making "private" reference data sets available for sale - so then where is the "open" data marketplace? Well, here you go.. Dolt.

I have often realized that the more ways we can make things collaborative, the better off we will be.

Data is one of those things where, coming up with common, public reference datasets is difficult and there are lots of different perspectives ("branches"), sometimes your data set is missing something and it would be cool if someone could propose it ("pull request"), sometimes you want to compare the old and new version of a dataset ("diff") to see what is different.

One difficult thing about Dolt is, it will only be successful if people are actually willing to work together to cook up and maintain common data sets collaboratively, or if those doing so have an incentive to manage an "open data" project on Dolt as benevolent maintainers, for example. But, I could say then it has the same challenges as "open source" in general, so therefore it is not really that different.

Dolt could even be used as a foundation for a master data management registry - in the sense of you could pop it in as a "communal data fountain" if you will where anybody in your org, or on the Web, etc. could contribute - and you can have benevolent maintainers look over the general quality of the data. Dolt would be missing the data quality/fuzzy matching aspect that master data tools offer, but this is a start for sure.

For example, I work in a giant corporation right now. Each department prefers to maintain its own domain data and in some cases duplicates common data in that domain. Imagine using Dolt to make it possible for all these different domains to collaborate on a single copy of common data in a "Dolt" data set - now people can share data on a single copy and use pull requests, etc. to have an orderly debate on what that common data schema and data set should look like.

I think it's an idea that is very timely.

P.S. Dolt maintainers, if you read this and want to talk, I'm game! Awesome work :)

replies(3): >>26371790 #>>26372652 #>>26386687 #

1. a-dub ◴[07 Mar 21 02:43 UTC] No.26372652[source]▶

>>26371748 #

agree 100% this looks awesome for reference datasets.

not so sure how well it would work for live data sources that update with time as it could encourage people to apply more ad-hoc edits as opposed to getting their version controlled jobs to work 100%, but who knows, maybe that would be a net win in some cases?

↑