
752 points by crazypython | 2 comments
crazygringo ◴[] No.26371552[source]
This is absolutely fascinating, conceptually.

However, I'm struggling to figure out a real-world use case for this. I'd love it if anyone here could enlighten me.

I don't see how it can be for production databases involving lots of users, because while it seems appealing as a way to upgrade and then roll back, you'd lose all the new data inserted in the meantime. When you roll back, you generally want to roll back changes to the schema (e.g. delete the added column) but not remove all the rows that were inserted/deleted/updated in the meantime.

So does it handle use cases that are more like SQLite? E.g. where application preferences, or even a saved file, winds up containing its entire history, so you can rewind? Although that's really more of a temporal database -- you don't need git operations like branching. And you really just need to track row-level changes, not table schema modifications etc. The git model seems like way overkill.

Git is built for the use case of lots of different people working on different parts of a codebase, integrating their changes, and preserving the history of those changes. But I'm not sure I've ever come across a use case where lots of different people work on the data and schema in different parts of a database and then integrate their data and schema changes. In any kind of shared-dataset scenario I've seen, the schema is tightly locked down, and there's strict business logic around who can update what and how -- otherwise it would be chaos.

So I feel like I'm missing something. What is this actually intended for?

I wish the site explained why they built it -- whether it was just "because we can," or whether projects or teams actually had a need for git for data.

replies(6): >>26371614 #>>26371700 #>>26371748 #>>26371803 #>>26371969 #>>26372126 #
1. zachmu ◴[] No.26371700[source]
The application-backing use case works best when parts of your database get updated periodically and need human review. So you have a production database that you serve to your customers. Then you have a branch / fork of that (dev) that your development team adds batches of products to. Once a week you do a data release: submit a PR from dev -> prod, have somebody review all the new copy, and merge it once you're happy. If there's a big mistake, just back it out again. We have several paying customers building products around this workflow.
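
To make the mechanics concrete, here's a rough sketch of that release loop, driving the dolt CLI from Python. It assumes you're already inside a dolt repository; the table, branch, and remote names are made up for illustration, and the commands mirror their git counterparts.

    # Sketch of the weekly data-release workflow (illustrative names only).
    import subprocess

    def dolt(*args):
        """Run a dolt CLI command in the current repo and fail loudly."""
        subprocess.run(["dolt", *args], check=True)

    # 1. The dev team stages a batch of new products on a dev branch.
    dolt("checkout", "-b", "dev")
    dolt("sql", "-q",
         "INSERT INTO products (sku, name, price) "
         "VALUES ('ABC-123', 'New widget', 19.99);")
    dolt("add", ".")
    dolt("commit", "-m", "Add this week's product batch")
    dolt("push", "origin", "dev")   # then open a PR from dev -> prod on DoltHub

    # 2. After human review, merge the release into the production branch.
    dolt("checkout", "prod")
    dolt("merge", "dev")
    dolt("push", "origin", "prod")

    # 3. If a release turns out to be a mistake, the commit history makes it
    #    easy to back out again (e.g. with a git-style revert of the merge).

The same steps work from a plain shell; the Python wrapper is just there to show how you'd script a recurring release.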

As for lots of people collaborating on data together, we have started a data bounties program where we pay volunteers to assemble large datasets. Two have completed so far, and a third is in progress. For the first one, we paid $25k to assemble precinct-level voting data for the 2016 and 2020 presidential elections. For the second, we paid $10k to get procedure prices for US hospitals. You can read about them here:

https://www.dolthub.com/blog/2021-02-15-election-bounty-revi...

https://www.dolthub.com/blog/2021-03-03-hpt-bounty-review/

What's cool is that novices can make a really good income from data entry as a side gig, and it's two orders of magnitude cheaper than hiring a firm to build data sets for you.

You're right that the site is kind of vague about what dolt is "for." It's a really general, multi-purpose tool that we think will get used in a lot of places. Here's a blog post we wrote a while back about some of the use cases we envision.

https://www.dolthub.com/blog/2020-03-30-dolt-use-cases/

replies(1): >>26377156 #
2. ◴[] No.26377156[source]