
752 points crazypython | 1 comment
crazygringo No.26371552
This is absolutely fascinating, conceptually.

However, I'm struggling to figure out a real-world use case for this. I'd love it if anyone here could enlighten me.

I don't see how it can be for production databases with lots of users, because while it seems appealing as a way to upgrade and then roll back, you'd lose all the new data inserted in the meantime. When you roll back, you generally want to revert the schema changes (e.g. drop the added column) but not undo the rows that were inserted, deleted, or updated since the upgrade.
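To make that distinction concrete, here's a rough sketch of the conventional approach: a "down" migration that reverts the schema but keeps the data. The table and column names are made up, and it uses SQLite purely for illustration (DROP COLUMN needs SQLite 3.35 or newer):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    con.execute("INSERT INTO users (name) VALUES ('alice')")

    # migration "up": add a column
    con.execute("ALTER TABLE users ADD COLUMN nickname TEXT")

    # new data keeps arriving while the new schema is live
    con.execute("INSERT INTO users (name, nickname) VALUES ('bob', 'bobby')")

    # migration "down": revert only the schema change; bob's row survives
    con.execute("ALTER TABLE users DROP COLUMN nickname")
    print(con.execute("SELECT id, name FROM users").fetchall())
    # -> [(1, 'alice'), (2, 'bob')]

A git-style revert of the whole database to the pre-migration commit would throw away bob's row along with the column, which is exactly the problem.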

So does it handle use cases that are more like SQLite's? E.g. where application preferences, or even a saved file, end up containing their entire history, so you can rewind? Although that's really more of a temporal database -- you don't need git operations like branching, and you really just need to track row-level changes, not table schema modifications. The git model seems like overkill for that.
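To be concrete about what I mean by row-level history, here's a toy sketch (table and column names made up) of a preferences table that keeps its own history, with no branching or schema tracking involved:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE prefs (key TEXT PRIMARY KEY, value TEXT)")
    con.execute("""CREATE TABLE prefs_history (
        key TEXT, value TEXT,
        valid_from TEXT DEFAULT CURRENT_TIMESTAMP,
        valid_to TEXT)""")

    def set_pref(key, value):
        # close out the current version of the row, then append the new one
        con.execute("UPDATE prefs_history SET valid_to = CURRENT_TIMESTAMP "
                    "WHERE key = ? AND valid_to IS NULL", (key,))
        con.execute("INSERT INTO prefs_history (key, value) VALUES (?, ?)",
                    (key, value))
        con.execute("INSERT OR REPLACE INTO prefs (key, value) VALUES (?, ?)",
                    (key, value))

    set_pref("theme", "light")
    set_pref("theme", "dark")
    # "rewinding" is just reading old rows back out of the history table
    print(con.execute("SELECT value, valid_from, valid_to FROM prefs_history"
                      " WHERE key = 'theme'").fetchall())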

Git is built for the use case of lots of different people working on different parts of a codebase and then integrating their changes, and saving the history of it. But I'm not sure I've ever come across a use case for lots of different people working on the data and schema in different parts of a database and then integrating their data and schema changes. In any kind of shared-dataset scenario I've seen, the schema is tightly locked down, and there's strict business logic around who can update what and how -- otherwise it would be chaos.

So I feel like I'm missing something. What is this actually intended for?

I wish the site explained why they built it -- whether it was just "because we can" or whether projects and teams actually needed git for data.

1. saulrh No.26371969
Here's one thing I'd have used it for: Video game assets.

Say you have a tabletop game engine for designing starships. Different settings have different lists of parts; some settings are run by a game's DM, some are collaborative efforts. I ended up saving the lists of parts in huge JSON files and dumping those into git. However, for much the same reason that data science is often done in a REPL or notebook-style interface, it turned out that by far the most efficient way for people to iterate on assets was to boot up the game, fiddle with the parts in-engine until things looked right, and then replicate their changes back into the JSON. With this, we could have just saved the asset database directly.
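If it helps, that workflow looked roughly like this (file and field names here are invented, and it assumes you're already inside a git repo): dump the in-memory parts list to a stable JSON file and commit it so the diffs stay readable.

    import json
    import subprocess
    from pathlib import Path

    # hypothetical parts data as exported from the game engine
    parts = {
        "reactor_small": {"mass": 12.0, "output": 300},
        "thruster_mk2": {"mass": 4.5, "thrust": 120},
    }

    # sorted keys and fixed indentation keep the git diffs small and readable
    path = Path("assets/parts.json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(parts, indent=2, sort_keys=True) + "\n")

    # every in-engine tweak has to be replicated here by hand, then committed
    subprocess.run(["git", "add", str(path)], check=True)
    subprocess.run(["git", "commit", "-m", "Update starship parts"], check=True)

The painful step is replicating the in-engine changes back into that JSON by hand; that's the part a versioned database would remove.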

The same reasoning should hold for effectively any dataset that a) can be factored into encapsulated parts, b) isn't natively linear, and c) needs multiple developers. Game assets are one example, as I described above. Other datasets where that holds: ML training/testing sets, dictionaries, spreadsheets, catalogs, datasets for bio papers.