Dolt is Git for Data

(github.com)

334 points gjvc | 3 comments | 23 Jun 22 11:04 UTC

cosmic_quanta ◴[23 Jun 22 12:00 UTC] No.31847838[source]▶

That looks awesome. One of the listed use-cases is 'time-travel': https://dolthub.com/blog/2021-03-09-dolt-use-cases-in-the-wi...

I wish we could use this at work. We're trying to predict time-series stuff. However, we carry a lot of infrastructure complexity whose only job is to ensure that, when we train on data from years ago, we don't use data that only became available after that point (future data leaking into the past).

Using Dolt, as far as I understand it, we could simply set the DB to a point in the past where the 'future' data wasn't yet available. Very cool.
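
To make the leakage problem concrete, here is a minimal sketch in plain Python. It is not Dolt's API; the `Observation` class, its field names, and the `as_of` helper are all hypothetical, made up for illustration. The idea is just that every row carries the time it was recorded, and a training run only sees rows recorded on or before a cutoff.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    # All field names here are hypothetical, chosen for illustration only.
    series_id: str
    value: float
    observed_at: datetime  # the time the measurement refers to
    recorded_at: datetime  # the time the row actually landed in the database


def as_of(rows, cutoff):
    """Return only the rows that had already been recorded at `cutoff`."""
    return [r for r in rows if r.recorded_at <= cutoff]


rows = [
    Observation("load", 10.0, datetime(2020, 1, 1), datetime(2020, 1, 2)),
    Observation("load", 12.0, datetime(2020, 1, 2), datetime(2020, 1, 3)),
    # A late revision: it describes 2020-01-01 but was recorded months later.
    Observation("load", 10.5, datetime(2020, 1, 1), datetime(2020, 6, 1)),
]

# Training "as of" the end of January must not see the June revision.
training_rows = as_of(rows, datetime(2020, 1, 31))
```

Filtering on `observed_at` alone would let the June revision through, which is exactly the leak described above. A versioned database would presumably do this bookkeeping for you by restoring the whole table state at a given commit, instead of requiring a `recorded_at` column on every row.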

replies(5): >>31847959 #>>31848014 #>>31849805 #>>31849874 #>>31859003 #

kortex ◴[23 Jun 22 12:22 UTC] No.31848014[source]▶

>>31847838 #

Have you looked at DVC (www.dvc.org)? It takes a little while to figure out how you want to handle the backing store (usually S3), but then it's very straightforward. You could use a similar pattern: keep a data repository and simply move the git HEAD to the desired commit, and DVC automatically adds or removes the data files based on what's in that commit. You can even version binaries without blowing up your .git tree.

replies(2): >>31848910 #>>31849064 #

1. isolli ◴[23 Jun 22 13:34 UTC] No.31848910[source]▶

>>31848014 #

I'm looking into DVC right now, and I feel like the code history (in git) and the data history are too intertwined. If you move the git HEAD back, then you get the old data back, but you also get the old code back. I wish there was a way to move the two "heads" independently. Or is there?

Edit: I can always revert the contents of the .dvc folder to a previous commit, but I wonder if there's a more natural way of doing it.
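
For what it's worth, here is a rough sketch of that revert, assuming the data is tracked through `.dvc` pointer files committed to git. The helper names, the `v1.0` revision, and the `data/train.csv.dvc` path are hypothetical. It only relies on two commands I'm sure exist: `git checkout <rev> -- <paths>` and `dvc checkout <targets>`.

```python
import subprocess


def data_rollback_commands(rev, dvc_files):
    """Build the commands that restore only the data pointers from `rev`."""
    files = list(dvc_files)
    if not files:
        raise ValueError("need at least one .dvc file")
    return [
        # Take just the pointer files from the old commit; code stays at HEAD.
        ["git", "checkout", rev, "--", *files],
        # Ask DVC to sync the workspace data with those pointer files.
        ["dvc", "checkout", *files],
    ]


def roll_back_data(rev, dvc_files, dry_run=True):
    """Return the commands, and run them too unless `dry_run` is set."""
    commands = data_rollback_commands(rev, dvc_files)
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands


# Hypothetical revision and path; the dry run only builds the command lists.
commands = roll_back_data("v1.0", ["data/train.csv.dvc"])
```

Note that `git checkout` with a path also stages the old pointer file, so the working tree will show it as a pending change until you commit it or restore it.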

replies(2): >>31849719 #>>31854496 #

2. george_ciobanu ◴[23 Jun 22 14:38 UTC] No.31849719[source]▶

>>31848910 (TP) #

Also check out Datomic.

3. arjvik ◴[23 Jun 22 20:05 UTC] No.31854496[source]▶

>>31848910 (TP) #

If you want the dataset to be independent, I would recommend having a separate repository for the dataset and using Git submodules to pull it in. That way you can check out different versions of the dataset and the code, because they are essentially in separate working trees.