
752 points | crazypython | 1 comment
strogonoff No.26370733
You can also use Git for data!

It’s a bit slower, but smart use of partial/shallow clones can address the performance degradation large repositories suffer over time. You just need to take care of the transformation between “physical” trees/blobs and the “logical” objects in your dataset (which may not map 1:1, since making the physical layer more granular reduces the likelihood of merge conflicts).
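
A minimal sketch of that split, assuming a hypothetical local repo, record layout, and remote URL: each logical record is written as its own small JSON file (the granular “physical” layer), and a consumer keeps large histories manageable with a shallow, partial clone.

    import json
    import subprocess
    from pathlib import Path

    REPO = Path("dataset-repo")  # hypothetical local repository

    # "Logical" objects in the dataset.
    records = [
        {"id": 1, "label": "chewing", "duration_s": 2.4},
        {"id": 2, "label": "not-chewing", "duration_s": 5.1},
    ]

    subprocess.run(["git", "init", str(REPO)], check=True)

    # "Physical" layer: one file per record, so edits to different records
    # touch different blobs and rarely produce merge conflicts.
    for rec in records:
        path = REPO / "records" / f"{rec['id']:08d}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(rec, sort_keys=True, indent=2) + "\n")

    subprocess.run(["git", "-C", str(REPO), "add", "records"], check=True)
    subprocess.run(["git", "-C", str(REPO), "commit", "-m", "update records"], check=True)

    # Consumer side: a shallow, partial clone fetches only the latest commit
    # and downloads blobs on demand, which keeps old, large repos usable.
    subprocess.run([
        "git", "clone",
        "--depth", "1",                     # shallow: latest commit only
        "--filter=blob:none",               # partial: blobs fetched lazily
        "https://example.com/dataset.git",  # hypothetical remote
        "dataset-shallow",
    ], check=True)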

I’m also following Pijul, which seems very promising for versioning data: I believe it may introduce primitives for operating on changes to actual data structures, rather than on lines in files as Git does.

Add to that a sound theory of patches, and that’s a definite win over Git (or Dolt, for that matter, which seems to be the same old Git but for SQL).

1. rapjr9 No.26372041
We used git to store and manage data sets for a machine learning project on chewing detection, with audio data used in training. It was cumbersome, and the huge datasets caused problems with git (e.g., searches of our code base got really slow because the data was being searched as well, until we moved the data to a different repo). Something easier to use that could manage large datasets would be useful.
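
Two common mitigations, sketched here with made-up paths, search terms, and URLs: exclude the data directory from searches with a git pathspec, or keep the dataset in its own repository and pull it in as a submodule (roughly the split the comment describes).

    import subprocess

    # Search only the code: ":(exclude)" is standard git pathspec magic, and
    # "data/" is the (hypothetical) directory holding the audio files.
    result = subprocess.run(
        ["git", "grep", "-n", "spectrogram", "--", ".", ":(exclude)data/"],
        capture_output=True, text=True,
    )
    print(result.stdout)

    # Heavier-weight fix: keep the dataset in its own repository and
    # reference it from the code repo as a submodule.
    subprocess.run(
        ["git", "submodule", "add",
         "https://example.com/chewing-audio-data.git",  # hypothetical data repo
         "data"],
        check=True,
    )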

I wonder if Dolt could be used to create a clone of Apple's Time Machine. Seems like the basics are there.