
213 points | shcheklein | 1 comment
dmpetrov ◴[] No.41890616[source]
Hi there! Maintainer and author here. Excited to see DVC on the front page!

Happy to answer any questions about DVC and our sister project DataChain https://github.com/iterative/datachain, which does data versioning with slightly different assumptions: no file copying and built-in data transformations.

replies(3): >>41890932 #>>41896923 #>>41897005 #
ajoseps ◴[] No.41890932[source]
If the data files are all just text files, what is the difference between using DVC and using plain git?
replies(3): >>41891059 #>>41891080 #>>41893500 #
miki123211 ◴[] No.41891080[source]
DVC does a lot more than git.

It essentially makes sure that your results can be reproducibly generated from your original data. If any script or data file changes, the parts of your pipeline that depend on it, directly or transitively, get re-run and the relevant results get updated automatically.

There's no chance of e.g. changing the structure of your original dataset slightly, forgetting to regenerate one of the intermediate models by accident, not noticing that the script to regenerate it doesn't work any more due to the new dataset structure, and then getting reminded a year later when moving to a new computer and trying to regen everything from scratch.

It's a lot like Unix make, but with the ability to keep track of different git branches and the data / intermediates they need. That saves you from needing to regen everything every time you make a new checkout, lets you easily exchange large datasets with teammates, etc.
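
To make the make-like behaviour concrete, here is a minimal sketch of how stale stages could be detected by comparing content hashes. It is an illustration only, not DVC's actual code: the `stale_stages` function, the stage dictionary layout, and all file names are hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    """Content hash used to detect changes (hypothetical choice of SHA-256)."""
    return hashlib.sha256(content).hexdigest()


def stale_stages(stages, files, recorded):
    """Return the set of stage names that need to be re-run.

    stages:   name -> {"deps": [paths], "outs": [paths]}
    files:    path -> current content (bytes)
    recorded: path -> digest stored after the last successful run
    """
    # Files whose current content no longer matches the recorded digest.
    dirty = {p for p, c in files.items() if recorded.get(p) != digest(c)}
    stale = set()
    changed = True
    # Repeat until no new stage becomes stale (propagates through the graph).
    while changed:
        changed = False
        for name, stage in stages.items():
            if name not in stale and dirty.intersection(stage["deps"]):
                stale.add(name)
                # Outputs of a stale stage will change, so mark them dirty too.
                dirty.update(stage["outs"])
                changed = True
    return stale


# Hypothetical three-stage pipeline.
stages = {
    "train": {"deps": ["features.csv", "train.py"], "outs": ["model.pkl"]},
    "featurize": {"deps": ["raw.csv", "featurize.py"], "outs": ["features.csv"]},
    "report": {"deps": ["notes.md"], "outs": ["report.html"]},
}
files = {
    "raw.csv": b"a,b\n1,2\n",
    "featurize.py": b"# featurize\n",
    "features.csv": b"x\n3\n",
    "train.py": b"# train\n",
    "notes.md": b"# notes\n",
}
# Pretend everything was up to date at the last run.
recorded = {path: digest(content) for path, content in files.items()}

# Slightly change the structure of the original dataset.
files["raw.csv"] = b"a,b,c\n1,2,3\n"

result = stale_stages(stages, files, recorded)
print(sorted(result))
```

Changing only `raw.csv` marks `featurize` as stale, and `train` follows because it depends on the output of `featurize`; `report` is left alone.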

In theory, you could store everything in git, but then every time you made a small change to your scripts that e.g. changed the way some model works and slightly adjusted a score for each of ten million rows, your diff would be 10m LOC, and all versions of that dataset would be stored in your repo, forever, making it unbelievably large.
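
The usual way around this is to commit only a small pointer file to git and keep the data itself in a content-addressed cache outside the repo. The sketch below shows that idea under stated assumptions: `add_to_cache`, the `.ptr` suffix, and the JSON layout are made up for illustration, and DVC's real pointer files and cache layout differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def add_to_cache(data_path: Path, cache_dir: Path) -> Path:
    """Copy a data file into a content-addressed cache and write a tiny
    pointer file next to it. Only the pointer would be committed to git."""
    content = data_path.read_bytes()
    file_hash = hashlib.sha256(content).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The cache entry is named after the hash of its content.
    (cache_dir / file_hash).write_bytes(content)
    # Hypothetical pointer format: one line of JSON.
    pointer = data_path.with_name(data_path.name + ".ptr")
    pointer.write_text(json.dumps({"sha256": file_hash, "size": len(content)}) + "\n")
    return pointer


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data = root / "scores.csv"
    cache = root / "cache"

    # Version 1 of a (scaled-down) dataset with one score per row.
    data.write_text("".join(f"{i},0.5\n" for i in range(10_000)))
    pointer = add_to_cache(data, cache)
    pointer_v1 = pointer.read_text()

    # Version 2: every row's score shifts slightly.
    data.write_text("".join(f"{i},0.6\n" for i in range(10_000)))
    pointer = add_to_cache(data, cache)
    pointer_v2 = pointer.read_text()

    data_bytes = data.stat().st_size
    pointer_bytes = pointer.stat().st_size
    cached_versions = len(list(cache.iterdir()))

print(data_bytes, pointer_bytes, cached_versions)
```

Every row of the dataset changed, but the part that git would track is a single changed line in the pointer file. Both versions of the data still exist in the cache, which can live on separate storage instead of in the repo history.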

replies(3): >>41891756 #>>41894861 #>>41895262 #
amelius ◴[] No.41895262[source]
Sounds like it is more a framework than a tool.

Not everybody wants a framework.

replies(2): >>41895874 #>>41896912 #
1. stochastastic ◴[] No.41896912[source]
It doesn’t force you to use any of the extra functionality. My team has been using it just for the version control part for a couple years and it has worked great.