[1] https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...
Happy to answer any questions about DVC and our sister project DataChain https://github.com/iterative/datachain that does data versioning with slightly different assumptions: no file copying and built-in data transformations.
We used Google Drive as a storage backend and eventually outgrew it, moving to a WebDAV backend, and it was nearly trivial to swap them out and migrate.
https://news.ycombinator.com/item?id=33047634
What makes DVC especially useful for MLOps? Aren’t MLFlow or W&B solving that, in a way that’s open source (the former) or that massively increases speed and scale (the latter)?
Disclaimer: I work at W&B.
1. Files are too large for Git and Git LFS.
2. You prefer using S3/GCS/Azure as storage.
3. You need to track transformations/pipelines on the files - cleaning up a text file, training a model, etc.
Otherwise, vanilla Git may be sufficient.
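For the S3/GCS/Azure case, the basic workflow is only a handful of commands. A minimal sketch (the bucket name and paths here are made-up placeholders):

    dvc init
    dvc add data/images                       # hash files into DVC's cache, keep them out of Git
    git add data/images.dvc data/.gitignore
    dvc remote add -d storage s3://my-bucket/dvc-store
    dvc push                                  # upload the actual contents to the remote
    git commit -m "track raw images with DVC"

Teammates then just git pull and dvc pull to get the same data.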
It essentially makes sure that your results can reproducibly be generated from your original data. If any script or data file is changed, the parts of your pipeline that depend on it, possibly recursively, get re-run and the relevant results get updated automatically.
There's no chance of e.g. changing the structure of your original dataset slightly, forgetting to regenerate one of the intermediate models by accident, not noticing that the script to regenerate it doesn't work any more due to the new dataset structure, and then getting reminded a year later when moving to a new computer and trying to regen everything from scratch.
It's a lot like Unix make, but with the ability to keep track of different git branches and the data / intermediates they need, which saves you from needing to regen everything every time you make a new checkout, lets you easily exchange large datasets with teammates etc.
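To make the make analogy concrete, defining and reproducing a small pipeline looks roughly like this (a sketch; the stage names, scripts and paths are invented):

    dvc stage add -n prepare -d prepare.py -d data/raw.csv -o data/clean.csv \
        python prepare.py
    dvc stage add -n train -d train.py -d data/clean.csv -o model.pkl \
        python train.py
    dvc repro    # re-runs only the stages whose dependencies changed

dvc repro walks the DAG recorded in dvc.yaml/dvc.lock, so unchanged stages are skipped and their cached outputs are reused.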
In theory, you could store everything in git, but then every time you made a small change to your scripts that e.g. changed the way some model works and slightly adjusted a score for each of ten million rows, your diff would be 10m LOC, and all versions of that dataset would be stored in your repo, forever, making it unbelievably large.
And it is not either/or. People actually combine MLFlow and DVC [0]
[0] https://data-ai.theodo.com/blog-technique/dvc-pipeline-runs-...
[0] https://dvc.org/doc/user-guide/data-management/remote-storag...
Indexing for every dvc status run took many minutes to check every file, and caching did not work.
Sadly I had to let go of it.
If caching is not needed and streaming is required, we've created a sister tool, DataChain. It even supports WebDataset and can stream from tar archives and filter images by metadata.
WebDataset example: https://github.com/iterative/datachain/blob/main/examples/mu...
A hybrid solution of keeping dataset metadata under DVC and versioning that could work. This was many years ago, though, and I would be curious if there are any other on-prem data versioning solutions; when I last searched, all of them seemed geared towards the cloud.
my first impression: dvc is made to be used with git, where arbitrary folders INSIDE your git repo are handled by dvc, whereas oxen is an alternative built around a separate data repo. also oxen has lots of integration with dataframes and tabular data, ai training and inference data that dvc is missing. on the other hand dvc has a full DAG pipeline engine integrated, as well as import/export and pluggable backends.
when i say ‘blob’ data, a good example to think of is a set of really long 1080p video files.
tl;dr version
* throw data into dvc when unstructured ‘blob’ data.
* throw it into iceberg when you’ve got structured data.
benefits of dvc over iceberg:
* not forcing ‘blob’ data into a tabular format and all the “fun” (read: annoying) processing steps that come with doing that
* don’t have to run some processing step to extract ‘blob’ data out of what is basically a parquet file; dvc pull (?) will just download each file as-is.
* edit files locally then run three-ish (?) commands to commit changes (sketched after this list), without needing to run a data ingestion pipeline to force ‘blob’ data into a table
* completely schema less, so don’t have to worry about ‘blob’ data being the wrong type, just shove it in the repo and commit it
* roll back throughout all of commit history, not just to the last vacuum/checkpoint
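the three-ish (?) commands mentioned above are roughly this (just a sketch, with placeholder paths):

    dvc add data/videos      # re-hash the edited files into the local cache
    git commit -am "apply new filter to clips"
    dvc push                 # upload the new blobs to remote storage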
basically, tabular data formats and ‘blob’ data shoved into them is a recipe for pain.
shoving ‘blobs’ into a git like repo is much faster and easier.
especially if you need full version history, branches for different outcomes etc.
trying to have different branches in Iceberg for your set of really long 1080p video files, where you have applied different ffmpeg filters in different branches and want people to be able to access all of them, and their history, sounds nightmare-ish.
in dvc, that’s ^ easy.
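for example, something like this per variant (sketch; branch, filter and file names are made up):

    git checkout -b denoised
    ffmpeg -i raw/clip.mp4 -vf hqdn3d data/videos/clip.mp4    # apply this branch's filter
    dvc add data/videos
    git commit -am "denoised variant of the clips"
    dvc push
    git checkout main && dvc checkout    # workspace now holds main's version of the videos again

dvc checkout swaps the working copy of the data to match whatever .dvc files the current branch points at.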
basically, it’s like creating a data lake which won’t turn into a data swamp because everything is version controlled.
Specifically, it's a genius way to store large files in git repos directly on any object storage without custom application servers like git-lfs or rewriting git from scratch...
At DagsHub [0], we've integrated directly with DVC for a looong time, so teams can use it with added features like visualizing and labeling datasets, managing models, running experiments collaboratively, and tracking everything (code, data, models, etc.) all in one place.
Just wanted to share that for those already using or considering DVC—there are some options to use it as a building block in a more end-to-end toolchain.
Happy to answer any thoughts or questions!
> Not everybody wants a framework.
The second part of this comment seems strange to me. Surely nothing on Hacker News is shared with the expectation that it will be interesting, or useful, to everyone. Equally, surely there are some people on HN who will be interested in a framework, even if it might be too heavy for other people.
Another potential aspect would be tracking schema evolution in a nicer way than we currently do.
thx in advance, huge fan of anything-as-code and think it’s a great fit for data (20+ years in this area).
Is there any support that would be helpful? I’ll look at the project page too.