213 points shcheklein | 5 comments
jerednel No.41889752
It's not super clear to me how this interacts with data. If I am using ADLS to store Delta tables and cannot pull prod data to my local machine, can I still use this? Is there a point if I can just look at the Delta log to switch between past versions?
replies(1): >>41889814 #
1. riedel No.41889814
DVC is (at least as I use it) pretty much just git LFS with multiple backends (I guess it is actually a simpler git-annex). It also has some rather MLOps-specific stuff. It is handy if you do versioned model training with changing data on S3.
replies(3): >>41890760 #>>41890767 #>>41890837 #
2. starkparker No.41890760
I've used it for storing rasters alongside georeferencing data in small GIS projects, as an alternative to git LFS. It not only works like git but can integrate with git repos through commit and push/pull hooks, storing DVC pointers and managing .gitignore files while retaining directory structure of the DVC-managed files. It's neat, even if the initial learning curve was a little steep.

We used Google Drive as a storage backend and had to grow out of it to a WebDAV backend, and it was nearly trivial to swap them out and migrate.
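
To make the pointer-plus-backend idea above concrete, here is a minimal sketch in plain Python. It is not DVC's implementation or API: the function names (`track`, `restore`, `migrate`), the `.pointer.json` file format, and the local directories standing in for storage backends are all hypothetical, chosen only for illustration. The point is that the small pointer records a content hash, so moving to a new backend only means copying the stored objects while the pointers kept in git stay unchanged.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, backend: Path) -> Path:
    """Store data_file in a content-addressed backend and write a small pointer next to it."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    backend.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data_file, backend / digest)  # object is named by its hash
    pointer = data_file.parent / (data_file.name + ".pointer.json")  # hypothetical format
    pointer.write_text(json.dumps({"md5": digest, "path": data_file.name}))
    return pointer


def restore(pointer: Path, backend: Path) -> Path:
    """Recreate the file named in the pointer from any backend that holds its hash."""
    meta = json.loads(pointer.read_text())
    target = pointer.parent / meta["path"]
    shutil.copy2(backend / meta["md5"], target)
    return target


def migrate(old_backend: Path, new_backend: Path) -> None:
    """Switch backends by copying every stored object; pointers are not touched."""
    new_backend.mkdir(parents=True, exist_ok=True)
    for obj in old_backend.iterdir():
        shutil.copy2(obj, new_backend / obj.name)


def demo() -> bytes:
    """Track a file, migrate backends, delete the local copy, and restore it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        raster = root / "tile.tif"
        raster.write_bytes(b"fake raster bytes")
        pointer = track(raster, root / "backend_a")
        migrate(root / "backend_a", root / "backend_b")
        raster.unlink()  # simulate a fresh checkout that has only the pointer
        return restore(pointer, root / "backend_b").read_bytes()


if __name__ == "__main__":
    print(demo())
```

In this sketch only the pointer file would be committed to git, and the large file itself would be listed in .gitignore.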

3. haensi No.41890767
There’s another thread from October 2022 on that topic.

https://news.ycombinator.com/item?id=33047634

What makes DVC especially useful for MLOps? Aren’t MLFlow or W&B solving that in a way that’s either open source (the former) or that massively increases speed and scale (the latter)?

Disclaimer: I work at W&B.

replies(1): >>41891199 #
4. matrss No.41890837
Speaking of git-annex, there is another project called DataLad (https://www.datalad.org/), which has some overlap with DVC. It uses git-annex under the hood and is domain-agnostic, compared to the ML focus that DVC has.
5. riedel No.41891199
DVC is much more basic (it feels more Unix-style) and integrates really well with simple CI/CD scripting on top of git versioning, without the need to set up any additional servers.

And it is not either/or. People actually combine MLFlow and DVC [0].

[0] https://data-ai.theodo.com/blog-technique/dvc-pipeline-runs-...