[1] https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...
Happy to answer any questions about DVC and our sister project DataChain https://github.com/iterative/datachain that does data versioning with slightly different assumptions: no file copying and built-in data transformations.
We used Google Drive as a storage backend and eventually outgrew it, moving to a WebDAV backend, and it was nearly trivial to swap them out and migrate.
https://news.ycombinator.com/item?id=33047634
What makes DVC especially useful for MLOps? Aren’t MLFlow or W&B solving that, in a way that’s open source (the former) or that massively increases speed and scale (the latter)?
Disclaimer: I work at W&B.
1. Files are too large for Git and Git LFS.
2. You prefer using S3/GCS/Azure as storage.
3. You need to track transformations/pipelines on the files - cleaning up a text file, training a model, etc.
Otherwise, vanilla Git may be sufficient.
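For the S3/GCS/Azure case, the basic workflow is only a handful of commands. A minimal sketch (the bucket name and paths here are made-up placeholders):

    dvc init
    dvc add data/images                       # hash files into DVC's cache, keep them out of Git
    git add data/images.dvc data/.gitignore
    dvc remote add -d storage s3://my-bucket/dvc-store
    dvc push                                  # upload the actual contents to the remote
    git commit -m "track raw images with DVC"

Teammates then just git pull and dvc pull to get the same data.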
It essentially makes sure that your results can reproducibly be generated from your original data. If any script or data file is changed, the parts of your pipeline that depend on it, possibly recursively, get re-run and the relevant results get updated automatically.
There's no chance of e.g. changing the structure of your original dataset slightly, forgetting to regenerate one of the intermediate models by accident, not noticing that the script to regenerate it doesn't work any more due to the new dataset structure, and then getting reminded a year later when moving to a new computer and trying to regen everything from scratch.
It's a lot like Unix make, but with the ability to keep track of different git branches and the data / intermediates they need, which saves you from needing to regen everything every time you make a new checkout, lets you easily exchange large datasets with teammates etc.
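To make the make analogy concrete, defining and reproducing a small pipeline looks roughly like this (a sketch; the stage names, scripts and paths are invented):

    dvc stage add -n prepare -d prepare.py -d data/raw.csv -o data/clean.csv \
        python prepare.py
    dvc stage add -n train -d train.py -d data/clean.csv -o model.pkl \
        python train.py
    dvc repro    # re-runs only the stages whose dependencies changed

dvc repro walks the DAG recorded in dvc.yaml/dvc.lock, so unchanged stages are skipped and their cached outputs are reused.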
In theory, you could store everything in git, but then every time you made a small change to your scripts that e.g. changed the way some model works and slightly adjusted a score for each of ten million rows, your diff would be 10m LOC, and all versions of that dataset would be stored in your repo, forever, making it unbelievably large.
And it is not either/or. People actually combine MLFlow and DVC [0]
[0] https://data-ai.theodo.com/blog-technique/dvc-pipeline-runs-...
[0] https://dvc.org/doc/user-guide/data-management/remote-storag...
Indexing for every dvc status run took many minutes to check every file, and caching did not work.
Sadly I had to let go of it.
If caching is not needed and streaming is required, we've created a sister tool, DataChain. It even supports WebDataset and can stream from tar archives and filter images by metadata.
WebDataset example: https://github.com/iterative/datachain/blob/main/examples/mu...
A hybrid solution of keeping dataset metadata under DVC and versioning that could work. This was many years ago, though, and I would be curious if there are any other on-prem data versioning solutions; when I last searched, all of them seemed geared towards the cloud.
my first impression: dvc is made to be used with git, where arbitrary folders INSIDE your git repo are handled by dvc, whereas oxen is an alternative built around a separate data repo. also oxen has lots of integration with dataframes and tabular data, ai training and inference data that dvc is missing. on the other hand dvc has a full DAG pipeline engine integrated, as well as import/export and pluggable backends.
when i say ‘blob’ data, a good example to think of is a set of really long 1080p video files.
tl;dr version
* throw data into dvc when unstructured ‘blob’ data.
* throw it into iceberg when you’ve got structured data.
benefits of dvc over iceberg:
* not forcing ‘blob’ data into a tabular format and all the “fun” (read: annoying) processing steps that come with doing that
* don’t have to run some processing step to extract ‘blob’ data out of what is basically a parquet file; dvc pull (?) will just download each file as-is.
* edit files locally then run three-ish (?) commands to commit changes (sketched after this list), without needing to run a data ingestion pipeline to force ‘blob’ data into a table
* completely schema less, so don’t have to worry about ‘blob’ data being the wrong type, just shove it in the repo and commit it
* roll back throughout all of commit history, not just to the last vacuum/checkpoint
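the three-ish (?) commands mentioned above are roughly this (just a sketch, with placeholder paths):

    dvc add data/videos      # re-hash the edited files into the local cache
    git commit -am "apply new filter to clips"
    dvc push                 # upload the new blobs to remote storage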
basically, tabular data formats and ‘blob’ data shoved into them is a recipe for pain.
shoving ‘blobs’ into a git like repo is much faster and easier.
especially if you need full version history, branches for different outcomes etc.
trying to have different branches in Iceberg for your set of really long 1080p video files, where you have applied different ffmpeg filters in different branches and want people to be able to access all of them, and their history, sounds nightmare-ish.
in dvc, that’s ^ easy.
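for example, something like this per variant (sketch; branch, filter and file names are made up):

    git checkout -b denoised
    ffmpeg -i raw/clip.mp4 -vf hqdn3d data/videos/clip.mp4    # apply this branch's filter
    dvc add data/videos
    git commit -am "denoised variant of the clips"
    dvc push
    git checkout main && dvc checkout    # workspace now holds main's version of the videos again

dvc checkout swaps the working copy of the data to match whatever .dvc files the current branch points at.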
basically, it’s like creating a data lake which won’t turn into a data swamp because everything is version controlled.
Specifically, it's a genius way to store large files in git repos directly on any object storage without custom application servers like git-lfs or rewriting git from scratch...
At DagsHub [0], we've integrated directly with DVC for a looong time, so teams can use it with added features like visualizing and labeling datasets, managing models, running experiments collaboratively, and tracking everything (code, data, models, etc.) all in one place.
Just wanted to share that for those already using or considering DVC—there are some options to use it as a building block in a more end-to-end toolchain.
Happy to answer any thoughts or questions!
> Not everybody wants a framework.
The second part of this comment seems strange to me. Surely nothing on Hacker News is shared with the expectation that it will be interesting, or useful, to everyone. Equally, surely there are some people on HN who will be interested in a framework, even if it might be too heavy for other people.
Another potential aspect would be tracking schema evolution in a nicer way than we currently do.
thx in advance, huge fan of anything-as-code and think it’s a great fit for data (20+ years in this area).
Is there any support that would be helpful? I’ll look at the project page too.