(dvc.org)

213 points shcheklein | 4 comments | 19 Oct 24 16:56 UTC

1. notrealyme123 ◴[20 Oct 24 03:59 UTC] No.41892751[source]▶

I had a lot of problems when using it with a dataset of many JPG files.

Every dvc status run spent many minutes indexing and checking every file. Caching did not work.

Sadly I had to let go of it.

2. woodson ◴[20 Oct 24 04:12 UTC] No.41892807[source]▶

Yes, its performance is rather poor and there can be a lot of headaches with caching (especially if you're using a file system that doesn't support reflinks). For large sharded datasets (e.g. WebDataset), you're better off with other solutions, especially when your ML pipeline can stream them directly from object storage.
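To make the streaming idea concrete, here is a minimal sketch using only Python's standard library. It is not WebDataset's or DataChain's API. It assumes the WebDataset-style convention that files belonging to one sample sit next to each other in the tar and share the part of their name before the first dot. The names iter_samples and build_demo_shard are made up for this illustration, and the in-memory shard stands in for whatever readable binary stream an object store client would hand you.

```python
import io
import json
import tarfile
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_samples(fileobj: BinaryIO) -> Iterator[Tuple[str, Dict[str, bytes]]]:
    """Yield (key, {extension: bytes}) for each sample in a tar stream.

    Assumption: files of one sample are adjacent in the archive and share
    everything before the first dot of their basename.
    """
    current_key = None
    fields: Dict[str, bytes] = {}
    # Mode "r|*" reads strictly front to back, so fileobj only needs read();
    # nothing is unpacked to local disk.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            directory, _, basename = member.name.rpartition("/")
            stem, _, extension = basename.partition(".")
            key = f"{directory}/{stem}" if directory else stem
            if current_key is not None and key != current_key:
                # The key changed, so the previous sample is complete.
                yield current_key, fields
                fields = {}
            current_key = key
            fields[extension] = archive.extractfile(member).read()
    if current_key is not None:
        yield current_key, fields


def build_demo_shard() -> bytes:
    """Build a tiny in-memory shard with made-up payloads (demo data only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in [
            ("000001.jpg", b"fake-jpeg-1"),
            ("000001.json", b'{"label": "cat"}'),
            ("000002.jpg", b"fake-jpeg-2"),
            ("000002.json", b'{"label": "dog"}'),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


if __name__ == "__main__":
    shard = io.BytesIO(build_demo_shard())
    for key, fields in iter_samples(shard):
        # Filter on per-sample metadata without touching the other samples.
        if json.loads(fields["json"])["label"] == "cat":
            print(key, len(fields["jpg"]), "bytes")
```

Since the archive is consumed sequentially, there is no per-file index or local cache to maintain, which is the trade-off being described: you give up local caching in exchange for a single pass over each shard.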

replies(1): >>41892964 #

3. dmpetrov ◴[20 Oct 24 04:57 UTC] No.41892964[source]▶

>>41892807 #

Right, DVC caches data for consistency and reproducibility.

If caching is not needed and streaming is required, we've created a sister tool, DataChain. It even supports WebDataset and can stream from tar archives and filter images by metadata.

WebDataset example: https://github.com/iterative/datachain/blob/main/examples/mu...

replies(1): >>41900832 #

4. notrealyme123 ◴[21 Oct 24 05:01 UTC] No.41900832{3}[source]▶

>>41892964 #

Thank you! That's news to me. I will absolutely give it a try.


Data Version Control