
213 points shcheklein | 4 comments
1. notrealyme123 ◴[] No.41892751[source]
I had a lot of problems when using it with a dataset of many JPG files.

Every dvc status run spent many minutes indexing because it checked every file. Caching did not work.

Sadly I had to let go of it.

replies(1): >>41892807 #
2. woodson ◴[] No.41892807[source]
Yes, its performance is rather poor and there can be a lot of headaches with caching (especially if you're using a file system that doesn't support reflinks). For large sharded datasets (e.g. WebDataset), you're better off with other solutions, especially when your ML pipeline can stream them directly from object storage.
replies(1): >>41892964 #
3. dmpetrov ◴[] No.41892964[source]
Right, DVC caches data for consistency and reproducibility.

If caching is not needed and streaming is required, we've created a sister tool, DataChain. It even supports WebDataset and can stream from tar archives and filter images by metadata.

WebDataset example: https://github.com/iterative/datachain/blob/main/examples/mu...
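
For readers unfamiliar with the idea, here is a rough sketch of what "stream from a tar archive and filter by metadata" means in general. It uses only the Python standard library and is not DataChain's or WebDataset's API. The helper names (iter_samples, filter_by_metadata, build_demo_shard), the "label" field, and the in-memory demo shard are all made up for illustration. It assumes the common WebDataset-style layout where files sharing a basename (000.jpg, 000.json) are stored next to each other and form one sample.

```python
import io
import json
import tarfile
from collections.abc import Callable, Iterable, Iterator
from typing import BinaryIO

Sample = tuple[str, dict[str, bytes]]  # (key, {extension: raw bytes})


def iter_samples(stream: BinaryIO) -> Iterator[Sample]:
    """Group consecutive tar members that share a basename into one sample."""
    current_key: str | None = None
    files: dict[str, bytes] = {}
    # "r|*" reads strictly sequentially, so the stream does not need to be
    # seekable (e.g. an HTTP or object-storage response body).
    with tarfile.open(fileobj=stream, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Assumption: flat names like "000.jpg"; key is everything before the first dot.
            key, _, ext = member.name.partition(".")
            if key != current_key:
                if files:
                    yield current_key, files
                current_key, files = key, {}
            # In stream mode the payload must be read before advancing.
            files[ext] = tar.extractfile(member).read()
    if files:
        yield current_key, files


def filter_by_metadata(
    samples: Iterable[Sample], predicate: Callable[[dict], bool]
) -> Iterator[Sample]:
    """Keep samples whose JSON sidecar satisfies the predicate."""
    for key, files in samples:
        meta = json.loads(files.get("json", b"{}"))
        if predicate(meta):
            yield key, files


def build_demo_shard() -> bytes:
    """Build a tiny in-memory shard (hypothetical data) to run the sketch on."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, label in [("000", "cat"), ("001", "dog"), ("002", "cat")]:
            payloads = {
                "jpg": b"fake image bytes",
                "json": json.dumps({"label": label}).encode(),
            }
            for ext, payload in payloads.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


shard = io.BytesIO(build_demo_shard())
cats = [
    key
    for key, _ in filter_by_metadata(
        iter_samples(shard), lambda meta: meta.get("label") == "cat"
    )
]
print(cats)
```

Nothing is unpacked to disk or cached here: each sample is read once, checked against its metadata, and either passed on or dropped. That is the trade-off compared with a cache-first workflow.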

replies(1): >>41900832 #
4. notrealyme123 ◴[] No.41900832{3}[source]
Thank you! That's news to me. I will absolutely give it a try.