
213 points shcheklein | 7 comments
1. causal No.41891218
Is this useful for large binaries?
replies(4): >>41891754 >>41892002 >>41892136 >>41895103
2. dmpetrov No.41891754
Yes. And even more so if you track transformations of the binaries or ML training.
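A rough sketch of what a tracked transformation could look like as a DVC stage (the script and file names below are hypothetical):

    # records the transformation as a pipeline stage in dvc.yaml,
    # with the output binary tracked by DVC
    dvc stage add -n preprocess \
        -d raw/scan.bin -d preprocess.py \
        -o data/scan_clean.bin \
        python preprocess.py raw/scan.bin data/scan_clean.bin
    dvc repro    # re-runs the stage only when its dependencies change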
3. natsucks No.41892002
Would appreciate a good answer to this question. I deal with large medical imaging data (DICOM), and I cannot tell whether it's worth it and/or feasible.
replies(2): >>41892510 >>41895148
4. mkbehbehani No.41892136
Yes, I’ve been using it for about a year to populate databases with a reference DB dump. The current file is about 18 GB. I use Cloudflare R2 as the backing store, so even though it’s being pulled very frequently, the Cloudflare bill is a few bucks per month.
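For reference, the setup is roughly DVC's standard S3-compatible remote config; the bucket, account ID, and keys below are placeholders:

    # R2 is S3-compatible, so it's a normal s3 remote plus a custom endpoint
    dvc remote add -d r2store s3://my-bucket/dvc-store
    dvc remote modify r2store endpointurl https://<account-id>.r2.cloudflarestorage.com
    # credentials can go in the local (non-committed) config
    dvc remote modify --local r2store access_key_id <key>
    dvc remote modify --local r2store secret_access_key <secret>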
5. thangngoc89 No.41892510
It's very much feasible. I'm currently using DVC for DICOM; the repo has grown to about 5 TB of small .dcm files (less than 100 KB each). We use an NFS-mounted NAS for development, but DVC's cache needs to be on the NVMe, otherwise performance would be terrible.
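Pointing the cache at local NVMe is a one-liner (the mount path below is a placeholder); the link-type tweak is optional, but it avoids copying data back out of the cache into the workspace:

    # keep the content-addressed cache on fast local storage
    dvc cache dir /mnt/nvme/dvc-cache
    # optionally, link workspace files to the cache instead of copying them
    dvc config cache.type symlink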
6. kbumsik No.41895103
Large files work well, but it may have performance issues with many (millions of) small files.
7. tomnicholas1 No.41895148
You should look at Icechunk. Your imaging data is structured (it's a multidimensional array), so it should be possible to represent it as "Virtual Zarr". Then you could commit it to an Icechunk store.

https://earthmover.io/blog/icechunk