
213 points shcheklein | 7 comments
1. causal No.41891218
Is this useful for large binaries?
replies(4): >>41891754 >>41892002 >>41892136 >>41895103
2. dmpetrov No.41891754
Yes. And even more so if you track transformations of the binaries or ML training.
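A rough sketch of what a tracked transformation could look like as a DVC stage (the script and file names below are hypothetical):

    # records the transformation as a pipeline stage in dvc.yaml,
    # with the output binary tracked by DVC
    dvc stage add -n preprocess \
        -d raw/scan.bin -d preprocess.py \
        -o data/scan_clean.bin \
        python preprocess.py raw/scan.bin data/scan_clean.bin
    dvc repro    # re-runs the stage only when its dependencies change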
3. natsucks No.41892002
Would appreciate a good answer to this question. I deal with large medical imaging data (DICOM), and I cannot tell whether it's worth it and/or feasible.
replies(2): >>41892510 >>41895148
4. mkbehbehani No.41892136
Yes, I’ve been using it for about a year to populate databases with a reference DB dump. The current file is about 18 GB. I use Cloudflare R2 as the backing store, so even though it’s being pulled very frequently, the Cloudflare bill is a few bucks per month.
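For reference, the setup is roughly DVC's standard S3-compatible remote config; the bucket, account ID, and keys below are placeholders:

    # R2 is S3-compatible, so it's a normal s3 remote plus a custom endpoint
    dvc remote add -d r2store s3://my-bucket/dvc-store
    dvc remote modify r2store endpointurl https://<account-id>.r2.cloudflarestorage.com
    # credentials can go in the local (non-committed) config
    dvc remote modify --local r2store access_key_id <key>
    dvc remote modify --local r2store secret_access_key <secret>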
5. thangngoc89 No.41892510
It's very much feasible. I'm currently using DVC for DICOM; the repo has grown to about 5 TB of small .dcm files (less than 100 KB each). We use an NFS-mounted NAS for development, but DVC's cache needs to be on the NVMe, otherwise performance would be terrible.
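Pointing the cache at local NVMe is a one-liner (the mount path below is a placeholder); the link-type tweak is optional, but it avoids copying data back out of the cache into the workspace:

    # keep the content-addressed cache on fast local storage
    dvc cache dir /mnt/nvme/dvc-cache
    # optionally, link workspace files to the cache instead of copying them
    dvc config cache.type symlink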
6. kbumsik No.41895103
Large files work well, but it may have performance issues with many (millions of) small files.
7. tomnicholas1 No.41895148
You should look at Icechunk. Your imaging data is structured (it's a multidimensional array), so it should be possible to represent it as "Virtual Zarr". Then you could commit it to an Icechunk store.

https://earthmover.io/blog/icechunk