←back to thread

132 points fractalbits | 1 comments | | HN request time: 0s | source
Show context
kburman ◴[] No.46255028[source]
I feel like this product is optimizing for an anti-pattern.

The blog argues that AI workloads are bottlenecked by latency because of 'millions of small files.' But if you are training on millions of loose 4KB objects directly from network storage, your data pipeline is the problem, not the storage layer.

Data Formats: Standard practice is to use formats like WebDataset, Parquet, or TFRecord to chunk small files into large, sequential blobs. This negates the need for high-IOPS metadata operations and makes standard S3 throughput the only metric that matters (which is already plentiful).

Caching: Most high-performance training jobs hydrate local NVMe scratch space on the GPU nodes. S3 is just the cold source of truth. We don't need sub-millisecond access to the source of truth, we need it at the edge (local disk/RAM), which is handled by the data loader pre-fetching.

It seems like they are building a complex distributed system to solve a problem that is better solved by tar -cvf

replies(6): >>46255366 #>>46255422 #>>46255678 #>>46255722 #>>46255754 #>>46255888 #
1. jeremyjh ◴[] No.46255366[source]
Yeah I was a bit lost from the introduction. High performance object stores are "too expensive?" We live an era where I can store everything forever and query it in human scale time-frames at costs that are far less than what we paid for much worse technologies a decade ago. But I was thinking of datalakes, not vector stores or whatever they are trying to solve for AI.