kburman ◴[] No.46255028[source]
I feel like this product is optimizing for an anti-pattern.

The blog argues that AI workloads are bottlenecked by latency because of 'millions of small files.' But if you are training on millions of loose 4KB objects directly from network storage, your data pipeline is the problem, not the storage layer.

Data Formats: Standard practice is to use formats like WebDataset, Parquet, or TFRecord to chunk small files into large, sequential blobs. This negates the need for high-IOPS metadata operations and makes standard S3 throughput the only metric that matters (which is already plentiful).
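
For concreteness, a minimal sketch of that packing step, assuming a directory of loose sample files on local disk; the paths, shard size, and naming scheme are illustrative, not anything the article or WebDataset prescribes:

    # Pack loose sample files into fixed-size tar shards (WebDataset-style layout).
    import os
    import tarfile

    def make_shards(src_dir, out_dir, samples_per_shard=10_000):
        files = sorted(os.listdir(src_dir))
        os.makedirs(out_dir, exist_ok=True)
        for start in range(0, len(files), samples_per_shard):
            shard_name = f"shard-{start // samples_per_shard:06d}.tar"
            with tarfile.open(os.path.join(out_dir, shard_name), "w") as tar:
                for name in files[start:start + samples_per_shard]:
                    # Keep the basename so a loader can recover the sample key later.
                    tar.add(os.path.join(src_dir, name), arcname=name)

    make_shards("dataset/loose_files", "dataset/shards")

The training job then issues a handful of large sequential reads per shard instead of millions of tiny GETs.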

Caching: Most high-performance training jobs hydrate local NVMe scratch space on the GPU nodes. S3 is just the cold source of truth. We don't need sub-millisecond access to the source of truth; we need it at the edge (local disk/RAM), which is handled by the data loader pre-fetching.
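
A rough sketch of that hydrate-then-train pattern, assuming boto3 and a local NVMe mount; the bucket, prefix, and /scratch path are made-up placeholders:

    # Copy shards from S3 (the cold source of truth) onto local NVMe before training.
    import os
    import boto3

    s3 = boto3.client("s3")
    bucket, prefix, scratch = "my-training-data", "dataset/shards/", "/scratch/shards"
    os.makedirs(scratch, exist_ok=True)

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            local = os.path.join(scratch, os.path.basename(obj["Key"]))
            if not os.path.exists(local):  # cheap resume after a restart
                s3.download_file(bucket, obj["Key"], local)
    # The data loader then reads from /scratch at local-disk speed.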

It seems like they are building a complex distributed system to solve a problem that is better solved by tar -cvf.

replies(6): >>46255366 #>>46255422 #>>46255678 #>>46255722 #>>46255754 #>>46255888 #
deliciousturkey ◴[] No.46255754[source]
In AI training, you want to sample the dataset in arbitrary fashion. You may want to arbitrarily subset your dataset for specific jobs. These are fundamentally opposed demands compared to linear access: to make your tar-file approach work, the data has to be ordered to match the sample order of your training workload, which couples data storage and sampler design.

There are solutions for this, but the added complexity is significant. In any case, your training code and data storage become tightly coupled. If a faster storage solution lets you avoid that, I for one would be highly appreciative of it.

replies(1): >>46257281 #
kburman ◴[] No.46257281[source]
- Modern DL frameworks (PyTorch DataLoader, WebDataset, NVIDIA DALI) do not require random access to disk. They stream large sequential shards into a RAM buffer and shuffle within that buffer. As long as the buffer size is significantly larger than the batch size, the statistical convergence of the model is essentially the same as with perfect random sampling (a minimal sketch of this buffering pattern follows after this list).

- AI training is a bandwidth problem, not a latency problem. GPUs need to be fed at 10GB/s+. Making millions of small HTTP requests introduces massive overhead (headers, SSL handshakes, TTFB) that kills bandwidth. Even if the storage engine has 0ms latency, the network stack does not (see the back-of-envelope arithmetic after this list).

- If you truly need "arbitrary subsetting" without downloading a whole tarball, formats like Parquet or indexed TFRecords allow HTTP Range Requests. You can fetch specific byte ranges from a large blob without "coupling" the storage layout significantly.
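
On the first point, a minimal sketch of the shuffle-buffer pattern (the same idea WebDataset and tf.data use); the buffer size and seed are illustrative:

    # Approximate random sampling over a sequentially-read stream with a bounded buffer.
    # Only the buffer lives in RAM; the stream itself is read in storage order.
    import random

    def buffered_shuffle(stream, buffer_size=10_000, seed=0):
        rng = random.Random(seed)
        buf = []
        for sample in stream:
            buf.append(sample)
            if len(buf) >= buffer_size:
                i = rng.randrange(len(buf))
                buf[i], buf[-1] = buf[-1], buf[i]  # yield a random element, keep the rest
                yield buf.pop()
        rng.shuffle(buf)  # drain the tail
        yield from buf

    # Mixing is good as long as buffer_size is much larger than the batch size.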
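
On the second point, some back-of-envelope arithmetic; the 4KB object size and 1ms of per-request overhead are assumptions for illustration, not measurements:

    object_size = 4 * 1024       # 4 KB per object (assumed)
    request_overhead = 1e-3      # ~1 ms of handshake/TTFB per request (assumed)
    target_bandwidth = 10e9      # 10 GB/s to keep the GPUs fed

    per_conn = object_size / request_overhead          # ~4 MB/s per serial connection
    print(f"{per_conn / 1e6:.1f} MB/s per connection")
    print(f"{target_bandwidth / per_conn:,.0f} requests in flight to hit 10 GB/s")
    # Roughly 2,400 concurrent requests, versus a handful of large sequential reads.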
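
And for the third point, a sketch of a byte-range fetch, assuming boto3; the bucket/key names and the (offset, length) pair are placeholders that would normally come from a Parquet footer or a TFRecord index:

    # Fetch one record out of a large blob without downloading the whole thing.
    import boto3

    s3 = boto3.client("s3")

    def read_record(bucket, key, offset, length):
        resp = s3.get_object(
            Bucket=bucket,
            Key=key,
            Range=f"bytes={offset}-{offset + length - 1}",  # inclusive end (RFC 7233)
        )
        return resp["Body"].read()

    payload = read_record("my-training-data", "dataset/shards/shard-000000.tar",
                          1_048_576, 65_536)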

replies(1): >>46257440 #
deliciousturkey ◴[] No.46257440[source]
Highly dependent on what you are training. "Shuffling within a buffer" still makes your sampling dependent on the data storage order. PyTorch DataLoader does not handle this for you. High-level libraries like DALI do, but that is exactly the coupling I wanted to avoid. These libraries have specific use cases in mind, and therefore come with restrictions that may or may not suit your needs.

> AI training is a bandwidth problem, not a latency problem. GPUs need to be fed at 10GB/s+. Making millions of small HTTP requests introduces massive overhead (headers, SSL handshakes, TTFB) that kills bandwidth. Even if the storage engine has 0ms latency, the network stack does not.

Agree that throughput is more of an issue than latency, since you can queue data into CPU memory. Small-object throughput is definitely an issue, though, which is what I was talking about. Also, there's no need to use HTTP for your requests, so HTTP and TLS overheads are more a self-inflicted problem of the storage system than a fundamental one.

> You can fetch specific byte ranges from a large blob without "coupling" the storage layout significantly.

This has the exact same throughput problems as small objects, though.