
93 points by rbanffy | 3 comments
pama No.42188372
Noting here that 2,700 quadrillion operations per second is less than the estimated sustained throughput of productive bfloat16 compute during the training of the large Llama 3 models, which IIRC was about 45% of 16,000 quadrillion operations per second, i.e. 16k H100s in parallel at about 0.45 MFU. The compute power of national labs has fallen far behind industry in recent years.
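
A rough back-of-envelope of those numbers (a sketch only; the ~1 PFLOP/s of dense bf16 per H100 and the 0.45 MFU figure are assumptions/recollections, not official specs):

    # Back-of-envelope: sustained bf16 throughput of a 16k-H100 training run
    # vs. the ~2.7 EFLOP/s figure quoted for the national-lab machine.
    H100_PEAK_BF16 = 1.0e15   # ~1 PFLOP/s dense bf16 per H100 (assumed)
    NUM_GPUS = 16_000
    MFU = 0.45                # model FLOPs utilization (as recalled above)

    cluster_peak = NUM_GPUS * H100_PEAK_BF16   # ~1.6e19 = 16,000 PFLOP/s
    sustained = cluster_peak * MFU             # ~7.2e18 = 7,200 PFLOP/s
    el_capitan = 2.7e18                        # ~2,700 PFLOP/s quoted above

    print(f"sustained bf16: {sustained / 1e18:.1f} EFLOP/s")
    print(f"quoted figure : {el_capitan / 1e18:.1f} EFLOP/s")
    print(f"ratio         : {sustained / el_capitan:.1f}x")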
replies(3): >>42188382, >>42188389, >>42188415
bryanlarsen No.42188415
A 64-bit float operation is >4x as expensive as a 16-bit float operation.
replies(2): >>42188503, >>42188504
1. pama No.42188504
Agreed. However, also note that if it were only matrix multiplies and no full transformer training, the performance of that Meta cluster would be closer to 16k PFLOP/s, still much faster than the El Capitan performance measured on Linpack and multiplied by 4. Other companies have presumably cabled 100k H100s together, but they don't yet publish training data for their LLMs. It is good to have competition; I just didn't expect the tables to turn so dramatically over the last two decades, from a time when governments still held the top spots in computer centers with ease to today, when the assumption is that at least ten companies have larger clusters than the most powerful governments.
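
A quick sketch of that comparison, taking the ">4x" fp64-vs-16-bit factor from the parent comment and El Capitan's reported Linpack result of roughly 1.7 EFLOP/s, both treated here as rough inputs:

    # Scale El Capitan's fp64 Linpack number by the rough 4x precision factor
    # and compare against the ~16,000 PFLOP/s peak bf16 of a 16k-H100 cluster
    # (pure matmuls, no full-training overhead).
    EL_CAPITAN_HPL_FP64 = 1.7e18       # ~1.7 EFLOP/s Rmax (approximate reported value)
    FP64_TO_16BIT_FACTOR = 4           # ">4x" factor from the parent comment
    H100_CLUSTER_PEAK_BF16 = 16_000 * 1e15   # ~16 EFLOP/s for 16k H100s (assumed)

    scaled = EL_CAPITAN_HPL_FP64 * FP64_TO_16BIT_FACTOR   # ~6.8 EFLOP/s
    print(f"El Capitan x4 : {scaled / 1e18:.1f} EFLOP/s (16-bit equivalent)")
    print(f"16k H100 peak : {H100_CLUSTER_PEAK_BF16 / 1e18:.1f} EFLOP/s")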
replies(1): >>42189800
2. sliken No.42189800
I'd expect Linpack to be much closer to a typical user research application than LLM training. My understanding of LLM training is that it's more about throughput: very predictable communication patterns, bandwidth intensive, and not latency sensitive.

Most parallel research codes, especially at this scale, have a different balance of operations to memory bandwidth and are much more sensitive to interconnect latency.

I wouldn't assume that just because various corporations have large training clusters, they could dominate HPC if they wanted to. Hyperscalers have dominated throughput for many years now, but HPC is a different beast.

replies(1): >>42190047
3. pama No.42190047
Both HPC codes and LLM training tend to get fully optimized to their hardware. When you train models with over 405B parameters and process about 2 million tokens per second, calculating derivatives on all those parameters every few seconds, you do end up at the boundary of latency and bandwidth at every scale (host to host, host to device, and the multiple rates within each device). Typical LLM training at these scales multiplexes three or more different types of parallelism to avoid keeping the devices idle, and of course it also has to deal with redundancy and frequent failures of this erratic hardware (if a single H100 fails once every five years, 100k of them would see more than two failures per hour).
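
The failure-rate arithmetic works out as follows (a sketch; the once-every-five-years MTBF per GPU is the hypothetical figure from the comment above):

    # If one H100 fails on average once every five years, the expected failure
    # rate of a 100k-GPU cluster is the per-GPU rate times 100,000.
    HOURS_PER_YEAR = 365 * 24
    per_gpu_failures_per_hour = 1 / (5 * HOURS_PER_YEAR)   # ~2.3e-5
    cluster_failures_per_hour = 100_000 * per_gpu_failures_per_hour

    print(f"expected failures per hour: {cluster_failures_per_hour:.1f}")  # ~2.3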