93 points rbanffy | 16 comments
1. pama ◴[] No.42188372[source]
Noting here that 2,700 quadrillion operations per second is less than the estimated sustained throughput of productive bfloat16 compute during the training of the large Llama 3 models, which IIRC was about 45% of 16,000 quadrillion operations per second, i.e. 16k H100s in parallel at about 0.45 MFU. The compute power of national labs has fallen far behind industry in recent years.
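For anyone checking the arithmetic, here is a rough back-of-the-envelope sketch in Python (the per-GPU peak and MFU figures are the estimates from this comment, not official numbers):

```python
# Sanity check of the throughput comparison above.
# All figures are estimates quoted in this thread, not official specs.
h100_peak_bf16_pflops = 1000   # ~1 PFLOP/s dense bf16 per H100 (approx.)
n_gpus = 16_000                # reported Llama 3 training cluster size
mfu = 0.45                     # estimated model FLOPs utilization

cluster_peak = h100_peak_bf16_pflops * n_gpus  # ~16,000 PFLOP/s peak
sustained = cluster_peak * mfu                 # ~7,200 PFLOP/s sustained

el_capitan_linpack = 2_700     # PFLOP/s, the figure under discussion

print(f"Sustained bf16 estimate: {sustained:,.0f} PFLOP/s")
print(f"El Capitan Linpack:      {el_capitan_linpack:,} PFLOP/s")
print(f"Ratio: {sustained / el_capitan_linpack:.1f}x")
```

(Comparing bf16 throughput against fp64 LINPACK is apples-to-oranges, as the replies below point out.)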
replies(3): >>42188382 #>>42188389 #>>42188415 #
2. handfuloflight ◴[] No.42188382[source]
Any idea how that stacks up with GPT-4?
replies(1): >>42189906 #
3. alephnerd ◴[] No.42188389[source]
Training an LLM (basically transformers) is a different workflow from nuclear simulations (basically Monte Carlo simulations).

There are a lot of intricacies, but at a high level they require different compute approaches.
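A toy contrast, purely schematic (neither is real training or simulation code): the LLM step is one big, regular matrix multiply, while the Monte Carlo step is dominated by per-particle branching:

```python
import numpy as np

rng = np.random.default_rng(0)

# LLM-style inner step: one large, regular dense matmul. Maps cleanly onto
# low-precision tensor cores, with predictable data movement.
a = rng.standard_normal((1024, 1024), dtype=np.float32)
b = rng.standard_normal((1024, 1024), dtype=np.float32)
c = a @ b

# Monte Carlo-style inner step: per-particle, data-dependent control flow,
# typically run at fp64 for accuracy. Hard to batch the same way.
def particle_step(energy):
    if rng.random() < 0.3:            # absorbed?
        return 0.0
    return energy * rng.random()      # scattered, with a random energy loss

energies = [particle_step(1.0) for _ in range(10_000)]
```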

replies(3): >>42188413 #>>42188417 #>>42188497 #
4. handfuloflight ◴[] No.42188413[source]
Can you expand on why operations per second is not an apt comparison?
replies(1): >>42188538 #
5. bryanlarsen ◴[] No.42188415[source]
A 64-bit float operation is >4x as expensive as a 16-bit float operation.
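For concreteness, on an H100 the gap is much larger than 4x. A quick sketch (the spec numbers are approximate public figures, used only for illustration):

```python
# Approximate H100 SXM peak throughput (dense, tensor cores), in TFLOP/s.
# Ballpark public spec figures; treat as illustrative, not authoritative.
fp64_tensor_tflops = 67
bf16_tensor_tflops = 989

ratio = bf16_tensor_tflops / fp64_tensor_tflops
print(f"bf16 is ~{ratio:.0f}x faster than fp64 on this part")
# -> roughly 15x, so ">4x" is a conservative lower bound
```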
replies(2): >>42188503 #>>42188504 #
6. pama ◴[] No.42188417[source]
Absolutely. Though the performance of El Capitan is only measured by a LINPACK benchmark, not the actual application.
replies(1): >>42188515 #
7. Koshkin ◴[] No.42188497[source]
This is about the raw compute, no matter the workflow.
replies(1): >>42193796 #
8. Koshkin ◴[] No.42188503[source]
In terms of heat dissipation, maybe, yes. But not necessarily in time.
9. pama ◴[] No.42188504[source]
Agreed. However, also note that if it were only matrix multiplies and no full transformer training, the performance of that Meta cluster would be closer to 16k PFLOP/s, still much faster than the El Capitan performance measured on LINPACK and multiplied by 4. Other companies have presumably cabled 100k H100s together, but they don't yet publish training details for their LLMs. It is good to have competition; I just didn't expect the tables to turn so dramatically over the last two decades, from a time when governments still ruled the top spots in computing centers with ease, to nowadays, when the assumption is that at least ten companies have larger clusters than the most powerful governments.
replies(1): >>42189800 #
10. pertymcpert ◴[] No.42188515{3}[source]
I thought modern supercomputers use benchmarks like HPCG instead of LINPACK?
replies(1): >>42188963 #
11. pertymcpert ◴[] No.42188538{3}[source]
When you're doing scientific simulations, you're generally a lot more sensitive to FP precision than ML training, which is very, very tolerant of reduced precision. So while FP8 might be fine for transformer networks, it would likely be unacceptably inaccurate, or outright unusable, for simulations.
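A quick way to see the sensitivity (a minimal NumPy sketch, not anything from a real code base): a naive running sum in fp16 loses terms that fp64 keeps.

```python
import numpy as np

# Naive running sum of 100,000 copies of 1e-4. Exact answer: 10.0.
total16 = np.float16(0.0)
total64 = np.float64(0.0)
for _ in range(100_000):
    total16 = np.float16(total16 + np.float16(1e-4))
    total64 = total64 + 1e-4

print(total64)  # ~10.0
print(total16)  # stalls near 0.25: each new term falls below the rounding step
```

Real simulation codes accumulate quantities like this constantly, which is one reason fp64 remains the default there.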
12. fancyfredbot ◴[] No.42188963{4}[source]
The TOP500 list includes both. There is no HPCG result for El Capitan yet:

https://top500.org/lists/hpcg/2024/11/

13. sliken ◴[] No.42189800{3}[source]
I'd expect LINPACK to be much closer to a user research application than LLM training. My understanding of LLMs is that training is mostly about throughput, with very predictable communication patterns: bandwidth intensive, but not latency sensitive.

Most parallel research, especially at this scale, involves a different balance of operations to memory bandwidth, and is much more worried about interconnect latency.

I wouldn't assume that just because various corporations have large training clusters, they could dominate HPC if they wanted to. Hyperscalers have dominated throughput for many years now, but HPC is a different beast.

replies(1): >>42190047 #
14. pama ◴[] No.42189906[source]
If I knew, I wouldn’t be able to disclose it :-)
15. pama ◴[] No.42190047{4}[source]
All HPC and LLM workloads tend to get fully optimized to their hardware specs. When you train models with over 405B parameters and process about 2 million tokens per second, calculating derivatives on all those parameters every few seconds, you do end up at the boundary of latency and bandwidth at all scales (host to host, host to device, and the multiple rates within each device). Typical LLM training at these scales multiplexes three or more different types of parallelism to avoid keeping the devices idle, and of course it also has to deal with redundancy and frequent failures of this erratic hardware (if a single H100 fails once every five years, 100k of them would have more than two failures per hour).
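The failure arithmetic checks out; a quick sketch (the once-per-five-years MTBF is the hypothetical figure from this comment, assuming independent failures):

```python
# Expected failure rate for a large GPU fleet, assuming independent failures.
# The 5-year per-GPU MTBF is the hypothetical from the comment above.
mtbf_hours = 5 * 365 * 24        # ~43,800 hours per GPU
n_gpus = 100_000

failures_per_hour = n_gpus / mtbf_hours
print(f"{failures_per_hour:.1f} expected failures per hour")  # ~2.3
```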
16. alephnerd ◴[] No.42193796{3}[source]
It isn't. I recommend reading u/pertymcpert's response.