
195 points by rbanffy | 6 comments
ipsum2 ◴[] No.42176882[source]
As someone who worked in the ML infra space: Google, Meta, xAI, Oracle, Microsoft, and Amazon have clusters that perform better than the highest-performing cluster on Top500. They don't submit because there's no reason to, and some want to keep the size of their clusters a secret. They're all running Nvidia. (Except Google, who uses TPUs and Nvidia.)

> El Capitan – we don’t yet know how big of a portion yet as we write this – with 43,808 of AMD’s “Antares-A” Instinct MI300A devices

By comparison, xAI announced that they have 100k H100s. The MI300A and the H100 have roughly similar performance. Meta says they're training Llama-4 on more than 100k H100s, and have the equivalent of 600k H100s worth of compute. (Note that compute and networking can be orthogonal.)

Also, Nvidia B200s are rolling out now. They offer 2-3x the performance of H100s.

replies(10): >>42176948 #>>42177276 #>>42177493 #>>42177581 #>>42177611 #>>42177644 #>>42178095 #>>42178187 #>>42178825 #>>42179038 #
1. maratc ◴[] No.42177611[source]
> Nvidia B200s ... offer 2-3x the performance of H100s

For ML, not for HPC. ML and HPC are two completely different, only loosely related fields.

ML tasks do great with low precision: 16- and 8-bit precision is fine, and arguably good results can be achieved even with 4-bit precision [0][1]. That won't do for HPC tasks like predicting global weather or computational biology -- those need 64- to 128-bit precision.
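To make that concrete, here's a toy NumPy sketch (my own illustration, not from [0] or [1]): a long accumulation of tiny increments -- the kind of thing a time-stepping simulation does constantly -- silently stalls in FP16 while FP64 gets the right answer.

  import numpy as np

  # Toy example: accumulate a small step 100,000 times.
  step64, total64 = np.float64(1e-4), np.float64(0.0)
  step16, total16 = np.float16(1e-4), np.float16(0.0)
  for _ in range(100_000):
      total64 += step64
      total16 += step16   # rounds back to the old value once total16 >> step16
  print(total64)  # ~10.0
  print(total16)  # stalls around 0.25 once the step falls below half a ULP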

Nvidia needs to decide how to divide the billions of transistors on their new silicon. Greatly oversimplifying, they can choose to make one of the following:

  *  Card A with *n* FP64 cores, or 
  *  Card B with *2n* FP32 cores, or 
  *  Card C with *4n* FP16 cores, or 
  *  Card D with *8n* FP8 cores, or (theoretically)
  *  Card E with *16n* FP4 cores (not sure if FP4 is a thing). 
Card A would give the HPC guys n usable cores, and it would give the ML guys n usable cores. At the other end, Card E would give the ML guys 16n usable cores (and zero usable cores for the HPC guys). It's no wonder that the HPC crowd wants Nvidia to produce Card A, while the ML crowd wants Card E. Given that all the hype and money are currently with the ML guys (and $NVDA reflects that), Nvidia will make a combination of different cores that is much, much closer to Card E than to Card A.
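A minimal sketch of that budget split, using the same oversimplified assumption that each halving of precision doubles the core count (the numbers are just the hypothetical n-multiples from the list above, nothing measured):

  # Hypothetical cards A-E: (native precision in bits, cores per transistor budget n)
  n = 1
  cards = {"A": (64, 1 * n), "B": (32, 2 * n), "C": (16, 4 * n),
           "D": (8, 8 * n), "E": (4, 16 * n)}

  for name, (bits, cores) in cards.items():
      ml_usable = cores                        # ML is happy at any of these precisions
      hpc_usable = cores if bits >= 64 else 0  # HPC effectively needs FP64
      print(f"Card {name}: ML {ml_usable}n, HPC {hpc_usable}n")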

Their new offerings are arguably worse than their older ones for HPC tasks, and the feeling among the HPC crowd is that "Nvidia and AMD are in the process of abandoning this market".

[0] https://papers.nips.cc/paper/2020/file/13b919438259814cd5be8...

[1] https://arxiv.org/abs/2212.09720

replies(5): >>42178357 #>>42178713 #>>42179347 #>>42180055 #>>42185923 #
2. touisteur ◴[] No.42178357[source]
With the B100 somehow announced to have lower scalar FP64 throughput than the H100 (did they remove the DP tensor cores?), one will have to rely on Ozaki schemes (DGEMM built from int8 tensor cores), and the recent body of work on mixed-precision linear algebra shows there's a lot of computing power to be harnessed from tensor cores. One of the problems of HPC now is the ossification of some codebases (or the lack of people available to port, code, and optimize them). You shouldn't have to rewrite everything every 5 years, but the hardware vendors go where they go, and we still haven't found the right level of abstraction to avoid big porting efforts.
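For anyone who hasn't seen the Ozaki trick: the idea is to slice each FP64 value into a handful of small integers, do the slice-by-slice matrix products exactly in int8 with wide accumulators, and recombine. A rough NumPy sketch of the algebra (slice counts, scaling, and names are illustrative; real implementations handle per-row/column exponents and run the inner products on the tensor cores):

  import numpy as np

  BITS = 7  # bits per slice, so slice values stay in int8 range

  def to_slices(x, num_slices=4):
      # Decompose x (entries assumed in [-1, 1)) so that
      # x ~= sum_k slices[k] * 2**(-BITS*(k+1))
      slices, residual = [], x.copy()
      for k in range(num_slices):
          scale = 2.0 ** (BITS * (k + 1))
          s = np.floor(residual * scale).astype(np.int64)  # int8-range integers
          slices.append(s)
          residual = residual - s / scale
      return slices

  def sliced_matmul(a, b, num_slices=4):
      # Each s1 @ s2 is exact integer arithmetic; on real hardware it would be
      # an int8 tensor-core GEMM with a wide accumulator (int64 stands in here).
      sa, sb = to_slices(a, num_slices), to_slices(b, num_slices)
      out = np.zeros((a.shape[0], b.shape[1]))
      for i, s1 in enumerate(sa):
          for j, s2 in enumerate(sb):
              out += (s1 @ s2) * 2.0 ** (-BITS * (i + j + 2))
      return out

  rng = np.random.default_rng(0)
  A, B = rng.uniform(-1, 1, (64, 64)), rng.uniform(-1, 1, (64, 64))
  print(np.max(np.abs(sliced_matmul(A, B) - A @ B)))  # tiny reconstruction error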
3. ipsum2 ◴[] No.42178713[source]
Yes, that's a great point that I missed. From anecdotal evidence, it seems more people are using supercomputers for ML use cases that would traditionally have been handled by HPC (e.g. training models for weather forecasts).
4. layla5alive ◴[] No.42179347[source]
You've heard of SIMD - it's possible to do both, in terms of throughput, with instruction/scheduler/port complexity overhead of course.
5. dragontamer ◴[] No.42180055[source]
Doesn't multiplier area scale at O(n^2 * log(n))? (At least, I'm pretty sure the Wallace tree multiplier circuit is somewhere in that order.)

So a 64-bit multiplier is something like 32x more area than a 16-bit multiplier.

But what you say is correct for RAM area or the number of bits you need for register space. So taken holistically, it's difficult to say...

Okay, 64-bit FP is really only 53 bits of mantissa and 16-bit FP is actually about 11 bits. But you know what I mean. I'm still doing quick napkin math here, nothing formal.
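Plugging numbers into that assumed O(n^2 * log n) scaling (napkin math only, same caveats as above):

  import math

  area = lambda n: n * n * math.log2(n)   # assumed multiplier area model
  print(area(64) / area(16))              # ~24x using full word widths
  print(area(53) / area(11))              # ~38x using FP64 vs FP16 mantissa bits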

-------

We can ignore adder and subtractor circuits because they are so small. For floating point, division is often implemented as a reciprocal followed by a multiplication (true division is very expensive).
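The reciprocal path usually means something like Newton-Raphson refinement from a rough seed, which only needs multipliers and adders. A toy sketch (the seed value and iteration count here are arbitrary; hardware takes the seed from a small lookup table):

  # Newton step for f(x) = 1/x - d: each iteration roughly doubles the correct bits.
  def reciprocal(d, seed, iters=4):
      x = seed
      for _ in range(iters):
          x = x * (2.0 - d * x)
      return x

  print(reciprocal(3.0, 0.3), 1 / 3.0)  # converges to ~0.333333...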

6. ◴[] No.42185923[source]