It seems there was a misunderstanding, as I haven't made any value judgment about LINPACK.
Yes, LINPACK is indeed "old" and heavily focused on raw compute, but that simplicity is exactly what makes it a reliable baseline for the kinds of workloads supercomputers are built to handle. And at their core, most AI workloads perform essentially the same dense linear-algebra operations as HPC, just with less numerical stability from the reduced precision they run at, which, I admit, is a feature, but it is probably also why AI-focused systems do not prioritize LINPACK as much.
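To make that concrete, here is a minimal sketch (not a benchmark; the matrix size and precision choice are arbitrary) of the trade I mean: the same dense matrix product that dominates LINPACK, once in FP64 and once at the reduced precision typical of AI workloads.

```python
import numpy as np

# Same dense matrix product that dominates LINPACK (FP64)
# versus a reduced-precision version common in AI workloads.
rng = np.random.default_rng(0)
a = rng.standard_normal((512, 512))
b = rng.standard_normal((512, 512))

ref = a @ b  # FP64: LINPACK-style arithmetic
lowp = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float64)

# The relative error is the "stability" you trade away for throughput.
rel_err = np.linalg.norm(ref - lowp) / np.linalg.norm(ref)
print(f"relative error at FP16: {rel_err:.2e}")
```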
My point is simply that any useful metric needs to be not only stable but also simple to grasp. Take Green500: it is probably the most telling benchmark for how much power a system consumes relative to the work it does, yet it is "too complex" to explain, and notably, many cloud providers with their AI supercomputers avoid competing against HPC supercomputers in that arena.
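For what it's worth, the figure Green500 ranks by is conceptually simple: sustained HPL performance divided by the power drawn while achieving it. A quick sketch with placeholder numbers (not real submissions):

```python
# Green500-style efficiency: sustained HPL performance per watt.
rmax_gflops = 1.2e9   # hypothetical sustained Rmax, in GFLOPS
power_kw = 22_000     # hypothetical average system power during the run, in kW

gflops_per_watt = rmax_gflops / (power_kw * 1_000)
print(f"{gflops_per_watt:.1f} GFLOPS/W")  # the single number the list ranks by
```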
This avoidance isn't necessarily about secrecy; it is more about inefficiencies inherent to cloud systems. Consider PUE (Power Usage Effectiveness), a highly misleading metric that cloud providers love to tout. Because PUE only compares total facility energy against IT equipment energy, it can easily be gamed, especially with liquid or evaporative cooling, and optimizing for it has become a major factor behind water disruptions in several large cities around the world.
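A toy illustration of that blind spot, with purely made-up figures: switching from electrically driven chillers to evaporative cooling shrinks the energy overhead in the numerator, so the PUE looks better, while the water the approach consumes never appears anywhere in the metric.

```python
# PUE = total facility energy / IT equipment energy. Water never enters it.
# All figures below are illustrative only.
it_energy_mwh = 10_000

scenarios = [
    # (name, cooling/overhead energy in MWh, water use in megaliters)
    ("chillers", 4_000, 5),      # more electrical overhead, little water
    ("evaporative", 1_500, 120), # less electrical overhead, far more water
]

for name, overhead_mwh, water_ml in scenarios:
    pue = (it_energy_mwh + overhead_mwh) / it_energy_mwh
    print(f"{name}: PUE = {pue:.2f}, water = {water_ml} ML")  # water is invisible to PUE
```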