
195 points rbanffy | 1 comments
ipsum2 No.42176882
As someone who worked in the ML infra space: Google, Meta, xAI, Oracle, Microsoft, and Amazon have clusters that perform better than the highest-performing cluster on Top500. They don't submit because there's no reason to, and some want to keep the size of their clusters a secret. They're all running Nvidia. (Except Google, which uses both TPUs and Nvidia.)

> El Capitan – we don’t yet know how big of a portion yet as we write this – with 43,808 of AMD’s “Antares-A” Instinct MI300A devices

By comparison, xAI announced that they have 100k H100s. MI300A and H100s have roughly similar performance. Meta says they're training on more than 100k H100s for Llama-4, and have the equivalent of 600k H100s worth of compute. (Note that compute and networking can be orthogonal.)

Also, Nvidia B200s are rolling out now. They offer 2-3x the performance of H100s.

replies(10): >>42176948 #>>42177276 #>>42177493 #>>42177581 #>>42177611 #>>42177644 #>>42178095 #>>42178187 #>>42178825 #>>42179038 #
zekrioca No.42177493
The Top500 list is useful as a public, standardized baseline that is straightforward and has kept a predictable periodicity for more than 30 years. It is trickier to compare cloud infras due to their heterogeneity, fast pace, and, more importantly, the lack of standardized tests, although MLCommons [1] has been very keen on helping with that.

[1] https://mlcommons.org/datasets/

replies(1): >>42178610 #
makeitdouble No.42178610
If I understand your comment correctly, we're taking a stable but not that relevant metric, because the real players of the market are too secretive, fast and far ahead to allow for simple comparisons.

From a distance, it kinda sounds like listening to kids brag about their allowance while the adults don't want to talk about their salary, and then trying to draw wider conclusions from there.

replies(2): >>42178935 #>>42179876 #
1. zekrioca No.42179876
It seems there was a misunderstanding, as I haven't made any value judgment about LINPACK.

Yes, LINPACK is indeed "old" with a heavy focus on compute power. However, its simplicity serves as a reliable baseline for the types of workflows that supercomputers are designed to handle. Also, at their core, most AI workloads perform essentially the same operations as HPC, albeit with less stability, which, I admit, is a feature, but is likely the reason AI-focused systems do not prioritize LINPACK as much.

I am simply saying that any useful metric needs to be not only "stable" but also simple to grasp. Take Green500: it is probably a significant benchmark for understanding how algorithms consume power, but it is "too complex" to explain. Yet many cloud providers with their AI supercomputers avoid competing against HPC supercomputers in this domain.

This avoidance isn’t necessarily due to secrecy but rather to inefficiencies inherent to cloud systems. Consider PUE (Power Usage Effectiveness), a highly misleading metric that cloud providers frequently tout. PUE can easily be manipulated, especially with the use of liquid cooling, which is why optimizing for it has become a major factor contributing to water disruptions in several large cities worldwide.
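
For reference, here is a minimal sketch of why PUE on its own says nothing about water. PUE is total facility energy divided by IT equipment energy; water only shows up in a separate metric, WUE (liters of site water per kWh of IT energy). The two sites and all of their numbers below are hypothetical, picked only to illustrate the point, and the function names are my own.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def wue(site_water_liters: float, it_equipment_kwh: float) -> float:
    """Water Usage Effectiveness: site water use (L) / IT equipment energy (kWh)."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return site_water_liters / it_equipment_kwh


# Hypothetical sites with the same IT load (illustrative numbers, not measurements).
sites = {
    # Mechanical chillers: more electricity for cooling, little water.
    "chiller_cooled": {"it_kwh": 1000.0, "overhead_kwh": 500.0, "water_l": 100.0},
    # Evaporative cooling: less electricity for cooling, much more water.
    "evaporative_cooled": {"it_kwh": 1000.0, "overhead_kwh": 100.0, "water_l": 1800.0},
}

report = {}
for name, s in sites.items():
    total_kwh = s["it_kwh"] + s["overhead_kwh"]
    report[name] = {
        "pue": pue(total_kwh, s["it_kwh"]),
        "wue": wue(s["water_l"], s["it_kwh"]),
    }

for name, metrics in report.items():
    print(f"{name}: PUE={metrics['pue']:.2f}, WUE={metrics['wue']:.2f} L/kWh")
```

In this made-up comparison the evaporative site reports the better (lower) PUE while using far more water per unit of IT energy, which is the trade-off PUE hides.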