195 points rbanffy | 1 comment
ipsum2 ◴[] No.42176882[source]
As someone who worked in the ML infra space: Google, Meta, xAI, Oracle, Microsoft, and Amazon have clusters that perform better than the highest-performing cluster on Top500. They don't submit because there's no reason to, and some want to keep the size of their clusters a secret. They're all running Nvidia. (Except Google, which uses TPUs and Nvidia.)

> El Capitan – we don’t yet know how big of a portion yet as we write this – with 43,808 of AMD’s “Antares-A” Instinct MI300A devices

By comparison, xAI announced that they have 100k H100s. The MI300A and the H100 have roughly similar performance. Meta says they're training Llama-4 on more than 100k H100s, and have the equivalent of 600k H100s' worth of compute. (Note that compute and networking are orthogonal: two clusters with the same aggregate FLOPS can differ widely in how well the interconnect lets them train a single model.)

Also, Nvidia B200s are rolling out now. They offer 2-3x the performance of H100s.
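
For a rough sense of scale, here's a back-of-the-envelope sketch in Python. The per-device numbers are approximate spec-sheet peaks (FP64 vector and FP16 dense tensor), not measured throughput, so treat the output as order-of-magnitude only:

    # Back-of-the-envelope aggregate peak compute.
    # Per-device TFLOPS are approximate spec-sheet figures (assumed):
    # (FP64 vector, FP16 dense tensor)
    PEAK_TFLOPS = {
        "H100":   (34, 990),
        "MI300A": (61, 980),
    }

    def cluster_exaflops(device, count):
        """Return (FP64, FP16) peak in exaflops for `count` devices."""
        fp64, fp16 = PEAK_TFLOPS[device]
        return count * fp64 / 1e6, count * fp16 / 1e6  # TFLOPS -> EFLOPS

    print(cluster_exaflops("H100", 100_000))   # xAI:        ~(3.4, 99.0)
    print(cluster_exaflops("MI300A", 43_808))  # El Capitan: ~(2.7, 42.9)

At FP16 tensor throughput, the number that matters for ML training, the two parts are indeed roughly comparable per device.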

replies(10): >>42176948 #>>42177276 #>>42177493 #>>42177581 #>>42177611 #>>42177644 #>>42178095 #>>42178187 #>>42178825 #>>42179038 #
zekrioca ◴[] No.42177493[source]
The Top500 list is useful as a public, standardized baseline: it is straightforward, and it has run on a predictable cadence for more than 30 years. Cloud infrastructures are trickier to compare because of their heterogeneity, their fast pace of change, and, most importantly, the lack of standardized tests, although MLCommons [1] has been very keen on helping with that.

[1] https://mlcommons.org/datasets/
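
For readers unfamiliar with it: Top500 ranks machines by the HPL (LINPACK) benchmark, a distributed FP64 dense linear solve scored by a fixed FLOP formula. A single-node toy version, with numpy standing in for the real blocked, distributed LU factorization, looks roughly like this:

    # Toy single-node version of what HPL times: solving a dense FP64
    # system Ax = b, scored with HPL's nominal operation count.
    import time
    import numpy as np

    n = 4096
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)          # LU factorization + triangular solves
    elapsed = time.perf_counter() - t0

    flops = (2 / 3) * n**3 + 2 * n**2  # HPL's nominal FLOP count
    print(f"{flops / elapsed / 1e9:.1f} GFLOPS on this node")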

replies(1): >>42178610 #
makeitdouble ◴[] No.42178610[source]
If I understand your comment correctly, we're settling for a stable but not especially relevant metric, because the real players in the market are too secretive, too fast-moving, and too far ahead to allow simple comparisons.

From a distance, it kind of sounds like listening to kids brag about their allowance while the adults won't discuss their salaries, and then trying to draw wider conclusions from that.

replies(2): >>42178935 #>>42179876 #
wbl ◴[] No.42178935[source]
Even the DoE posts Top500 results when it commissions a supercomputer.
replies(1): >>42179577 #
makeitdouble ◴[] No.42179577[source]
The DoE has absolutely no incentive (nor need, I'd argue) to compare its supercomputers to commercially owned data center operations, though.

Comparing their crazy-expensive, custom-built HPC machines to massive arrays of commodity hardware doesn't bring them additional funds, nor does it help them more PR-wise than simply being the owner of the fastest individual clusters.

Being at the top of some heap is visibly one of their goals:

https://www.energy.gov/science/high-performance-computing

replies(1): >>42180151 #
khm ◴[] No.42180151{3}[source]
DOE clusters are also massive arrays of commodity hardware. Private cloud can only keep up in low-precision work, and that is why they're still playing with remote memory access over TCP: it's good enough for web and ML workloads.

High-precision HPC exists in the private cloud, but you only hear the "we don't want to embarrass others" excuses because otherwise you would be able to calculate the cost.
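
To make the precision point concrete, here's a minimal sketch (assumes only numpy; the exact stall point depends on rounding): naively accumulating a million small values in FP16 stalls once the running sum grows large relative to FP16's spacing, while FP64 gets the right answer.

    # Naive accumulation of 1,000,000 values of 1e-4 in FP16 vs FP64.
    import numpy as np

    vals = np.full(1_000_000, 1e-4)

    fp64 = vals.astype(np.float64).sum()   # ~100.0, as expected

    fp16 = np.float16(0)
    for v in vals.astype(np.float16):
        fp16 = np.float16(fp16 + v)        # result rounds to FP16 every add

    print(fp64)         # ~100.0
    print(float(fp16))  # stalls near 0.25: adding 1e-4 stops changing the
                        # sum once FP16's spacing there exceeds 2e-4

ML training works around this with FP32 accumulators, loss scaling, and the like; traditional HPC codes simply need the wide formats end to end.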

On-prem HPC is still very, very much cheaper than renting.
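
The arithmetic is easy to sketch. Every figure below is an assumption for illustration (hardware price, overhead multiplier, utilization, and the cloud rate all vary widely by contract and scale):

    # Amortized on-prem $/GPU-hour vs an assumed on-demand cloud rate.
    HOURS_PER_YEAR = 8760

    def on_prem_per_gpu_hour(capex=30_000.0, years=4, overhead=1.5,
                             utilization=0.8):
        """overhead covers power, cooling, networking, staff (all assumed)."""
        return capex * overhead / (years * HOURS_PER_YEAR * utilization)

    cloud_rate = 3.00  # assumed $/GPU-hour on demand
    print(f"on-prem ~${on_prem_per_gpu_hour():.2f}/GPU-hr "
          f"vs cloud ${cloud_rate:.2f}/GPU-hr")
    # -> on-prem ~$1.61/GPU-hr, roughly half the assumed on-demand rate

Under these assumptions a well-utilized owned cluster undercuts on-demand rental by about 2x; reserved pricing narrows the gap, but the ownership advantage grows with utilization.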