←back to thread

195 points rbanffy | 1 comments | | HN request time: 0.207s | source
Show context
ipsum2 ◴[] No.42176882[source]
As someone who worked in the ML infra space: Google, Meta, XAI, Oracle, Microsoft, Amazon have clusters that perform better than the highest performing cluster on Top500. They don't submit because there's no reason to, and some want to keep the size of their clusters a secret. They're all running Nvidia. (Except Google, who uses TPUs and Nvidia.)

> El Capitan – we don’t yet know how big of a portion yet as we write this – with 43,808 of AMD’s “Antares-A” Instinct MI300A devices

By comparison XAI announced that they have 100k H100s. MI300A and H100s have roughly similar performance. Meta says they're training on more than 100k H100s for Llama-4, and have the equivalent of 600k H100s worth of compute. (Note that compute and networking can be orthogonal).

Also, Nvidia B200s are rolling out now. They offer 2-3x the performance of H100s.

replies(10): >>42176948 #>>42177276 #>>42177493 #>>42177581 #>>42177611 #>>42177644 #>>42178095 #>>42178187 #>>42178825 #>>42179038 #
llm_trw ◴[] No.42178187[source]
A cluster is not a super computer.

The whole point of a super computer is that it act as much as a single machine as it is possible while a cluster is a soup of nearly independent machines.

replies(3): >>42178234 #>>42179465 #>>42180953 #
1. bravetraveler ◴[] No.42180953[source]
Put slurm on it, bam. Supercomputer.