
195 points rbanffy | 2 comments
ipsum2 ◴[] No.42176882[source]
As someone who worked in the ML infra space: Google, Meta, xAI, Oracle, Microsoft, and Amazon have clusters that perform better than the highest-performing cluster on the Top500. They don't submit because there's no reason to, and some want to keep the size of their clusters a secret. They're all running Nvidia. (Except Google, which uses both TPUs and Nvidia.)

> El Capitan – we don’t yet know how big of a portion yet as we write this – with 43,808 of AMD’s “Antares-A” Instinct MI300A devices

By comparison, xAI announced that it has 100k H100s. The MI300A and the H100 have roughly similar performance. Meta says it is training Llama-4 on more than 100k H100s and has the equivalent of 600k H100s' worth of compute. (Note that compute and networking can be orthogonal.)
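For a rough sense of scale, here is a back-of-the-envelope sketch of that comparison in Python. The device counts are the ones quoted above. The 1:1 per-device ratio is only the "roughly similar performance" assumption, and `relative_size` is a hypothetical helper name. It ignores networking entirely and says nothing about FP64 specifically.

```python
# Device counts as quoted in this thread.
EL_CAPITAN_MI300A = 43_808
XAI_H100 = 100_000
META_H100_EQUIV = 600_000


def relative_size(devices: int, baseline: int, per_device_ratio: float = 1.0) -> float:
    """Aggregate compute of `devices` relative to `baseline` devices.

    per_device_ratio is the ASSUMED performance of one device relative to
    one baseline device (1.0 means "roughly similar", as stated above).
    """
    return devices * per_device_ratio / baseline


xai_vs_el_capitan = relative_size(XAI_H100, EL_CAPITAN_MI300A)
meta_vs_el_capitan = relative_size(META_H100_EQUIV, EL_CAPITAN_MI300A)

print(f"xAI vs El Capitan:  {xai_vs_el_capitan:.1f}x")   # 2.3x
print(f"Meta vs El Capitan: {meta_vs_el_capitan:.1f}x")  # 13.7x
```

Changing `per_device_ratio` shows how sensitive the result is to the per-device assumption, which matters if the workload is FP64-heavy rather than ML training.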

Also, Nvidia B200s are rolling out now. They offer 2-3x the performance of H100s.

pclmulqdq ◴[] No.42177276[source]
B200s have an incremental increase in FP64 and FP32 performance over H100s. Those are the number formats that HPC people care about.

The MI300A can reach 150% of the peak FP64 performance of B200 devices, although AMD GPUs have historically underperformed their spec more than Nvidia GPUs have. It's possible that B200 devices are actually behind for HPC.

1. cayleyh ◴[] No.42177364[source]
Top line comparison numbers for reference: https://www.theregister.com/2024/03/18/nvidia_turns_up_the_a...

It does seem like Nvidia is prioritizing INT8/FP8 performance over FP64, which, given the current state of the ML marketplace, is a great idea.

2. nextos ◴[] No.42178086[source]
The MI300 also has decent FP16 performance (~108 TFLOPS). Not as good as Nvidia, but it's getting there. Does anyone have experience using these with JAX? Support is said to be decent, but I have no idea whether it's good enough for research-oriented tasks, i.e. stable enough for training and inference.