195 points rbanffy | 2 comments
ipsum2 No.42176882
As someone who worked in the ML infra space: Google, Meta, xAI, Oracle, Microsoft, and Amazon have clusters that perform better than the highest-performing cluster on Top500. They don't submit because there's no reason to, and some want to keep the size of their clusters a secret. They're all running Nvidia. (Except Google, which uses both TPUs and Nvidia GPUs.)

> El Capitan – we don’t yet know how big of a portion yet as we write this – with 43,808 of AMD’s “Antares-A” Instinct MI300A devices

By comparison, xAI announced that they have 100k H100s. The MI300A and H100 have roughly similar performance. Meta says they're training on more than 100k H100s for Llama-4, and have the equivalent of 600k H100s' worth of compute. (Note that compute and networking can be orthogonal.)

Also, Nvidia B200s are rolling out now. They offer 2-3x the performance of H100s.

replies(10): >>42176948 #>>42177276 #>>42177493 #>>42177581 #>>42177611 #>>42177644 #>>42178095 #>>42178187 #>>42178825 #>>42179038 #
danpalmer No.42176948
Google is running its own TPU hardware for internal workloads. I believe Nvidia is just resold for cloud customers.
replies(3): >>42177022 #>>42178089 #>>42178914 #
1. deeth_starr_v No.42178914
Not true. Apple trained some models on Google's TPUs.
replies(1): >>42178931 #
2. danpalmer No.42178931
Apologies. To be clear, what I meant was that, to my knowledge, Google doesn't use GPUs for its own stuff, but does sell both TPUs and GPUs to others on Cloud.

Also, to be clear, I have no internal info about this; I'm going by external stuff I've seen.