
195 points rbanffy | 4 comments
ipsum2 ◴[] No.42176882[source]
As someone who worked in the ML infra space: Google, Meta, xAI, Oracle, Microsoft, Amazon have clusters that perform better than the highest performing cluster on Top500. They don't submit because there's no reason to, and some want to keep the size of their clusters a secret. They're all running Nvidia. (Except Google, which uses TPUs and Nvidia.)

> El Capitan – we don’t yet know how big of a portion yet as we write this – with 43,808 of AMD’s “Antares-A” Instinct MI300A devices

By comparison, xAI announced that they have 100k H100s. The MI300A and the H100 have roughly similar performance. Meta says they're training on more than 100k H100s for Llama-4, and have the equivalent of 600k H100s worth of compute. (Note that compute and networking can be orthogonal.)

Also, Nvidia B200s are rolling out now. They offer 2-3x the performance of H100s.

replies(10): >>42176948 #>>42177276 #>>42177493 #>>42177581 #>>42177611 #>>42177644 #>>42178095 #>>42178187 #>>42178825 #>>42179038 #
llm_trw ◴[] No.42178187[source]
A cluster is not a super computer.

The whole point of a super computer is that it acts as much like a single machine as possible, while a cluster is a soup of nearly independent machines.

replies(3): >>42178234 #>>42179465 #>>42180953 #
almostgotcaught ◴[] No.42179465[source]
i wish people wouldn't make stuff up just to sound cool.

like do you have actual experience with gov/edu HPC? i doubt it because you couldn't be more wrong - lab HPC clusters are just very very poorly (relative to FAANG) strung together nodes. there is absolutely no sense in which they are "one single machine" (nothing is "abstracted over" except NFS).

what you're saying is trivially false because no one ever requests all the machines at once (except when they're running linpack to produce top500 numbers). the rest of the time the workflow is exactly like in any industrial cluster: request some machines (through slurm), get those machines, run your job (hopefully you distributed the job across the nodes correctly), release those machines. if i still had my account i could tell you literally how many different jobs are running right now on polaris.
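(A minimal sketch of the "distribute the job across the nodes" step from the workflow above, assuming the work is a plain list that can be summed in pieces. The names `partition` and `run_job` are hypothetical and for illustration only; no Slurm API is involved, and the "nodes" here are just loop iterations standing in for separately allocated machines.)

```python
def partition(n_items: int, n_nodes: int) -> list[range]:
    """Split indices 0..n_items-1 into n_nodes contiguous, near-equal chunks."""
    base, extra = divmod(n_items, n_nodes)
    chunks = []
    start = 0
    for node in range(n_nodes):
        # The first `extra` nodes take one additional item each.
        size = base + (1 if node < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def run_job(data: list[int], n_nodes: int) -> int:
    """Hypothetical job: each 'node' sums only its own slice of the data."""
    # Nothing is shared automatically; each node sees just its chunk.
    partials = [
        sum(data[i] for i in chunk)
        for chunk in partition(len(data), n_nodes)
    ]
    # The user's code, not the cluster, combines the partial results.
    return sum(partials)
```

The point of the sketch is that both the split and the final combine are written by the user; the scheduler only hands over the nodes.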

replies(1): >>42179716 #
bocklund ◴[] No.42179716[source]
Actually, LLNL (the site of El Capitan) has a process for requesting Dedicated Application Time (a DAT) where you use up to a whole machine, usually over a weekend. They occur fairly regularly. Mostly it's lots of individual users and jobs, like you said though.
replies(1): >>42180282 #
almostgotcaught ◴[] No.42180282{3}[source]
> where you use up to a whole machine

i mean rick stevens et al can grab all of polaris too but even so - it's just a bunch of nodes and you're responsible for distributing your work across those nodes efficiently. there's no sense in which it's a "single computer" in any way, shape or form.

replies(1): >>42180326 #
llm_trw ◴[] No.42180326{4}[source]
The same way that you're responsible for distributing your single threaded code between cores on your desktop.
replies(2): >>42182694 #>>42183140 #
1. almostgotcaught ◴[] No.42183140{5}[source]
Tell me you've never run a distributed workload without telling me. You realize if what you were saying were true, HPC would be trivial. In fact it takes a whole lot of PhDs to manage the added complexity because it's not just a "single computer".
replies(1): >>42183388 #
2. llm_trw ◴[] No.42183388[source]
If you think parallelizing single threaded code is trivial ... well there's nothing else to say really.
replies(1): >>42183483 #
3. almostgotcaught ◴[] No.42183483[source]
Is there like a training program available for learning how to be this obstinate? I would love to attend so that I can win fights with my wife.
replies(1): >>42187852 #
4. davrosthedalek ◴[] No.42187852{3}[source]
Maybe llm_trw is your wife?