
195 points rbanffy | 6 comments
ipsum2 ◴[] No.42176882[source]
As someone who worked in the ML infra space: Google, Meta, XAI, Oracle, Microsoft, Amazon have clusters that perform better than the highest performing cluster on Top500. They don't submit because there's no reason to, and some want to keep the size of their clusters a secret. They're all running Nvidia. (Except Google, who uses TPUs and Nvidia.)

> El Capitan – we don’t yet know how big of a portion yet as we write this – with 43,808 of AMD’s “Antares-A” Instinct MI300A devices

By comparison XAI announced that they have 100k H100s. MI300A and H100s have roughly similar performance. Meta says they're training on more than 100k H100s for Llama-4, and have the equivalent of 600k H100s worth of compute. (Note that compute and networking can be orthogonal).

Also, Nvidia B200s are rolling out now. They offer 2-3x the performance of H100s.

replies(10): >>42176948 #>>42177276 #>>42177493 #>>42177581 #>>42177611 #>>42177644 #>>42178095 #>>42178187 #>>42178825 #>>42179038 #
llm_trw ◴[] No.42178187[source]
A cluster is not a super computer.

The whole point of a super computer is that it acts as much like a single machine as possible, while a cluster is a soup of nearly independent machines.

replies(3): >>42178234 #>>42179465 #>>42180953 #
almostgotcaught ◴[] No.42179465[source]
i wish people wouldn't make stuff up just to sound cool.

like do you have actual experience with gov/edu HPC? i doubt it because you couldn't be more wrong - lab HPC clusters are just very very poorly (relative to FAANG) strung together nodes. there is absolutely no sense in which they are "one single machine" (nothing is "abstracted over" except NFS).

what you're saying is trivially false because no one ever requests all the machines at once (except when they're running linpack to produce top500 numbers). the rest of the time the workflow is exactly like in any industrial cluster: request some machines (through slurm), get those machines, run your job (hopefully you distributed the job across the nodes correctly), release those machines. if i still had my account i could tell you literally how many different jobs are running right now on polaris.

replies(1): >>42179716 #
bocklund ◴[] No.42179716[source]
Actually, LLNL (the site of El Capitan) has a process for requesting Dedicated Application Time (a DAT) where you use up to a whole machine, usually over a weekend. They occur fairly regularly. Mostly it's lots of individual users and jobs, like you said though.
replies(1): >>42180282 #
almostgotcaught ◴[] No.42180282[source]
> where you use up to a whole machine

i mean rick stevens et al can grab all of polaris too but even so - it's just a bunch of nodes and you're responsible for distributing your work across those nodes efficiently. there's no sense in which it's a "single computer" in any way, shape or form.
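
To make "you're responsible for distributing your work across those nodes" concrete, here is a minimal sketch of the bookkeeping each process ends up doing for itself. It assumes a plain block decomposition; `block_range`, `n_ranks` and `rank` are hypothetical names for illustration, and in a real job the rank and rank count would come from whatever launcher you use rather than being passed in by hand.

```python
def block_range(n_items, n_ranks, rank):
    """Return the half-open [start, stop) slice of work owned by `rank`.

    Items are split as evenly as possible; the first `n_items % n_ranks`
    ranks each take one extra item. Nothing here is done for you by the
    machine: every process has to compute its own share.
    """
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Hypothetical example: 10 work items spread over 4 ranks.
ranges = [block_range(10, 4, r) for r in range(4)]
```

Each rank then only touches its own slice, and anything it needs from another rank's slice has to be sent explicitly.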

replies(1): >>42180326 #
llm_trw ◴[] No.42180326[source]
The same way that you're responsible for distributing your single threaded code between cores on your desktop.
replies(2): >>42182694 #>>42183140 #
1. davrosthedalek ◴[] No.42182694[source]
No. Threads typically run in the same address space. HPC processes on different nodes typically do not.
replies(1): >>42183490 #
2. llm_trw ◴[] No.42183490[source]
Define address space.

Cache is not shared between cores.

HPCs just have more levels of cache.

Lest you ignore the fact that infiniband is pretty much on par with top of the line ddr speeds for the matching generation.

replies(4): >>42183627 #>>42184007 #>>42185261 #>>42185697 #
3. davrosthedalek ◴[] No.42183627[source]
Really? How about: "This pointer is valid, has the same numeric value (address) and points to the same data in all threads". The point is neither the latency nor the bandwidth. The point is the programming/memory model. InfiniBand may make multiprocessing across nodes as fast as multiprocessing on a single node. But it's not multithreading.
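
A rough sketch of the distinction, under some loud assumptions: two processes on one machine stand in for processes on different nodes, the `fork` start method is used (POSIX only, so this will not run on Windows), and `append_marker` is just a hypothetical helper. A thread's write lands in the object the parent holds; a forked process writes to its own copy and the parent never sees it.

```python
import multiprocessing as mp
import threading


def append_marker(buf, marker):
    # Mutates whatever `buf` refers to in the caller's address space.
    buf.append(marker)


def run_in_thread(buf):
    t = threading.Thread(target=append_marker, args=(buf, "thread"))
    t.start()
    t.join()
    return list(buf)


def run_in_process(buf):
    # Assumption: "fork" is available (Linux/macOS). The child gets a
    # copy of the parent's memory, so its append never reaches us.
    ctx = mp.get_context("fork")
    p = ctx.Process(target=append_marker, args=(buf, "process"))
    p.start()
    p.join()
    return list(buf)


shared = []
after_thread = run_in_thread(shared)    # the thread's write is visible
after_process = run_in_process(shared)  # the child's write is not
```

To get the child's result back you need explicit communication (a pipe, a queue, a message), which is the same shape of problem as moving data between nodes, whatever the interconnect speed.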
4. imtringued ◴[] No.42184007[source]
>Cache is not shared between cores.

I feel sorry for you if you believe this. It's not true physically nor is it true on the level of the cache coherence protocol nor is it true from the perspective of the operating system.

5. formerly_proven ◴[] No.42185261[source]
There are four sentences in your comment.

None of them logically relate to another.

One is a question.

And the rest are wrong.

6. moralestapia ◴[] No.42185697[source]
>Lest you ignore the fact that infiniband is pretty much on par with top of the line ddr speeds for the matching generation.

You can't go faster than the speed of light (yet) and traveling a few micrometers will always be much faster than traversing a room (plus routing and switching).

Many HPC tasks nowadays are memory-bound rather than CPU-bound, or memory-latency-and-throughput-bound to be more precise. An actual supercomputer would be something like the Cerebras chip; a lot of the performance increase you get comes from having everything on-chip at a given time.
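
For a sense of scale on the speed-of-light point, here is a back-of-the-envelope sketch. The distances (10 micrometers on-chip, 30 m of cable across a room) and the 0.67 velocity factor for fibre are illustrative assumptions, not measurements, and the results are propagation lower bounds only: they ignore routing, switching, and the fact that on-chip signals are also slower than c.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s


def one_way_delay_ns(distance_m, velocity_factor=1.0):
    """Lower bound on one-way propagation delay, in nanoseconds."""
    return distance_m / (C * velocity_factor) * 1e9


# Assumed distances, for illustration only.
on_chip_ns = one_way_delay_ns(10e-6)                         # ~10 micrometers
across_room_ns = one_way_delay_ns(30.0, velocity_factor=0.67)  # ~30 m of fibre

ratio = across_room_ns / on_chip_ns
```

Under these assumptions the cross-room hop costs on the order of a hundred nanoseconds or more before any switch is involved, millions of times the on-chip figure.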