You are right, eventually something's gotta give. The path for this next leg isn't yet apparent to me.
P.S. How much is an exaflop or petaflop, and how significant is it? The numbers thrown around in this article don't mean anything to me. Is this new cluster way more powerful than the last top system?
Most customers care about cost-effectiveness more than best-in-class raw performance, a fact that AMD has ruthlessly exploited over the past 8 years. It helps that AMD products are occasionally both.
I know a lot of people developing on Apple silicon and just pushing it to clusters for bigger runs. So why not run it on an Apple GPU there?
Also, of the top 10, AMD has 5 systems.
1 petaflop = 10^15 flops = 1,000,000,000,000,000 flops.
1 exaflop = 10^18 flops = 1,000,000,000,000,000,000 flops.
Note that these are simply powers of 10, not powers of 2 (which are used for storage, for example).
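If you want a quick sense of how far apart the decimal and binary prefixes are, here's a throwaway Python check (purely illustrative):

    peta = 10**15          # decimal (SI) prefix, as used for flops
    exa = 10**18
    pebi = 2**50           # binary prefix, as often used for storage: 1,125,899,906,842,624
    exbi = 2**60

    print(pebi / peta)     # ~1.13, so the binary "peta" is about 13% bigger
    print(exbi / exa)      # ~1.15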
For everything that isn't machine learning, I frankly feel like it's the other way around. Apple's "solution" to these edge cases is telling people to write compute shaders that you could write in Vulkan or DirectX instead. What sets CUDA apart is an integration with a complex acceleration pipeline that Apple gave up trying to replicate years ago.
When cryptocurrency mining was king-for-a-day, everyone rushed out to buy Nvidia hardware because it supported accelerated crypto well from the start. The same thing happened with the AI and machine learning boom. Apple and AMD were both late to the party and wrongly assumed that NPU hardware would provide a comparable solution. Without a CUDA competitor, Apple would struggle more than AMD to find market fit.
Nominally, a measurement in "flops" is how many (typically 32-bit) FLoating-point Operations Per Second the hardware is capable of performing, so it's an approximate measure of total available computing power.
A high-end consumer-grade CPU can achieve on the order of a few hundred gigaflops (let's say 250, just for a nice round number). https://boinc.bakerlab.org/rosetta/cpu_list.php
A petaflop is therefore about four thousand of those; multiply by another thousand to get an exaflop.
For another point of comparison, a high-end GPU might be on the order of 40-80 teraflops. https://www.tomshardware.com/reviews/gpu-hierarchy,4388-2.ht...
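Putting those numbers together (250 gigaflops per CPU, call it 50 teraflops per GPU, the midpoint of that range; back-of-the-envelope only):

    CPU_FLOPS = 250e9     # high-end consumer CPU, per the Rosetta numbers above
    GPU_FLOPS = 50e12     # high-end consumer GPU, midpoint of the 40-80 teraflop range

    PETAFLOP = 1e15
    EXAFLOP = 1e18

    print(PETAFLOP / CPU_FLOPS)   # 4,000 CPUs per petaflop
    print(EXAFLOP / CPU_FLOPS)    # 4,000,000 CPUs per exaflop
    print(PETAFLOP / GPU_FLOPS)   # 20 GPUs per petaflop
    print(EXAFLOP / GPU_FLOPS)    # 20,000 GPUs per exaflop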
NVidia currently has 80-90% gross margins on their LLM GPUs; that's all the incentive another company needs to invest money into a CUDA alternative.
Which doesn't help with understanding how much more impressive these are than the last clusters, but it does, to me at least, put the amount of compute these clusters have into perspective.
My point of reference is that back in undergrad (~10-15 years ago), I recall a class assignment where we had to optimize matrix multiplication on a CPU; typical good parallel implementations achieved about 100-130 gigaflops (on a... Nehalem or Westmere Xeon, I think?).
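If you want to redo that kind of measurement today, the usual convention is to count an n-by-n matrix multiply as roughly 2*n^3 floating-point operations. A rough numpy sketch (the exact numbers will depend entirely on your CPU and BLAS build):

    import time
    import numpy as np

    n = 4096
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)

    start = time.perf_counter()
    c = a @ b                        # dispatches to the underlying BLAS
    elapsed = time.perf_counter() - start

    flops = 2 * n**3                 # ~2*n^3 ops for a dense n x n matmul
    print(f"{flops / elapsed / 1e9:.1f} GFLOPS")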
The code to run these things on Apple's GPUs exists and is used every day! I don't know anyone using AMD GPUs, but pretty often it's Nvidia on the cluster and Apple on the laptop. So if Nvidia is making these juicy profits, I think Apple could seriously think about moving to the cluster if it wants to.
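That laptop-to-cluster workflow is mostly a one-line device switch in PyTorch; a minimal sketch, assuming a reasonably recent PyTorch with the MPS backend built in:

    import torch

    # Prefer CUDA on the cluster, fall back to Apple's MPS backend on a Mac,
    # then to plain CPU if neither is available.
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    model = torch.nn.Linear(1024, 1024).to(device)
    x = torch.randn(8, 1024, device=device)
    y = model(x)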
The reason AMD is behind is that it is behind in hardware. The MI300X is pricier per hour than the H100 in every cloud I can find, and the MFU is an order of magnitude lower than NVIDIA's for transformers, even though transformers are fully supported. And I get the same 40-50% MFU on TPUs for the same code. If anyone is investing >10 million dollars in hardware, they can surely invest a million dollars to rewrite everything in whatever language AMD asks them to if it is cheaper.
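For anyone unfamiliar, MFU (model FLOPs utilization) is just the flops your model actually needs divided by the hardware's peak. A rough sketch using the common ~6 flops per parameter per token rule of thumb for dense transformer training; every number below is made up for illustration:

    params = 7e9             # hypothetical model size
    tokens_per_sec = 12_000  # hypothetical measured training throughput
    peak_flops = 1e15        # hypothetical per-chip peak (depends on chip and precision)

    achieved_flops = 6 * params * tokens_per_sec   # forward + backward estimate
    mfu = achieved_flops / peak_flops
    print(f"MFU: {mfu:.1%}")                       # ~50% with these made-up numbers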
Which does make the clusters a fair bit less impressive, but also a lot more sensibly sized.
https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvid...
But at these levels of compute, the memory/interconnect bandwidth becomes the bottleneck.
You need to develop your own in-house solution for distributing workloads.
The difference from regular clusters is that all the memory is globally visible, so machine 0023 can access and modify address 0x0123456789abcdef0123456789abcdef, which happens to be on machine 0999.
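As a toy picture of what "globally visible" means, the runtime/interconnect effectively splits every global address into a node id and a local offset; the bit layout and numbers below are made up purely for illustration:

    NODE_BITS = 12      # hypothetical: up to 4096 machines
    OFFSET_BITS = 52    # hypothetical: remaining bits address local memory

    def split_global_address(addr: int) -> tuple[int, int]:
        """Map a 64-bit global address to (node id, local offset)."""
        node = addr >> OFFSET_BITS
        offset = addr & ((1 << OFFSET_BITS) - 1)
        return node, offset

    node, offset = split_global_address(0x3E70_0000_DEAD_BEEF)
    print(node)     # 999 -> the access gets routed to machine 0999
    print(offset)   # local offset within that machine's memory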
E.g. you could run an H100 at 100% utilization 24/7 for a year at $0.4 per kWh (so assuming significant overhead for infrastructure etc.), and that would only cost ~10% of the purchase price of the GPU itself.
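The arithmetic behind that, in case anyone wants to check it (the ~700 W figure is the H100 SXM board power; the purchase price is an assumption, street prices vary a lot):

    power_kw = 0.7             # ~700 W board power for an H100 SXM
    hours = 24 * 365           # 100% utilization for one year
    price_per_kwh = 0.4        # from above, including infrastructure overhead
    gpu_price = 25_000         # assumed purchase price in dollars

    energy_cost = power_kw * hours * price_per_kwh
    print(energy_cost)               # ~2,450 dollars per year
    print(energy_cost / gpu_price)   # ~0.10 -> roughly 10% of the GPU price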
The PyTorch MPS patches are a fun appeasement for developers, but they haven't put a dent in Nvidia's demand. They didn't beat Nvidia on performance per watt, they didn't match their price, their scale, or CUDA's feature set, and they don't even provide basic server drivers. It's got nothing to do with what brand you prefer and everything to do with what makes actual sense in a datacenter. Apple can't take on Nvidia clusters without copying Nvidia's current architecture, and Apple Silicon's current architecture is too inefficient to be a serious replacement for Nvidia clusters.
If Apple wanted a shot at entering the cluster game, that window of opportunity closed when Apple Silicon converged on simplified GPU designs. The 2 W NPUs and compute shaders aren't going to scare Nvidia, let alone compete with AMD's market share.
And of course there is a serious amount of money sloshing around in this space. Things being hard doesn't mean they're impossible. And there's no shortage of extremely well-funded companies working on this stuff: all your favorite trillion-dollar companies, basically. Most of them have their own AI chips too, and probably some reservations about perpetually handing a lot of their cash to Nvidia.
If you want an example of a company that used to have a gigantic moat and is now dealing with a lot of competition, look at Intel. x86 used to be that moat, and it's looking pretty weak lately. One reason AMD is in the news a lot lately is that they are growing at Intel's expense. Nvidia might be their next target.
We can increase that another 2x and the cost would still be relatively low compared to the price/depreciation of the GPU itself.
You are mostly listing irrelevant nice-to-have things that aren't deal breakers. AMD's consumer GPUs have a long history of being abandoned a year or two after release.
Coupled with Khronos, Intel, and AMD never delivering anything comparable with OpenCL, Apple losing interest after Khronos didn't take OpenCL in the direction they wanted, and Google never adopting it, favouring their RenderScript dialect instead.
As for Fortran, that doesn't come up much in modern AI stuff. I haven't observed PTX / GCN assembly within AI codebases but maybe you have extra insight there.