You are right, eventually something's gotta give. The path for this next leg isn't yet apparent to me.
P.S. How much is an exaflop or petaflop, and how significant is it? The numbers thrown around in this article don't mean anything to me. Is this new cluster way more powerful than the last top system?
Most customers care about cost-effectiveness more than best-in-class raw performance, a fact that AMD has ruthlessly exploited over the past 8 years. It helps that AMD products are occasionally both.
I know a lot of people developing on Apple silicon and just pushing it to clusters for bigger runs. So why not run it on an Apple GPU there?
Also, of the top 10, AMD has 5 systems.
1 petaflop = 10^15 flops = 1,000,000,000,000,000 flops.
1 exaflop = 10^18 flops = 1,000,000,000,000,000,000 flops.
Note that these are simply powers of 10, not powers of 2 (which are used for storage, for example).
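If you want a quick sense of how far apart the decimal and binary prefixes are, here's a throwaway Python check (purely illustrative):

    peta = 10**15          # decimal (SI) prefix, as used for flops
    exa = 10**18
    pebi = 2**50           # binary prefix, as often used for storage: 1,125,899,906,842,624
    exbi = 2**60

    print(pebi / peta)     # ~1.13, so the binary "peta" is about 13% bigger
    print(exbi / exa)      # ~1.15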
For everything that isn't machine learning, I frankly feel like it's the other way around. Apple's "solution" to these edge cases is telling people to write compute shaders that you could write in Vulkan or DirectX instead. What sets CUDA apart is an integration with a complex acceleration pipeline that Apple gave up trying to replicate years ago.
When cryptocurrency mining was king-for-a-day, everyone rushed out to buy Nvidia hardware because it supported accelerated crypto well from the start. The same thing happened with the AI and machine learning boom. Apple and AMD were both late to the party and wrongly assumed that NPU hardware would provide a comparable solution. Without a CUDA competitor, Apple would struggle more than AMD to find market fit.
Nominally, a measurement in "flops" is how many (typically 32-bit) FLoating-point Operations Per Second the hardware is capable of performing, so it's an approximate measure of total available computing power.
A high-end consumer-grade CPU can achieve on the order of a few hundred gigaflops (let's say 250, just for a nice round number). https://boinc.bakerlab.org/rosetta/cpu_list.php
A petaflop is therefore about four thousand of those; multiply by another thousand to get an exaflop.
For another point of comparison, a high-end GPU might be on the order of 40-80 teraflops. https://www.tomshardware.com/reviews/gpu-hierarchy,4388-2.ht...
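Putting those numbers together (250 gigaflops per CPU, call it 50 teraflops per GPU, the midpoint of that range; back-of-the-envelope only):

    CPU_FLOPS = 250e9     # high-end consumer CPU, per the Rosetta numbers above
    GPU_FLOPS = 50e12     # high-end consumer GPU, midpoint of the 40-80 teraflop range

    PETAFLOP = 1e15
    EXAFLOP = 1e18

    print(PETAFLOP / CPU_FLOPS)   # 4,000 CPUs per petaflop
    print(EXAFLOP / CPU_FLOPS)    # 4,000,000 CPUs per exaflop
    print(PETAFLOP / GPU_FLOPS)   # 20 GPUs per petaflop
    print(EXAFLOP / GPU_FLOPS)    # 20,000 GPUs per exaflop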
NVidia currently has 80-90% gross margins on their LLM GPUs; that's all the incentive another company needs to invest money into a CUDA alternative.
Which doesn't help with understanding how much more impressive these are than the last clusters, but it does, to me at least, put the amount of compute these clusters have into perspective.
My point of reference is that back in undergrad (~10-15 years ago), I recall a class assignment where we had to optimize matrix multiplication on a CPU; typical good parallel implementations achieved about 100-130 gigaflops (on a... Nehalem or Westmere Xeon, I think?).
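If you want to redo that kind of measurement today, the usual convention is to count an n-by-n matrix multiply as roughly 2*n^3 floating-point operations. A rough numpy sketch (the exact numbers will depend entirely on your CPU and BLAS build):

    import time
    import numpy as np

    n = 4096
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)

    start = time.perf_counter()
    c = a @ b                        # dispatches to the underlying BLAS
    elapsed = time.perf_counter() - start

    flops = 2 * n**3                 # ~2*n^3 ops for a dense n x n matmul
    print(f"{flops / elapsed / 1e9:.1f} GFLOPS")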
The code to run these things on Apple's GPUs exists and is used every day! I don't know anyone using AMD GPUs, but pretty often it's Nvidia on the cluster and Apple on the laptop. So if Nvidia is making these juicy profits, I think Apple could seriously think about moving to the cluster if it wants to.
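That laptop-to-cluster workflow is mostly a one-line device switch in PyTorch; a minimal sketch, assuming a reasonably recent PyTorch with the MPS backend built in:

    import torch

    # Prefer CUDA on the cluster, fall back to Apple's MPS backend on a Mac,
    # then to plain CPU if neither is available.
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    model = torch.nn.Linear(1024, 1024).to(device)
    x = torch.randn(8, 1024, device=device)
    y = model(x)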
The reason AMD is behind is that it is behind in hardware. The MI300X is pricier per hour than the H100 in every cloud I can find, and the MFU is an order of magnitude lower than NVIDIA's for transformers, even though transformers are fully supported. And I get the same 40-50% MFU on TPUs for the same code. If anyone is investing >10 million dollars in hardware, they can surely invest a million dollars to rewrite everything in whatever language AMD asks them to if it is cheaper.
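For anyone unfamiliar, MFU (model FLOPs utilization) is just the flops your model actually needs divided by the hardware's peak. A rough sketch using the common ~6 flops per parameter per token rule of thumb for dense transformer training; every number below is made up for illustration:

    params = 7e9             # hypothetical model size
    tokens_per_sec = 12_000  # hypothetical measured training throughput
    peak_flops = 1e15        # hypothetical per-chip peak (depends on chip and precision)

    achieved_flops = 6 * params * tokens_per_sec   # forward + backward estimate
    mfu = achieved_flops / peak_flops
    print(f"MFU: {mfu:.1%}")                       # ~50% with these made-up numbers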
Which does make the clusters a fair bit less impressive, but also a lot more sensibly sized.
https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvid...
But at these levels of compute, the memory/interconnect bandwidth becomes the bottleneck.
You need to develop your own in-house solution for distributing workloads.
The difference from regular clusters is that all the memory is globally visible, so machine 0023 can access and modify address 0x0123456789abcdef0123456789abcdef, which happens to be on machine 0999.
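As a toy picture of what "globally visible" means, the runtime/interconnect effectively splits every global address into a node id and a local offset; the bit layout and numbers below are made up purely for illustration:

    NODE_BITS = 12      # hypothetical: up to 4096 machines
    OFFSET_BITS = 52    # hypothetical: remaining bits address local memory

    def split_global_address(addr: int) -> tuple[int, int]:
        """Map a 64-bit global address to (node id, local offset)."""
        node = addr >> OFFSET_BITS
        offset = addr & ((1 << OFFSET_BITS) - 1)
        return node, offset

    node, offset = split_global_address(0x3E70_0000_DEAD_BEEF)
    print(node)     # 999 -> the access gets routed to machine 0999
    print(offset)   # local offset within that machine's memory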
E.g. you could run an H100 at 100% utilization 24/7 for a year at $0.4 per kWh (so assuming significant overhead for infrastructure etc.), and that would only cost ~10% of the purchase price of the GPU itself.
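The arithmetic behind that, in case anyone wants to check it (the ~700 W figure is the H100 SXM board power; the purchase price is an assumption, street prices vary a lot):

    power_kw = 0.7             # ~700 W board power for an H100 SXM
    hours = 24 * 365           # 100% utilization for one year
    price_per_kwh = 0.4        # from above, including infrastructure overhead
    gpu_price = 25_000         # assumed purchase price in dollars

    energy_cost = power_kw * hours * price_per_kwh
    print(energy_cost)               # ~2,450 dollars per year
    print(energy_cost / gpu_price)   # ~0.10 -> roughly 10% of the GPU price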
The PyTorch MPS patches are a fun appeasement for developers, but they haven't put a dent in Nvidia's demand. They didn't beat Nvidia on performance per watt, they didn't match their price, their scale, or CUDA's feature set, and they don't even provide basic server drivers. It's got nothing to do with what brand you prefer and everything to do with what makes actual sense in a datacenter. Apple can't take on Nvidia clusters without copying Nvidia's current architecture, and Apple Silicon's current architecture is too inefficient to be a serious replacement for Nvidia clusters.
If Apple wanted a shot at entering the cluster game, that window of opportunity closed when Apple Silicon converged on simplified GPU designs. The 2 W NPUs and compute shaders aren't going to scare Nvidia, let alone compete with AMD's market share.
And of course there is a serious amount of money sloshing around in this space. Things being hard doesn't mean they're impossible. And there's no shortage of extremely well-funded companies working on this stuff: all your favorite trillion-dollar companies, basically. Most of them have their own AI chips too, and probably some reservations about perpetually handing a lot of their cash to Nvidia.
If you want an example of a company that used to have a gigantic moat and is now dealing with a lot of competition, look at Intel. x86 used to be that moat, and it's looking pretty weak lately. One reason AMD is in the news a lot lately is that they are growing at Intel's expense. Nvidia might be their next target.
We can increase that another 2x and the cost would still be relatively low compared to the price/depreciation of the GPU itself.
You are mostly listing irrelevant nice-to-have things that aren't deal breakers. AMD's consumer GPUs have a long history of being abandoned a year or two after release.
Coupled with Khronos, Intel, and AMD never delivering anything comparable with OpenCL, Apple losing interest after Khronos didn't take OpenCL in the direction they wanted, and Google never adopting it, favouring their RenderScript dialect instead.
As for Fortran, that doesn't come up much in modern AI stuff. I haven't observed PTX / GCN assembly within AI codebases but maybe you have extra insight there.