Basic Facts about GPUs

(damek.github.io)

elashri ◴[24 Jun 25 14:52 UTC] No.44366911[source]▶

>>44365320 (OP) #

Good article summarizing a good chunk of information that people should have some idea about. I just want to comment that the title is a little bit misleading, because the article describes the specific choices NVIDIA makes in developing its GPU architectures, which is not always what others do.

For example, the arithmetic intensity break-even point (ridge point) is very different once you leave NVIDIA-land. Take the AMD Instinct MI300: up to 160 TFLOPS FP32 paired with ~6 TB/s of HBM3/3E bandwidth gives a ridge point near 27 FLOPs/byte, about double the A100's 13 FLOPs/byte. The larger on-package HBM (128–256 GB) also shifts the practical trade-offs between tiling depth and occupancy. That said, it is very expensive and does not have CUDA (which can be good and bad at the same time).
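A minimal sketch of the ridge-point arithmetic, in case it helps to see it worked out. The MI300 numbers are the ones quoted in this comment; the A100 inputs (19.5 TFLOPS FP32, 1.555 TB/s) are an assumption that is not stated here, picked because they reproduce the ~13 FLOPs/byte figure. The function names are illustrative only.

```python
# Roofline ridge point: the arithmetic intensity (FLOPs per byte moved) at
# which a kernel stops being memory-bound and becomes compute-bound.

def ridge_point(peak_tflops: float, bandwidth_tb_per_s: float) -> float:
    """Peak compute divided by peak memory bandwidth, in FLOPs per byte."""
    # Both inputs carry the same 1e12 prefix, so it cancels out.
    return peak_tflops / bandwidth_tb_per_s


def attainable_tflops(intensity: float, peak_tflops: float,
                      bandwidth_tb_per_s: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_tflops, intensity * bandwidth_tb_per_s)


# Figures quoted in the comment above.
mi300 = ridge_point(peak_tflops=160.0, bandwidth_tb_per_s=6.0)

# Assumed A100 figures (not given in the comment).
a100 = ridge_point(peak_tflops=19.5, bandwidth_tb_per_s=1.555)

print(f"MI300 ridge point: {mi300:.1f} FLOPs/byte")
print(f"A100 ridge point:  {a100:.1f} FLOPs/byte")

# A kernel at 10 FLOPs/byte sits below both ridge points, so it is
# bandwidth-bound on either device under these assumptions.
print(f"MI300 bound at 10 FLOPs/byte: {attainable_tflops(10, 160.0, 6.0):.0f} TFLOPS")
```

A kernel tuned to sit just above the A100's ridge point would, under these numbers, still be memory-bound on the MI300, which is the practical consequence of the higher ridge point.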

replies(2): >>44367014 #>>44380929 #

1. apitman ◴[24 Jun 25 15:02 UTC] No.44367014[source]▶

>>44366911 #

Unfortunately Nvidia GPUs are the only ones that matter until AMD starts taking their computer software seriously.

replies(2): >>44367150 #>>44368272 #

2. fooblaster ◴[24 Jun 25 15:15 UTC] No.44367150[source]▶

>>44367014 (TP) #

They are. It's just not at the consumer hardware level.

replies(2): >>44368013 #>>44368161 #

3. have-a-break ◴[24 Jun 25 16:35 UTC] No.44368013[source]▶

>>44367150 #

You could argue it's all the nice GPU debugging tools NVIDIA provides that make GPU programming accessible.

There are so many potential bottlenecks (normally just memory access patterns, but without tools to verify you have to design and run manual experiments).

4. tucnak ◴[24 Jun 25 16:46 UTC] No.44368161[source]▶

>>44367150 #

This misconception is repeated time and time again; software support of their datacenter-grade hardware is just as bad. I've had the displeasure of using MI50, MI100 (a lot), MI210 (very briefly.) All three are supposedly enterprise-grade computing hardware, and yet, it was a pathetic experience with a myriad of disconnected components which had to be patched, & married with a very specific kernel version to get ANY kind of LLM inference going.

Now, the last of it I bothered with was 9 months ago; enough is enough.

replies(1): >>44369737 #

5. tucnak ◴[24 Jun 25 16:56 UTC] No.44368272[source]▶

>>44367014 (TP) #

Unfortunately, GPU's are old news now. When it comes to perf/watt/dollar, TPU's are substantially ahead for both training and inference. There's a sparsity disadvantage with trailing-edge TPU devices such as v4, but if you care about large-scale training of any sort, it's not even close. Additionally, Tenstorrent p300 devices are hitting the market soon enough, and there's lots of promising stuff coming on the Xilinx side of the AMD shop: the recent Versal chips allow for AI compute-in-network capabilities that put NVIDIA Bluefield's supposed programmability to shame. NVIDIA likes to say Bluefield is like a next-generation SmartNIC, but compared to actually field-programmable Versal stuff, it's more like 100BASE-T cards from the 90s.

I think it's very naive to assume that GPU's will continue to dominate the AI landscape.

replies(2): >>44369832 #>>44370305 #

6. fooblaster ◴[24 Jun 25 19:04 UTC] No.44369737{3}[source]▶

>>44368161 #

this hardware is ancient history. mi250 and mi300 are much better supported

replies(1): >>44370312 #

7. menaerus ◴[24 Jun 25 19:12 UTC] No.44369832[source]▶

>>44368272 #

So, where does one buy a TPU?

replies(1): >>44370398 #

8. almostgotcaught ◴[24 Jun 25 19:50 UTC] No.44370305[source]▶

>>44368272 #

> Unfortunately, GPU's are old news now

...

> the recent Versal chips allow for AI compute-in-network capabilities that puts NVIDIA Bluefield's supposed programmability to shame

I'm always just like... who are you people. Like what is the profile of a person that just goes around proclaiming wild things as if they're completely established. And I see this kind of comment on hn very frequently. Like you either work for Tenstorrent or you're an influencer or a zdnet presenter or just ... because none of this is even remotely true.

Reminds me of

"My father would womanize; he would drink. He would make outrageous claims like he invented the question mark. Sometimes, he would accuse chestnuts of being lazy."

> I think it's very naive to assume that GPU's will continue to dominate the AI landscape

I'm just curious - how much of your portfolio is AMD and how much is NVDA and how much is GOOG?

replies(2): >>44370454 #>>44371227 #

9. tucnak ◴[24 Jun 25 19:50 UTC] No.44370312{4}[source]▶

>>44369737 #

What a load of nonsense. MI210 effectively hit the market in 2023, similarly to H100. We're talking about a datacenter-grade card that is two years old, and it's already "ancient history?"

No wonder nobody on this site trusts AMD.

replies(2): >>44370954 #>>44372754 #

10. tucnak ◴[24 Jun 25 19:59 UTC] No.44370398{3}[source]▶

>>44369832 #

The actual lead times on similarly-capable GPU systems are so long, by the time your order is executed, you're already losing money. Even assuming perfect utilization, and perfect after-market conditions—you won't be making any money on the hardware anyway.

Buy v. rent calculus is only viable if there's no asymmetry between the two. Oftentimes, what you can rent you cannot buy, and vice-versa, what you can buy—you could never rent. Even if you _could_ buy an actual TPU, you wouldn't be able to run it anyway, as it's all built around sophisticated networking and switching topologies[1]. The same goes for GPU deployments of comparable scale: what made you think that you could buy and run GPU's at scale?

It's a fantasy.

[1] https://arxiv.org/abs/2304.01433

replies(2): >>44370513 #>>44371092 #

11. timeinput ◴[24 Jun 25 20:04 UTC] No.44370454{3}[source]▶

>>44370305 #

Listen, I'm ~~not~~ all in on Ferrero Rocher, and chestnuts *are* lazy. Nowhere near as productive as hazelnuts.

12. almostgotcaught ◴[24 Jun 25 20:10 UTC] No.44370513{4}[source]▶

>>44370398 #

Is your answer to "where can I buy a TPU" that you can't buy a GPU either? That's a new one.

First of all I don't understand how that's an answer. Second of all it's laughably wrong - I can name 5 firms (outside of FAANG) off the top of my head with >1k Blackwell devices and they're making very good money (have you ever heard of quantfi....). Third of all, how is TPU going to conquer absolutely anything when (as you admit) you couldn't run one even if you could buy one?

replies(1): >>44371304 #

13. bluescrn ◴[24 Jun 25 20:55 UTC] No.44370954{5}[source]▶

>>44370312 #

Unless you're, you know, using GPUs for graphics...

Xbox, Playstation, and Steam Deck seem to be doing pretty nicely with AMD.

replies(1): >>44372550 #

14. menaerus ◴[24 Jun 25 21:09 UTC] No.44371092{4}[source]▶

>>44370398 #

Right. Your argument doesn't really follow. Since I cannot buy a TPU, which you agree with, the only viable option is really a GPU, which I _can_ buy.

So, according to that, GPUs aren't really going anywhere unless there's a new player in town who will compete with Nvidia and sell at lower prices.

15. tucnak ◴[24 Jun 25 21:24 UTC] No.44371227{3}[source]▶

>>44370305 #

> I'm just curious - how much of your portfolio is AMD

I'm always just like... who are you people: financiers, or hackers? :-) I don't work for TT, but I am a founder in the vertical AI space. Firstly, every major player is making AI accelerators of their own now, and guess what, most state-of-the-art designs have very little in common with a GPGPU design of yester-year. We have thoroughly evaluated various options, including buying/renting NVIDIA hardware; unfortunately, it didn't make any sense, neither in terms of cost nor capability. Buying (and waiting _months_ for) NVIDIA rack-fuls is the quickest way to bankrupt your business with CAPEX. Renting the same hardware is merely moving the disease to OPEX, and in the post-ZIRP era this is equally devastating.

No matter how much HBM memory you get for whatever individual device, no matter the packaging—it's never going to be enough. The weights alone are quickly dwarfed by K/V cache pages anyway. This is doubly true, if you're executing highly-concurrent agents that share a lot of the context, or doing dataset-scale inference transformations. The only thing that matters, truly, is the ability to scale-out, meaning fabrics, RDMA over fabrics. Even the leading-edge GPU systems aren't really good at it, because none of the interconnect is actually programmable.
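A rough sketch of why the K/V cache can outgrow the weights. Every number below is a hypothetical assumption for illustration (a 70B-parameter model in 16-bit precision with 80 layers, 8 K/V heads of dimension 128, and 64 concurrent sequences of 32k tokens); none of it comes from this thread, and it ignores any prefix sharing between sequences.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of K/V cache one token occupies across all layers."""
    # Factor of 2: each layer stores one key and one value vector per K/V head.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(tokens_in_flight: int, **shape: int) -> int:
    """Total cache size for all tokens currently held in memory."""
    return tokens_in_flight * kv_bytes_per_token(**shape)


# Hypothetical model shape (assumption, not from the thread).
shape = dict(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)
weights_bytes = 70_000_000_000 * 2          # 70B parameters at 2 bytes each

# Hypothetical serving load: 64 concurrent sequences of 32k tokens each.
tokens_in_flight = 64 * 32_768
cache_bytes = kv_cache_bytes(tokens_in_flight, **shape)

print(f"per-token K/V: {kv_bytes_per_token(**shape) / 1024:.0f} KiB")
print(f"weights:       {weights_bytes / 1e9:.0f} GB")
print(f"K/V cache:     {cache_bytes / 1e9:.0f} GB")
print(f"cache/weights: {cache_bytes / weights_bytes:.1f}x")
```

Under these assumptions the cache is several times the size of the weights, and it grows linearly with both concurrency and context length while the weights stay fixed.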

The current generation of TT cards (7nm) has four 800G NIC's per card, and the actual Blackhole chips[1] support up to 12x400G. You can approach TT, they will license you the IP, and you get to integrate it at whatever scale you please (good luck even getting in a room with Arm people!) and because TT's whole stack is open source, you get to "punch in" whatever topology you want[2]. In other words, at least with TT you would get a chance to scale-out without bankrupting your business.

The compute hierarchy is fresh and in line with the latest research, their toolchain is as hackable as it gets, and stands multiple heads above anything that AMD or Intel had ever released. Most importantly, because TT is currently under-valued, it presents an outstanding opportunity for businesses like ours in navigating around the established cost-centers. For example, TT still offers "Galaxy" deployments which used to contain 32 previous-generation (Wormhole) devices in a 6U air-cooled chassis. It's not a stretch that a similar setup, composed of 32 liquid-cooled Blackholes (2 TB GDDR6, 100 Tbps interconnect) would fit in a 4U chassis. AFAIK, there's no GPU deployment in the world at that density. Similarly to TPU design, it's also infinitely scalable by means of 3+D twisted torus topologies.
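To make the torus point concrete, here is a sketch of neighbor addressing in a plain 3D torus. This is an assumption-laden simplification: it is the untwisted variant (a twisted torus offsets the wraparound links), the 4x4x4 shape is hypothetical, and nothing here reflects an actual TT or TPU configuration.

```python
from itertools import product


def torus_neighbors(node, dims):
    """Neighbors of `node` in a torus of shape `dims`, wrapping on every axis."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coord = list(node)
            # Modulo arithmetic provides the wraparound link at each edge.
            coord[axis] = (coord[axis] + step) % size
            neighbors.append(tuple(coord))
    return neighbors


def torus_links(dims):
    """Count undirected links by collecting each node's neighbor pairs once."""
    links = set()
    for node in product(*(range(size) for size in dims)):
        for neighbor in torus_neighbors(node, dims):
            links.add(frozenset((node, neighbor)))
    return len(links)


dims = (4, 4, 4)  # hypothetical 64-device pod
print(torus_neighbors((0, 0, 0), dims))
print(torus_links(dims), "links for", 4 * 4 * 4, "devices")
print(torus_links((8, 8, 8)), "links for", 8 * 8 * 8, "devices")
```

Each device keeps six links no matter how large the torus gets, so the link count grows linearly with the device count. That fixed per-device port budget is what makes the topology easy to extend.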

What's currently missing in the TT ecosystem: (1) the "superchip" package including state of the art CPU cores, like TT-Ascalon, that they would also happily license to you, and perhaps more importantly, (2) compute-in-network capability, so that the stupidly-massive TT interconnect bandwidth could be exploited/informed by applications.

Firstly, the Grendel superchip is expected to hit the market by the end of next year.

Secondly, because the interconnect is not some proprietary bullshit from Mellanox, you get to introduce the programmable-logic NIC's into the topology, and maybe even avoid IP encapsulation altogether! There are many reasons to do so, and indeed, Versal FPGA's have lots to offer in terms of hard IP in addition to PL. K/V cache management with offloading to NVMe-oF clusters, prefix-matching, reshaping, quantization, compression, and all the other terribly-parallel tasks which are basically intractable for anything other than FPGA's.

Today, if we wanted to do a large-scale training run, we would simply go for the most cost-effective option available at scale, which is renting TPU v6 from Google. This is a temporary measure, if anything, because compute-in-network in AI deployments is still a novelty, and nobody can really do it at sufficiently-large scale yet. Thankfully, Xilinx is getting there[3]. AWS offers f1 instances, it does offer NVMe-accelerated ones, as well as AI accelerators, but there's a good reason they're unable to offer all three at the same time.

[1] https://riscv.epcc.ed.ac.uk/assets/files/hpcasia25/Tenstorre...

[2] https://github.com/tenstorrent/tt-metal/blob/main/tech_repor...

[3] https://www.amd.com/en/products/accelerators/alveo/v80.html

replies(1): >>44371308 #

16. tucnak ◴[24 Jun 25 21:34 UTC] No.44371304{5}[source]▶

>>44370513 #

I never claimed that "TPU is going to conquer everything"; it's a matter of fact that the latest-generation TPU is currently the most cost-effective solution for large-scale training. I'm not even saying that NVIDIA has lost, just that GPU's have lost. Maybe NVIDIA comes up with a non-GPU based system, and it includes programmable fabric to enable compute-in-network capabilities, sure, anything other than Bluefield nonsense, but it's already clear from the engineering standpoint that the formula of large HBM stacks attached to a "GPU"+Bluefield is over.

replies(1): >>44371369 #

17. almostgotcaught ◴[24 Jun 25 21:40 UTC] No.44371369{6}[source]▶

>>44371304 #

> NVIDIA has lost, just that GPU's have lost

i hope you realize how silly you sound when

1. NVDA's market cap is 70% more than GOOG's

2. there is literally not a single other viable competitor to GPGPU amongst the 30 or so "accelerator" companies that all swear their thing will definitely be the one, even with many of them approaching 10 years in the market by now (cerebras, samba nova, groq, dmatrix, blah blah blah).

18. tucnak ◴[24 Jun 25 21:53 UTC] No.44371462{5}[source]▶

>>44371308 #

Your obsession with finance/marketing is exactly what I expect to see on HN.

It's a shame your accusations have zero merit. In the future, please try not to embarrass yourself by attempting to get into technical discussion, and promptly backing out of it having not made a single technical argument in the process. Good luck on the stock market

See https://news.ycombinator.com/newsguidelines.html

replies(1): >>44371508 #

19. almostgotcaught ◴[24 Jun 25 21:59 UTC] No.44371508{6}[source]▶

>>44371462 #

> Your obsession with finance/marketing is exactly what I expect to see on HN.

homie i work on AI infra at one of these companies that you're so casually citing in all of your marketing content here. you're not simply wrong on the things you claim - you're not even wrong. you literally don't know what you're talking about because you're citing external facing docs/code/whatever.

> attempting to get into technical discussion, and promptly backing out of it having not made a single technical argument in the process

there's no technical discussion to be had with someone that cites other people's work as proof for their own claims.

20. MindSpunk ◴[25 Jun 25 00:32 UTC] No.44372550{6}[source]▶

>>44370954 #

The quantity of people on this site now that care about GPUs all of a sudden because of the explosion of LLMs, who fail to understand that GPUs are _graphics_ processors that are designed for _graphics_ workloads is insane. It almost feels like the popular opinion here is that graphics is just dead and AMD and NVIDIA should throw everything else they do in the bin to chase the LLM bag.

AMD make excellent graphics hardware, and the graphics tools are also fantastic. AMD's pricing and market positioning can be questionable but the hardware is great. They're not as strong with machine learning tasks, and they're in a follower position for tensor acceleration, but for graphics they are very solid.

replies(2): >>44372797 #>>44373861 #

21. fooblaster ◴[25 Jun 25 01:08 UTC] No.44372754{5}[source]▶

>>44370312 #

my experience with the mi300 does not mirror yours. If I have a complaint, it's that its performance does not live up to expectations.

22. almostgotcaught ◴[25 Jun 25 01:15 UTC] No.44372797{7}[source]▶

>>44372550 #

The quantity of people on this site now that think they understand modern GPUs because back in the day they wrote some opengl...

1. Both AMD and NVIDIA have "tensorcore" ISA instructions (ie real silicon/data-path, not emulation) which have zero use case in graphics

2. Ain't no one playing video games on MI300/H100 etc and the ISA/architecture reflects that

> but for graphics they are very solid.

Hmmm I wonder if AMD's overfit-to-graphics architectural design choices are a source of friction as they now transition to serving the ML compute market... Hmmm I wonder if they're actively undoing some of these choices...

replies(3): >>44373920 #>>44375258 #>>44375768 #

23. _carbyau_ ◴[25 Jun 25 05:21 UTC] No.44373861{7}[source]▶

>>44372550 #

Just having fun with an out of context quote.

> graphics is just dead and AMD and NVIDIA should throw everything else they do in the bin to chase the LLM bag

No graphics means that games of the future will be like:

"You have been eaten by a ClautGemPilot."

24. MindSpunk ◴[25 Jun 25 05:34 UTC] No.44373920{8}[source]▶

>>44372797 #

AMD isn't overfit to graphics. AMD's GPUs were friendly to general purpose compute well before Nvidia was. Hardware-wise anyway. AMD's memory access system and resource binding model was well ahead of Nvidia for a long time. When Nvidia was stuffing resource descriptors into special palettes with addressing limits, AMD was fully bindless under the hood. Everything was just one big address space, descriptors and data.

Nvidia 15 years ago was overfit to graphics. Nvidia just made smarter choices, sold more hardware and re-invested their winnings into software and improving their hardware. Now they're just as good at GPGPU with a stronger software stack.

AMD has struggled to be anything other than a follower in the market and has suffered quite a lot as a result. Even in graphics. Mesh shaders in DX12 were the result of NVIDIA dictating a new execution model that was very favorable to their new hardware, while AMD had already had a similar (but not perfectly compatible) system since Vega, called primitive shaders.

25. averne_ ◴[25 Jun 25 09:27 UTC] No.44375258{8}[source]▶

>>44372797 #

Matrix instructions do of course have uses in graphics. One example of this is DLSS.

replies(1): >>44378631 #

26. lomase ◴[25 Jun 25 10:51 UTC] No.44375768{8}[source]▶

>>44372797 #

Imagine thinking you know more than others because you use a different abstraction layer.

27. Agentlien ◴[25 Jun 25 15:47 UTC] No.44378631{9}[source]▶

>>44375258 #

This feels backwards to me when GPUs were created largely because graphics needed lots of parallel floating point operations, a big chunk of which are matrix multiplications.

When I think of matrix multiplication in graphics I primarily think of transforms between spaces: moving vertices from object space to camera space, transforming from camera space to screen space, ... This is a big part of the math done in regular rendering and needs to be done for every visible vertex in the scene - typically in the millions in modern games.
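A small sketch of that per-vertex pipeline, written in plain Python so the matrix multiplications are visible. The conventions are assumptions for illustration (OpenGL-style column vectors, camera looking down -Z, an 800x600 viewport, an object placed 5 units in front of the camera); real engines run this on the GPU in a vertex shader.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transform(m, v):
    """Apply a 4x4 matrix to a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]


def translation(tx, ty, tz):
    """Model matrix that moves an object from object space into the scene."""
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]


def perspective(fov_y_deg, aspect, near, far):
    """OpenGL-style perspective projection (camera space to clip space)."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [[f / aspect, 0, 0, 0],
            [0, f, 0, 0],
            [0, 0, (far + near) / (near - far), 2 * far * near / (near - far)],
            [0, 0, -1, 0]]


def to_screen(vertex, mvp, width, height):
    """Object-space vertex -> clip space -> NDC -> pixel coordinates."""
    x, y, _, w = transform(mvp, vertex)
    ndc_x, ndc_y = x / w, y / w                  # perspective divide
    return ((ndc_x * 0.5 + 0.5) * width,         # NDC [-1, 1] to pixels
            (1.0 - (ndc_y * 0.5 + 0.5)) * height)  # flip Y: screen origin is top-left


width, height = 800, 600
model = translation(0, 0, -5)                    # camera sits at the origin
projection = perspective(90, width / height, 0.1, 100.0)
mvp = matmul(projection, model)                  # combined once, reused per vertex

print(to_screen([1, 1, 0, 1], mvp, width, height))
```

The combined matrix is built once per object and then applied to every vertex, which is why this workload is many small, independent matrix-vector products instead of a few large matrix-matrix ones.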

I suppose the difference here is that DLSS is a case where you primarily do large numbers of consecutive matrix multiplications with little other logic, since it's more ANN code than graphics code.
