
212 points by pella | 4 comments
btown:
I've often thought that one of the ways AMD could distinguish itself from NVIDIA is by bringing significantly higher amounts of VRAM (or memory systems as performant as what we currently know as VRAM) to the consumer space.

A card with a fraction of the FLOPS of cutting-edge graphics cards (and ideally proportionally less power consumption), but with 64-128GB of VRAM-equivalent memory, would be a game-changer for letting people experiment with large multi-modal models, and it would seriously incentivize researchers to build the next generation of tensor abstraction libraries for both CUDA and ROCm/HIP. And for gaming, you could break new ground with high-resolution textures. AMD would be back in the game.
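
To put rough numbers on the capacity wish, here's a quick back-of-envelope sketch; the parameter counts and precisions are my own illustration, not from the thread:

    # Memory needed just for the weights of a dense model at common precisions.
    def weight_gb(params_b: float, bytes_per_param: float) -> float:
        # GB required for params_b billion parameters at the given precision
        return params_b * 1e9 * bytes_per_param / 2**30

    for params in (7, 13, 70):
        print(f"{params}B params: fp16 ~{weight_gb(params, 2):.0f} GB, "
              f"int8 ~{weight_gb(params, 1):.0f} GB, "
              f"int4 ~{weight_gb(params, 0.5):.0f} GB")
    # 70B at fp16 is ~130 GB for the weights alone, before activations and
    # KV cache -- which is why 64-128 GB on one card would matter.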

Of course, if it's not real VRAM, it needs to be at least somewhat close on the latency and bandwidth front, so let's pop on over and see what's happening in this article...

> An Infinity Cache hit has a load-to-use latency of over 140 ns. Even DRAM on the AMD Ryzen 9 7950X3D shows less latency. Missing Infinity Cache of course drives latency up even higher, to a staggering 227 ns. HBM stands for High Bandwidth Memory, not low latency memory, and it shows.

Welp. Guess my wish isn't coming true today.

Fade_Dance:
Assuming we are comparing chips that use the latest-generation, high-density memory modules, a wider bus width is required for larger memory capacities, which is expensive when it comes to silicon area. So if AMD is willing to spend die area to boost memory capacity as a competitive advantage, it may as well consider spending that area on more logic instead. It's a set of trade-offs and, to some degree, an optimization problem.
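
To see how width, capacity, and bandwidth move together, here's a rough sketch with assumed GDDR6-class figures (32-bit interface per chip, 2 GB per chip, 16 Gbps per pin; clamshell mode, which doubles capacity per channel, is ignored):

    PIN_RATE_GBPS = 16   # assumed per-pin data rate
    CHIP_BITS = 32       # interface width of one GDDR6 chip
    CHIP_GB = 2          # assumed capacity of one chip

    for bus_bits in (128, 256, 384, 512):
        chips = bus_bits // CHIP_BITS
        print(f"{bus_bits}-bit bus: {chips} chips, {chips * CHIP_GB} GB, "
              f"{bus_bits * PIN_RATE_GBPS / 8:.0f} GB/s")
    # Capacity only grows with bus width, and every extra 32 bits of bus
    # means more PHY area and more pins on the GPU die and package.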

That said, when an incumbent has a leadership advantage, one of the obvious ways to boost profit is to slash the memory bus width; a competitor can then come in, bring it back up a bit, and have a competitive offering. The industry has certainly seen this pattern many times. But AMD coming in and using gigantic memory capacities as a competitive advantage? You have to keep the die-space constraints in mind.

Well over a decade ago - I think it was R600 - AMD did take this approach, and it was fairly disastrous because the logic performance of the chip wasn't good enough while the die was too big and hot and yields were too low. They didn't strike the right balance and sacrificed too much for a 512-bit memory bus.

AMD also tried to sidestep some of these limitations with HBM back when it was a new technology, but that didn't work out for them either. They would actually have been better off just increasing the bus width and continuing to use the most optimized, cost-efficient commodity memory chips.

Data center parts may have a bit more freedom for innovation, but the consumer space is definitely stuck on the paradigm of a GPU plus nearby memory chips, and going outside that fence is a huge latency hit.

1. amluto:
> a wider bus width is required for larger memory capacities, which is expensive when it comes to silicon area

I find this constraint rather odd. An extra, say, three address bits would add very little space (or latency, in a serial protocol) to a memory bus, and the actual problem seems to be that current-generation memory chips are intended for point-to-point connection.
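
(For the arithmetic's sake, a one-liner showing what those address bits buy:)

    # Each extra address bit doubles the capacity reachable over the same pins.
    for extra_bits in (1, 2, 3):
        print(f"+{extra_bits} address bit(s) -> {2**extra_bits}x addressable capacity")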

It seems to me that, if the memory vendors aren’t building physically larger, higher capacity chips, then any of the major players (AMD, Nvidia, Intel, whoever else is in this field right now) could kludge around it with a multiplexer. A multiplexer would need to be somewhat large, but its job would be simple enough that it should be doable with an older, cheaper process and without using entirely unreasonable amounts of power.

So my assumption is this is mostly an economic issue. The vendors don’t think it’s worthwhile to do this.

2. sroussey:
The bus widths being talked about come in multiples of 128 bits. Apple's M-series chips are a good example: they go from 128 to 256 to 512 bits, and the bandwidth in GB/s happens to land roughly at the same number as the bus width in bits.
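
A quick check of that scaling, assuming LPDDR5-6400 (6.4 GT/s per pin, roughly the M2 generation):

    RATE_GTPS = 6.4  # assumed LPDDR5-6400 transfer rate
    for width_bits in (128, 256, 512):
        print(f"{width_bits}-bit: {width_bits / 8 * RATE_GTPS:.0f} GB/s")
    # 128 -> 102 GB/s, 256 -> 205 GB/s, 512 -> 410 GB/s: at this data rate
    # the width in bits lands near the bandwidth figure in GB/s.
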
3. formerly_proven:
GDDR has been point-to-point since... I dunno, probably 2000? Because, ceteris paribus, you can't really have an actual bus when you're chasing maximum bandwidth. Even double-sided layouts (like the T-layout, with <2 mm stubs) typically incur a reduction in data rate. These chips also dissipate a fair amount of heat; you're looking at around 5-8 W per chip (~6 pJ/bit), so it's not like you can just stack a bunch of those dies.
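
(Those two figures are consistent with each other; a quick check, assuming the standard 32 data pins per GDDR chip:)

    pj_per_bit = 6e-12   # the ~6 pJ/bit figure above
    pin_rate = 32e9      # 32 Gbps per pin
    pins = 32            # data pins on one GDDR chip
    print(f"{pj_per_bit * pin_rate * pins:.1f} W per chip at full tilt")  # ~6.1 W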

> A multiplexer would need to be somewhat large, but its job would be simple enough that it should be doable with an older, cheaper process and without using entirely unreasonable amounts of power.

I don't know what you're basing that on. We're talking about 32 Gbps SerDes here. Yes, there are multiplexers even for that. But what good is deciding which memory chip you want to use at boot-up?

4. amluto:
Not multiplexed at boot: multiplexed at run time. Build a chip that speaks the GDDR protocol to the host GPU, has 2-4 GDDR channels coming out the other end, and aggregates the attached memory, at the cost of some latency, some power, and an extra chip on the board. As far as the GPU is concerned, it's an extra-large GDDR chip, and it would allow a GPU vendor to squeeze in more RAM without adding more pins to the GPU or needing to route more memory channels directly to it.
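
A toy software model of that routing, purely illustrative (real silicon would interleave across channels and pipeline requests; every name here is made up):

    class GDDRAggregator:
        # Presents one big memory upstream; fans out to N chips downstream.
        def __init__(self, downstream_chips: int, chip_capacity: int):
            self.chip_capacity = chip_capacity
            self.chips = [bytearray(chip_capacity) for _ in range(downstream_chips)]

        def _route(self, addr: int):
            # Upper address bits select the chip, lower bits the offset;
            # the host only needs log2(N) extra address bits.
            return self.chips[addr // self.chip_capacity], addr % self.chip_capacity

        def read(self, addr: int) -> int:
            chip, offset = self._route(addr)
            return chip[offset]

        def write(self, addr: int, value: int) -> None:
            chip, offset = self._route(addr)
            chip[offset] = value & 0xFF

    agg = GDDRAggregator(downstream_chips=4, chip_capacity=1024)
    agg.write(3000, 0xAB)            # routed to chip 2, offset 952
    assert agg.read(3000) == 0xAB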

(Compare to something like Apple’s designs or “Project Digits”. Current- and next-gen GPUs have considerably higher memory bandwidth but considerably less memory capacity. Mostly my point is that I think Nvidia or AMD could make a desktop-style GPU with 2-4x the RAM, somewhat worse latency, but otherwise equivalent performance without needing Samsung or another vendor to build higher capacity GDDR chips than currently exist.)