172 points by marban | 23 comments
InTheArena ◴[] No.40051885[source]
While everyone has focused on Apple's power efficiency in the M-series chips, one thing that has been very interesting is how powerful the unified memory model (having the memory on-package with the CPU) with large bandwidth to that memory actually is. Hence a lot of people in the local LLaMA community are really going after high-memory Macs.

It's great to see NPUs here with the new Ryzen cores - but I wonder how effective they will be with off-die memory versus the Apple approach.

That said, it's nothing but great to see these capabilities in something other than an expensive NVIDIA card. Local NPUs may really help with edge-deploying more inferencing capabilities.

Edited - sorry, meant on-package.

replies(8): >>40051950 #>>40052032 #>>40052167 #>>40052857 #>>40053126 #>>40054064 #>>40054570 #>>40054743 #
1. thsksbd ◴[] No.40052857[source]
Old becomes new: the SGI O2 had an (off-chip) unified memory model for performance reasons.

Not a CS guy, but it seems to me that a NUMA-like architecture has to come back. Large RAM on-chip (balancing the thermal budget between number of cores and RAM), a much larger RAM off-chip, and even more RAM through a fast interconnect, all under a single kernel image. Like the Origin 300 had.

replies(3): >>40053224 #>>40054054 #>>40055112 #
2. Rinzler89 ◴[] No.40053224[source]
UMA in the SGI machines (and gaming consoles) made sense because all the memory chips at that time were equally slow, or fast, depending on how you wanna look at it.

PC hardware split video memory from system memory once GDDR became so much faster than system RAM, but GDDR has too high a latency for CPUs and DDR has too little bandwidth for GPUs, so the separation played to each one's strengths and still does to this day. Unifying them again, like with AMD's APUs, means compromises for either the CPU or the GPU. There's no free lunch.

Currently, AMD APUs on the PC use unified DDR, so CPU performance is top but GPU/NPU performance is bottlenecked. If they were to use unified GDDR like the PS5/Xbox, then GPU/NPU performance would be top and CPU performance would be bottlenecked.
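
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python comparing peak bandwidth for a typical dual-channel DDR5 setup against a 256-bit GDDR6 card; the transfer rates are assumed, illustrative figures, not the specs of any particular product.

  # Rough peak-bandwidth arithmetic for the DDR-vs-GDDR split described above.
  # Transfer rates are assumed example figures, not specs of any real product.

  def bandwidth_gb_s(bus_width_bits: int, transfer_rate_mt_s: int) -> float:
      """Peak bandwidth in GB/s = (bus width in bytes) * (transfers per second)."""
      return bus_width_bits / 8 * transfer_rate_mt_s * 1e6 / 1e9

  # Typical desktop CPU: dual-channel DDR5 (2 x 64-bit) at ~5600 MT/s.
  cpu_ddr5 = bandwidth_gb_s(bus_width_bits=128, transfer_rate_mt_s=5600)

  # Typical mid-range GPU: 256-bit GDDR6 at ~18000 MT/s.
  gpu_gddr6 = bandwidth_gb_s(bus_width_bits=256, transfer_rate_mt_s=18000)

  print(f"DDR5 system memory : ~{cpu_ddr5:.0f} GB/s")   # ~90 GB/s
  print(f"GDDR6 graphics card: ~{gpu_gddr6:.0f} GB/s")  # ~576 GB/s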

replies(3): >>40053947 #>>40054567 #>>40056075 #
3. Dalewyn ◴[] No.40053947[source]
>Unifying it again, like with AMD's APUs, means either compromises for the CPU or for the GPU. There's no free lunch.

I think the lunch here (it still ain't free) is that RAM speed means nothing if you don't have enough RAM in the first place, and this is a compromise solution to that practical problem.

replies(2): >>40054000 #>>40055288 #
4. Rinzler89 ◴[] No.40054000{3}[source]
>if you don't have enough RAM in the first place

Enough RAM for what task, exactly? System RAM is plentiful and cheap nowadays (unless you buy Apple). I got a new laptop with 32GB of RAM for about 750 euros. But the speeds are too low for the poor APU to handle high-end gaming or LLM training.

replies(1): >>40054101 #
5. aidenn0 ◴[] No.40054054[source]
Several PC graphics standards attempted to offer high-speed access to system memory (though I think only VLB offered direct access to the memory controller at the same speed as the CPU). Not needing dedicated GPU memory has obvious advantages, but it's hard to get right.
6. numpad0 ◴[] No.40054101{4}[source]
Enough RAM for LLMs. There are GPUs faster than the M2 Ultra that can't run LLMs normally, which makes that speed a moot point for LLM use cases.
7. numpad0 ◴[] No.40054567[source]
I suspect there are difficulties with DRAM latency and/or signal integrity with APUs and RAM-expandable GPUs. ALUs are wasted if you'd be stalling deep SIMD pipelines all the time.
8. Dylan16807 ◴[] No.40055112[source]
> Old becomes new

I disagree. This is like pointing at a 2024 model hatchback and saying "old becomes new" because you can cite a hatchback from 50 years ago.

There's a bevy of ways to isolate or combine memory pools, and mainstream hardware has consistently used many of these methods the entire time.

9. Dylan16807 ◴[] No.40055288{3}[source]
The real problem is a lack of competition in GPU production.

GDDR is not very expensive. You should be able to get a GPU with a mid-level chip and tons of memory, but it's just not offered. Instead, please pay triple or quadruple the price of a high end gaming GPU to get a model with double the memory and basically the same core.

The level of markup is astonishing. I can go from 8GB to 16GB on AMD for $60, but going from 24GB to 48GB costs $3000. And nvidia isn't better.

replies(4): >>40055629 #>>40055989 #>>40056434 #>>40069867 #
10. zozbot234 ◴[] No.40055629{4}[source]
> You should be able to get a GPU with a mid-level chip and tons of memory, but it's just not offered.

Apple unified memory is the easiest way to get exactly that. There is a markup on memory upgrades but it's quite reasonable, not at all extreme.

replies(1): >>40055688 #
11. Dylan16807 ◴[] No.40055688{5}[source]
But I can't put that GPU into my existing machine, so I'm still paying $3000 extra if I don't want that Apple machine to be my computer.
12. ◴[] No.40055989{4}[source]
13. fulafel ◴[] No.40056075[source]
The fancier and more expensive SGIs had higher-bandwidth, non-UMA memory systems on the GPUs.

The thing about bandwidth is that you can just make wider buses with more of the same memory chips in parallel.
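
A minimal sketch of that scaling, assuming x32 GDDR6 chips at an illustrative 18000 MT/s: aggregate bandwidth grows linearly with how many chips (and hence how many bus bits) you wire up in parallel.

  # "More chips in parallel": each x32 GDDR6 chip contributes its own slice of the
  # bus, so peak bandwidth scales with chip count. 18000 MT/s is an assumed figure.
  per_chip_gb_s = 32 / 8 * 18000e6 / 1e9  # ~72 GB/s per x32 chip

  for chips in (4, 8, 12, 16):
      bus_width = chips * 32
      print(f"{chips:2d} chips -> {bus_width}-bit bus -> ~{chips * per_chip_gb_s:.0f} GB/s")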

14. nsteel ◴[] No.40056434{4}[source]
This isn't my area but won't it be quite expensive to use that GDDR? The PHY and the controller are complicated and you've got max 2GB devices, so if you want more memory you need a wider bus. That requires more beachfront and therefore a bigger die. That must make it expensive once you go beyond what you can fit on your small, cheap chip. Do their 24GB+ cards really use GDDR?*

And you need to ensure you don't shoot yourself in the foot by making anything (relatively) cheap that could be useful for AI...

*Edit: wow, yeah, they do! A 384-bit interface on some! Sounds hot.

replies(1): >>40059573 #
15. Dylan16807 ◴[] No.40059573{5}[source]
The upgrade is actually remarkably similar.

RX7600 and RX7600XT have 8GB or 16GB attached to a 128-bit bus.

RX7900XT and W7900 have 24GB or 48GB attached to a 384-bit bus.

Neither upgrade changes the bus width, only the memory chips.

The two upgraded models even use the same memory chips, as far as I can tell! GDDR6, 18000MT/s, 2GB per 16 pins. I couldn't confirm chip count on the W7900 but it's the same density.
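
For anyone checking the arithmetic, here is a small sketch of how those capacities fall out of bus width and chip wiring, assuming 2GB (16Gb) GDDR6 chips throughout, as described above; whether a card gives each chip a full 32-bit channel or has two chips share one is the clamshell question discussed further down.

  # Capacity = (number of chips) * (GB per chip), where the chip count is set by
  # how many data bits of the bus each chip is given. 2GB chips assumed throughout.

  def capacity_gb(bus_width_bits: int, bits_per_chip: int, gb_per_chip: int = 2) -> int:
      chips = bus_width_bits // bits_per_chip
      return chips * gb_per_chip

  # RX 7600 / RX 7600 XT: 128-bit bus
  print(capacity_gb(128, bits_per_chip=32))  # 8 GB  (one chip per 32-bit channel)
  print(capacity_gb(128, bits_per_chip=16))  # 16 GB (two chips share each channel)

  # RX 7900 XT / W7900: 384-bit bus
  print(capacity_gb(384, bits_per_chip=32))  # 24 GB
  print(capacity_gb(384, bits_per_chip=16))  # 48 GB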

replies(1): >>40064467 #
16. nsteel ◴[] No.40064467{6}[source]
Perhaps I should have been clearer and said "if you want more memory than the 16GB model, you need a wider bus", but this confirms what I described above:

  - cheap gfx die with a 128-bit memory interface.
  - vastly more expensive gfx die with a 384-bit memory interface.
Essentially there's a cheap upgrade option available for each die: swapping the 1GB memory chips for 2GB memory chips (GDDR doesn't support a mix). If you have the 16GB model and want more memory, there are no bigger memory chips available, so you need a wider bus, which costs significantly more to produce, and hence they charge more.

As a side note, I would expect the GDDR chips to be x32 rather than x16.

replies(1): >>40068306 #
17. Dylan16807 ◴[] No.40068306{7}[source]
> Essentially there's a cheap upgrade option available for each die

Hmm, I think you missed what my actual point was.

You can buy a card with the cheap die for $270.

You can buy a card with the expensive die for $1000.

So far, so good.

The cost of just the memory upgrade for the cheap die is $60.

The cost of just the memory upgrade for the expensive die is $3000.

None of that $3000 is going toward upgrading the bus width. Maybe one percent of it goes toward upgrading the circuit board. It's a huge market segmentation fee.

> As a side note, I would expect the the GDDR chips to be x32, rather than 16.

The pictures I've found show the 7600 using 4 ram chips and the 7600 XT using 8 ram chips.

replies(1): >>40069347 #
18. nsteel ◴[] No.40069347{8}[source]
I did miss your point; I hadn't understood that the hefty price tag was just for the memory. Thank you for not biting my head off! That is pretty mad. I'm struggling to think of any explanation other than yours. Even the few supporting upgrades that may be required for the extra memory (power supplies, etc.) wouldn't be anywhere near that cost.

I couldn't find much to actually confirm how many chips are used, but https://www.techpowerup.com/review/sapphire-radeon-rx-7600-x... says H56G42AS8DX-014, which is a x32 part (https://product.skhynix.com/products/dram/gddr/gddr6.go?appT...). Either way, it can't explain that pricing!

replies(1): >>40071488 #
19. paulmd ◴[] No.40069867{4}[source]
What exactly do you expect them to do about the routing? 384b GPUs are as big as memory buses get in the modern era, and that gives you 24GB capacity, or 48GB when clamshelled. Higher-density 3GB modules that would allow 36GB/72GB have been repeatedly pushed back, which has kinda screwed over consoles as well - they're in the same bind with "insufficient" VRAM on the Microsoft side, and while Sony is less awful, they didn't give a VRAM increase with the PS5 Pro either.

GB202 might be going to a 512b bus, which is unprecedented in the modern era (nobody has done it since the Hawaii/GCN2 days), but that's really as big as anybody cares to route right now. What do you propose for going past that?

Ironically AMD actually does have the capability to experiment and put bigger memory buses on smaller dies. And the MCM packaging actually does give you some physical fanout that makes the routing easier. But again, ultimately there is just no appetite for going to 512b or 768b buses at a design level, for very good technical reasons.

Like it just is what it is - GDDR is just not very dense, and this is what you can get out of it. Production of HBM-based GPUs is constrained by stacking capacity, which is the same reason people can't get enough datacenter cards in the first place.

Higher-density LPDDR does exist, but you still have to route it, and bandwidth goes down quite a bit. And that's the Apple Silicon approach, which solves your problem but unfortunately a lot of people just flatly reject any offering from the Fruit Company for interpersonal reasons.
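
A quick sketch of the capacity points being made here, assuming one GDDR module per 32 bits of bus and doubling for clamshell; the 3GB rows correspond to the delayed higher-density modules, and 512b is included only for comparison.

  # Capacity = (bus width / 32) modules * module density, doubled when clamshelled.
  # 3GB modules are the higher-density parts described as repeatedly pushed back.
  for bus_bits in (256, 384, 512):
      for gb_per_module in (2, 3):
          modules = bus_bits // 32
          normal = modules * gb_per_module
          print(f"{bus_bits}-bit, {gb_per_module}GB modules: "
                f"{normal}GB normal / {normal * 2}GB clamshell")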

replies(1): >>40072886 #
20. Dylan16807 ◴[] No.40071488{9}[source]
The first two pictures on that page show 4 RAM chips on each side of the board.

https://media-www.micron.com/-/media/client/global/documents...

This document directly talks about splitting a 32 data bit connection across two GDDR6 chips, on page 7.

"Allows for a doubling of density. Two 8Gb devices appear to the controller as a single, logical 16Gb device with two 16-bite wide channels."

Do that with 16Gb chips and you match the GPUs we're talking about.

replies(1): >>40074037 #
21. Dylan16807 ◴[] No.40072886{5}[source]
> what exactly do you expect them to do about the routing? 384b gpus are as big as memory buses get in the modern era and that gives you 24gb capacity or 48gb when clamshelled.

I'm not asking for more than that. I'm asking for that to be available on a mainstream consumer model.

Most of the mid-tier GPUs from AMD and nVidia have 256-bit memory buses. I want 32GB on that bus at a reasonable price. Let's say $150 or $200 more than the 16GB version of the same model.

I appreciate the information about higher densities, though.

22. nsteel ◴[] No.40074037{10}[source]
Of course - clamshell mode! They don't connect half the data bus from each chip. That also explains how they fit it on the card so easily (and cheaply).
replies(1): >>40078610 #
23. Dylan16807 ◴[] No.40078610{11}[source]
Yeah, though a way to connect fewer data pins to each chip doesn't necessarily have to be clamshell; it just requires a little bit of flexibility in the design.