
212 points by pella | 3 comments
I've often thought that one of the places AMD could distinguish itself from NVIDIA is bringing significantly higher amounts of VRAM (or memory systems that are as performant as what we currently know as VRAM) to the consumer space.

A card with a fraction of the FLOPS of cutting-edge graphics cards (and ideally proportionally less power consumption), but with 64-128GB of VRAM-equivalent, would be a game-changer for letting people experiment with large multi-modal models, and it would seriously incentivize researchers to build the next generation of tensor abstraction libraries for both CUDA and ROCm/HIP. And for gaming, you could break new ground with high-resolution textures. AMD would be back in the game.

Of course, if it's not real VRAM, it needs to be at least somewhat close on the latency and bandwidth front, so let's pop on over and see what's happening in this article...

> An Infinity Cache hit has a load-to-use latency of over 140 ns. Even DRAM on the AMD Ryzen 9 7950X3D shows less latency. Missing Infinity Cache of course drives latency up even higher, to a staggering 227 ns. HBM stands for High Bandwidth Memory, not low latency memory, and it shows.

Welp. Guess my wish isn't coming true today.

replies(10): >>42749016 #>>42749039 #>>42749048 #>>42749096 #>>42749201 #>>42749629 #>>42749785 #>>42749805 #>>42752432 #>>42752946 #
therealpygon ◴[] No.42749785[source]
I wholeheartedly agree. Nvidia is intentionally suppressing the amount of memory on its consumer GPUs to prevent data centers from using consumer cards rather than their far more expensive counterparts. The fact that they used to offer the 3060 with 12GB, but have since pushed pricing higher and limited many cards to 8GB, is a testament to that. I don’t need giga-TOPS with 8-16GB of memory; I’d be perfectly happy with half that speed but with 64GB of memory or more. Even slower memory would be fine. I don’t need 1000 t/s, but being able to load a reasonably intelligent model even at 50 t/s would be great.
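
(As a rough back-of-envelope for what that memory would buy: a model's footprint is roughly parameter count times bytes per weight, plus overhead for KV cache and buffers. A minimal sketch in Python; the ~1.2x overhead factor is an assumption for illustration, not a measured figure.)

    # Back-of-envelope: largest model that fits in a given memory budget.
    # The ~1.2x overhead for KV cache and runtime buffers is an assumption.
    BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

    def max_params_billions(mem_gb, quant="q4", overhead=1.2):
        """Largest parameter count (in billions) that fits in mem_gb."""
        usable_bytes = mem_gb * 1e9 / overhead
        return usable_bytes / BYTES_PER_PARAM[quant] / 1e9

    for mem in (16, 64, 128):
        print(f"{mem} GB -> ~{max_params_billions(mem):.0f}B params at Q4")
    # 16 GB -> ~27B, 64 GB -> ~107B, 128 GB -> ~213B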
replies(1): >>42749897 #
1. lhl ◴[] No.42749897{3}[source]
Getting to 50 tok/s for a big model requires not just memory capacity but also memory bandwidth. Currently, 1TB/s of MBW will get a 70B Q4 (~40GB) model to about 20-25 tok/s. The good news is that models continue to get smarter - today's 20-30B models beat last year's 70B models on most tasks, and the biggest open models like DeepSeek-v3 may have a lot of weights, but they use a relatively modest number of activations per forward pass.
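
(A minimal sketch of the arithmetic behind that estimate: decode is memory-bandwidth-bound, because every generated token streams roughly all of the weights once. The 85% efficiency factor is an assumption chosen to land in the quoted range, not a benchmark.)

    # Decode rate is roughly bandwidth / model size, times an efficiency factor.
    def decode_tok_per_s(bandwidth_gb_s, model_gb, efficiency=0.85):
        return bandwidth_gb_s / model_gb * efficiency

    # 70B model at Q4 (~40 GB of weights), 1 TB/s of memory bandwidth:
    print(f"~{decode_tok_per_s(1000, 40):.0f} tok/s")  # ~21 tok/s, within the 20-25 range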

You can already test that "half the speed but 64GB or more of memory" scenario with the latest Macs, AMD Strix Halo, or the upcoming Nvidia Digits, though. I suspect that by the middle of the year there will be a bunch of options in the ~$3K range. Personally, I think I'd rather go for 2 x 5090s for 64GB of memory at 1.7TB/s than 96 or 128GB with only 250GB/s of MBW.
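
(Same bandwidth-over-model-size arithmetic applied to those two options; treating 2 x 5090 as a single 1.7 TB/s pool and ignoring interconnect overhead is an assumption.)

    # 70B Q4 (~40 GB) model, assuming ~85% of peak bandwidth is usable:
    print(1700 / 40 * 0.85)  # 2 x 5090 at 1.7 TB/s: ~36 tok/s
    print(250 / 40 * 0.85)   # 96-128 GB box at 250 GB/s: ~5 tok/s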

replies(1): >>42749979 #
2. sroussey ◴[] No.42749979[source]
A Mac with that much memory will have closer to 500GB/s, but your point still stands.

That said, if you just want to play around, having more memory will let you do more interesting things. I’d rather have that option over speed since I won’t be doing production inference serving on my laptop.

replies(1): >>42750021 #
3. lhl ◴[] No.42750021[source]
Yeah, the M4 Max actually has pretty decent MBW - 546 GB/s (the cheapest config is $4.7K on a 14" MBP atm, but maybe there will be a Mac Studio at some point). The big weakness for the Mac is actually the lack of TFLOPS on the GPU - the beefiest config maxes out at ~34 FP16 TFLOPS. That makes a lot of use cases super painful, since prefill/prompt processing can take minutes before token generation starts.
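
(A rough sketch of why prefill is so slow there: prompt processing is compute-bound at roughly 2 x parameter-count FLOPs per prompt token for a dense model. The 50% utilization figure and the 32k-token prompt are assumptions for illustration.)

    # Prefill cost ~= 2 * n_params FLOPs per prompt token (dense forward pass).
    def prefill_seconds(params_b, prompt_tokens, tflops, utilization=0.5):
        flops = 2 * params_b * 1e9 * prompt_tokens
        return flops / (tflops * 1e12 * utilization)

    # 70B dense model, 32k-token prompt, ~34 FP16 TFLOPS:
    print(f"~{prefill_seconds(70, 32768, 34) / 60:.1f} minutes")  # ~4.5 minutes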