Most active commenters
  • dangus(4)
  • chaostheory(3)
  • wongarsu(3)

172 points marban | 42 comments
InTheArena ◴[] No.40051885[source]
While everyone has focused on Apple's power efficiency in the M series chips, one thing that has been very interesting is how powerful the unified memory model (memory on-package with the CPU, with large bandwidth to it) actually is. Hence a lot of people in the local LLaMA community are really going after high-memory Macs.

It's great to see NPUs here with the new Ryzen cores - but I wonder how effective they will be with off-die memory versus the Apple approach.

That said, it's nothing but great to see these capabilities in something other than an expensive NVIDIA card. Local NPUs may really help with deploying more inferencing capabilities at the edge.

Edited - sorry, meant on-package.

replies(8): >>40051950 #>>40052032 #>>40052167 #>>40052857 #>>40053126 #>>40054064 #>>40054570 #>>40054743 #
1. chaostheory ◴[] No.40052032[source]
What Apple has is theoretically great on paper, but it fails to live up to expectations. What's the point of having the RAM to run an LLM locally when the performance is abysmal compared to running it on even a consumer Nvidia GPU? It's a missed opportunity that I hope either the M4 or M5 addresses.
replies(8): >>40052327 #>>40052344 #>>40052929 #>>40053695 #>>40053835 #>>40054577 #>>40054855 #>>40056153 #
2. InTheArena ◴[] No.40052327[source]
The performance of Ollama on my M1 Max is pretty solid - and it does things that my 2070 GPU can't do because of memory.
replies(1): >>40052675 #
3. bearjaws ◴[] No.40052344[source]
It's a 25W processor. How will it ever live up to a 400W GPU? Also, you can't even run large models on a single 4090, but you can on M series laptops with enough RAM.

The fact that a laptop can run 70B+ parameter models is a miracle; it's not what the chip was built to do at all.

replies(3): >>40052910 #>>40056761 #>>40059613 #
4. dangus ◴[] No.40052675[source]
Not that I don't believe you, but the 2070 is two generations and 5 years old. Maybe a comparison to a 4000 series would be more appropriate?
replies(2): >>40052731 #>>40052773 #
5. Kirby64 ◴[] No.40052731{3}[source]
The M1 Max is also 2 generations old, and ~3 years old at this point. Seems like a fair comparison to me.
replies(2): >>40052845 #>>40052863 #
6. Teever ◴[] No.40052773{3}[source]
Well, you know that it would still be able to do more than a 4000 series GPU from Nvidia, because you can have more system memory in a Mac than you can have video RAM in a 4000 series GPU.
replies(1): >>40052862 #
7. dangus ◴[] No.40052845{4}[source]
The 4000 series still represents a much bigger generational leap.

The M3 Max has something like 33% faster overall graphics performance than the M1 Max (average benchmark), while the 4090 is something like 138% faster than the 2080 Ti.

Depending on which 2070 and 4070 models you compare, the difference is similar, close to or exceeding a 100% uplift.

replies(1): >>40055156 #
8. dangus ◴[] No.40052862{4}[source]
Yes, obviously I’m aware that you can throw more RAM at an M-series GPU.

But of course that’s only helpful for specific workflows.

9. talldayo ◴[] No.40052863{4}[source]
Maybe it's controversial, but I don't think pitting 5nm mobile hardware from 2021 against 12nm desktop hardware from 2018 is a fair fight.

And still, performance-wise, the 2070 wins out by a ~33% margin: https://browser.geekbench.com/opencl-benchmarks

replies(2): >>40053373 #>>40055651 #
10. wongarsu ◴[] No.40052910[source]
It's a valid comparison in the very limited sense of "I have $2000 to spend on a way to run LLMs, should I get an RTX 4090 for the computer I have or should I get a 24GB MacBook", or "I have $5000, should I get an RTX A6000 48GB or a 96GB MacBook".

Those comparisons are unreasonable in a sense, but they are implied by statements like the GP's "Hence a lot of people in the local LLaMA community are really going after high-memory Macs".

replies(3): >>40053367 #>>40054359 #>>40054554 #
11. zitterbewegung ◴[] No.40052929[source]
An M3 Max with 128GB of RAM will underperform any consumer NVIDIA GPU, but in practice it will be able to load larger models, just slowly.

I think for the M series chips to aggressively target GPU inference or training, Apple would need a strategy that brings the speed of the RAM up to match GDDR6 or HBM3, or uses those memory types directly.

replies(2): >>40053002 #>>40056817 #
12. ◴[] No.40053002[source]
13. fckgw ◴[] No.40053367{3}[source]
No, it is not a valid comparison to make between an entire laptop and a single PC part. "The computer I have" is doing a ton of heavy lifting here.
replies(2): >>40053642 #>>40053845 #
14. chessgecko ◴[] No.40053373{5}[source]
For this comparison the generation of the chip doesn't really matter, because LLM decode (which is the costly step) barely uses any of the compute and just needs the model weights to fit in memory.
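
A rough back-of-the-envelope sketch of that argument; the bandwidth and model-size figures below are illustrative assumptions, not measurements from this thread:

    # Each generated token streams the full set of (quantized) weights through
    # memory once, so decode speed is roughly capped at bandwidth / model size.
    def decode_ceiling_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
        return bandwidth_gb_s / model_size_gb

    # Assumed sizes: a 70B model at 4-bit is roughly 40 GB of weights.
    print(decode_ceiling_tokens_per_sec(400.0, 40.0))   # ~10 tok/s ceiling at 400 GB/s (M-series class)
    print(decode_ceiling_tokens_per_sec(1008.0, 40.0))  # ~25 tok/s ceiling at 1008 GB/s (4090 class, if it fit)
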
15. wongarsu ◴[] No.40053642{4}[source]
"Should I upgrade what I have or buy something new" is a completely normal everyday decision. Of course it doesn't apply to everyone since it presumes you have something compatible to upgrade, but it is a real decision lots of people are making
replies(1): >>40053750 #
16. evilduck ◴[] No.40053695[source]
That completely depends on your expectations and uses.

I have a gaming rig with a 4080 with 16GB of RAM and it can't even run Mixtral (kind of the minimum bar of a useful generic LLM in my opinion) without being heavily quantized. Yeah it's fast when something fits on it, but I don't see much point in very fast generation of bad output. A refurbished M1 Max with 32GB of RAM will enable you to generate better quality LLM output than even a 4090 with 24GB of VRAM, for ~$300 less, and it's a whole computer instead of a single part that still needs a computer around it. Compared to the Mac, my 4080 and the computer around it get you half the memory capacity at greater cost.

If you're building a rig with multiple GPUs to support many users or for internal private services and are willing to drop more than $3k then I think the equation swings back in favor of Nvidia, but not until then.
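
To make the sizing concrete, a rough sketch; the parameter count and overhead factor are approximations, not figures from the thread:

    # Mixtral 8x7B has roughly 47B total parameters, and all experts must be
    # resident in memory even though only two are active per token.
    PARAMS = 47e9

    def weights_gb(bits_per_weight: float, overhead: float = 1.1) -> float:
        """Approximate memory for the weights alone (KV cache is extra)."""
        return PARAMS * bits_per_weight / 8 / 1e9 * overhead

    for bits in (16, 8, 4, 3):
        print(f"{bits}-bit: ~{weights_gb(bits):.0f} GB")
    # ~103 / ~52 / ~26 / ~19 GB: a 16GB card needs aggressive 2-3 bit quants,
    # while 32GB of unified memory holds a 4-bit quant comfortably.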

replies(4): >>40053808 #>>40053912 #>>40053916 #>>40059721 #
17. fckgw ◴[] No.40053750{5}[source]
But it's also assuming everyone has a desktop PC capable of this stuff that can be upgraded.
18. soupbowl ◴[] No.40053808[source]
Just buy another 16GB of RAM for $80...
replies(2): >>40053856 #>>40054038 #
19. ◴[] No.40053835[source]
20. michaelt ◴[] No.40053845{4}[source]
Sorta yes, sorta no.

You're certainly right that with a macbook you get a whole computer, so you're getting more for your money. And it's a luxury high-end computer too!

But personally, I've never seen anyone step directly from not-even-having-a-PC to buying a 4090 for $1800. Folks that aren't technically inclined by and large stick with hosted models like ChatGPT.

More common in my experience is for technical folks with, say, an 8GB GPU to experiment with local ML, decide they're interested in it, then step up to a 4090 or something.

21. elzbardico ◴[] No.40053856{3}[source]
GPU RAM?
22. cjk2 ◴[] No.40053912[source]
Yeah this.

Also, as an anecdote: my daily driver machine is a bottom-end M2 Mac mini because I am a cheap ass. I paid less for the 4070 card in my desktop PC than I did for the Mac. The M2 Mac does a dehaze from RAW in Lightroom in 18 seconds; my 4070 takes 9 seconds. So the GPU is twice as fast, but the Mac has a whole free computer stuck to it.

23. ◴[] No.40053916[source]
24. evilduck ◴[] No.40054038{3}[source]
Running an LLM that's larger than your GPU's VRAM in regular DDR RAM will completely slaughter your tokens/s, to the point that the Mac comes out ahead again.
replies(1): >>40056550 #
25. oceanplexian ◴[] No.40054359{3}[source]
The answer depends on what you plan to do with it.

Do you need to do fine-tuning and get the highest inference performance with smaller models? Are you planning to use it as a lab to learn the tools used in Big Tech (e.g. CUDA)? Or do you just want to do slow inference on super huge models (e.g. Grok)?

Personally, I chose the Nvidia route because, as a backend engineer, I know Macs aren't seriously used in datacenters. The same frameworks I use to develop on a 3090 are transferable to massive, InfiniBand-connected clusters with TBs of VRAM.

26. 0x457 ◴[] No.40054554{3}[source]
I think there are two communities:

- the "hobbyists" with $5k GPUs

- People who work in the industry who have never used anything but a Mac; even if they have, explaining to IT that you need a PC with an RTX A6000 48GB instead of a Mac like literally everyone else in the company is a losing battle.

replies(1): >>40054806 #
27. instagib ◴[] No.40054577[source]
One thing I would consider is usage throttling on a MacBook Pro. Would repeated LLM usage run into throttling?

No idea what specifically everyone is pulling their performance data from or what task(s).

Here is a video to help visualize the differences, with a maxed-out M3 Max vs a 16GB M1 Pro vs a 4090 running Llama 2 7B/13B/70B: https://youtu.be/jaM02mb6JFM

Here's a Reddit comparison of a 4090 vs an M2 Ultra 96GB, with tokens/s:

https://old.reddit.com/r/LocalLLaMA/comments/14319ra/rtx_409...

M3 Pro memory BW: 150 GB/s; M3 Max (10/30): 300 GB/s; M3 Max (12/40): 400 GB/s

“Llama models are mostly limited by memory bandwidth. The RTX 3090 has 935.8 GB/s, the RTX 4090 has 1008 GB/s, the M2 Ultra has 800 GB/s, and the M2 Max has 400 GB/s, so the 4090 is 10% faster for Llama inference than the 3090 and more than 2x faster than the Apple M2 Max. Using exllama (https://github.com/turboderp/exllama) you can get 160 tokens/s in a 7B model and 97 tokens/s in a 13B model, while the M2 Max gets only 40 tokens/s in a 7B model and 24 tokens/s in a 13B. The memory bandwidth cap is also the reason why llamas work so well on CPU (…)

Buying a second GPU will increase memory capacity to 48GB but has no effect on bandwidth, so 2x 4090 will have 48GB of VRAM and 1008 GB/s of bandwidth at 50% utilization”

28. wongarsu ◴[] No.40054806{4}[source]
There is also an important third group:

- people who work outside Silicon Valley, where the entire company uses Windows centrally managed through Active Directory, and explaining to IT that you need a Mac is an uphill battle. So you just submit a request for an RTX A6000 48GB to be added to your existing workstation

Those people are the intended target customer of the A6000, and there are a lot of them.

replies(1): >>40067455 #
29. john_alan ◴[] No.40054855[source]
What are you talking about? It's literally the fastest single-core retail CPU globally, and multi-core is close too: https://browser.geekbench.com/processor-benchmarks
replies(1): >>40063026 #
30. whizzter ◴[] No.40055156{5}[source]
Googling power draw, the 4090 goes up to 450W whilst the 2080 Ti was at 250W; adjusting for power consumption, the increase is somewhere around 32%. There are some architectural gains and probably optimizations in how the chips work, but we're not seeing as many amazing generational leaps anymore, regardless of manufacturer/designer.
replies(1): >>40077726 #
31. JudasGoat ◴[] No.40055651{5}[source]
I found it interesting that the Apple M3 scored nearly identically to the Radeon 780M. I know the memory bandwidth is slower, but you can add two 32GB SODIMMs to the AMD APU for short money.
32. jwr ◴[] No.40056153[source]
Hmm. I'm running decent LLMs locally (deepseek-coder:33b-instruct-q8_0, mistral:7b-instruct-v0.2-q8_0, mixtral:8x7b-instruct-v0.1-q4_0) on my MacBook Pro and they respond pretty quickly. At least for interactive use they are fine and comparable to Anthropic Opus in speed.

That MacBook has an M3 Max and 64GB RAM.

I'd say it does live up to my expectations, perhaps even slightly exceeds them.

33. programd ◴[] No.40056550{4}[source]
Depends on what you're doing. Just chatting with the AI?

I'm getting about 7 tokens per sec for Mistral with the Q6_K on a bog standard Intel i5-11400 desktop with 32G of memory and no discrete GPU (the CPU has Intel UHD Graphics 730 built in). 2 year old low end CPU that goes for, what $150? these days. As far as I'm concerned that's conversational speed. Pop in some 8 core modern CPU and I'm betting you can double that, without even involving any GPU.

People way overestimate what they need in order to play around with models these days. Use llama.cpp and buy that extra $80 worth of RAM and pay about half the price of a comparable Mac all in. Bigger models? Buy more RAM, which is very cheap these days.

There's a $487 special on Newegg today with an i7-12700KF, motherboard and 32G of ram. Add another $300 worth of case, power supply, SSD and more RAM and you're under the price of a Macbook Air. There's your LLM inference machine (not for training obviously) which can run even the 70B models at home at acceptable conversational speed.

34. chaostheory ◴[] No.40056761[source]
The problem is that it extends to both the Mac Studio and the Mac Pro.
35. chaostheory ◴[] No.40056817[source]
You summed up my point better than I did
36. dragonwriter ◴[] No.40059613[source]
> The fact a laptop can run 70B+ parameter models is a miracle

Most laptops with 64+GB of RAM can run a 70B model at 4-bit quantization. It's not a miracle, it's just math. An M2 can simply do it faster than systems with slower memory bandwidth.
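
The math in question, roughly; the few-GB allowance for KV cache and runtime overhead is an assumption:

    # 70B parameters at 4 bits per weight.
    weights_gb = 70e9 * 4 / 8 / 1e9      # 35.0 GB of weights
    working_set_gb = weights_gb + 5.0    # plus a few GB for KV cache and runtime overhead
    print(working_set_gb <= 64)          # True: it fits in a 64GB machine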

replies(1): >>40064678 #
37. dragonwriter ◴[] No.40059721[source]
> I have a gaming rig with a 4080 with 16GB of RAM and it can’t even run Mixtral (kind of the minimum bar of a useful generic LLM in my opinion) without being heavily quantized.

Which Mixtral? What does “heavily quantized” mean?

> A refurbished M1 Max with 32GB of RAM will enable you to generate better quality LLM output than even a 4090 with 24GB of VRAM

Not if we assume the 4090 is in a computer with non-trivial system RAM (which it kind of needs to be), since models can be split between GPU and CPU. The M1 Max might have better performance than the 4090 machine for models that take >24GB and <=32GB, but assuming the 4090 is in a machine with 16GB+ of system RAM, it will be able to run bigger models than the M1 Max, as well as outperforming it for models requiring up to 24GB of RAM.

The system I actually use is a gaming laptop with a 16GB 3080Ti and 64GB of system RAM, and Mixtral 8x7B @ 4-bit (~30GB RAM) works.
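
That GPU/CPU split looks roughly like the following with llama-cpp-python; the file name and layer count are placeholders, not values from the thread:

    # Partial offload: put as many transformer layers as fit into VRAM and
    # keep the rest in system RAM. Tune n_gpu_layers to your card.
    from llama_cpp import Llama

    llm = Llama(
        model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=20,   # e.g. ~20 layers on a 16GB card, rest on the CPU
        n_ctx=4096,
    )
    out = llm("Explain why LLM decode is memory-bound.", max_tokens=64)
    print(out["choices"][0]["text"])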

38. edward28 ◴[] No.40063026[source]
I would take those benchmarks with a grain of salt, given that they show a 96-core Epyc losing in multi-thread to a 64-core Epyc and a 32-core Xeon.
39. ◴[] No.40064678{3}[source]
40. 0x457 ◴[] No.40067455{5}[source]
While there are many people "that work outside Silicon Valley, where the entire company uses Windows centrally managed through Active Directory, and explaining to IT that you need a Mac is an uphill battle", I think these companies either don't care about AI or get into it by acquiring a startup (that runs on Macs).

Companies that run Windows and AD are too busy making sure you move your mouse every 5 minutes while you're on the clock. At least that is my experience.

replies(1): >>40076476 #
41. MichaelZuo ◴[] No.40076476{6}[source]
There are some genuine, non-Dilbert-esque uses for AD, where there really isn't a viable, similarly performant alternative.
42. dangus ◴[] No.40077726{6}[source]
I’m still seeing over a 100% uplift comparing mobile to mobile on Nvidia products: https://gpu.userbenchmark.com/Compare/Nvidia-RTX-4090-Laptop...

As far as desktop products go, power consumption is irrelevant.