
InTheArena No.40051885
While everyone has focused on the power efficiency of Apple's M-series chips, one thing that has been very interesting is how powerful the unified memory model (with the memory on-package with the CPU) and its large memory bandwidth actually are. Hence a lot of people in the local LLaMA community are really going after high-memory Macs.

It's great to see NPUs here with the new Ryzen cores - but I wonder how effective they will be with off-die memory versus the Apple approach.

That said, it's nothing but great to see these capabilities in something other than an expensive NVIDIA card. Local NPUs may really help with deploying more inference capabilities at the edge.

Edited - sorry, meant on-package.
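
Part of the appeal is simply capacity: a 70B model does not fit in a 24 GB consumer GPU at all, but does fit in a high-memory Mac's unified memory. A back-of-envelope sketch in Python, assuming roughly 4-bit quantization and ~10% overhead (both illustrative assumptions, not figures from this thread):

    # Back-of-envelope: bytes per parameter = bits_per_weight / 8, plus overhead.
    def model_size_gb(params_billion, bits_per_weight=4.0, overhead=1.1):
        return params_billion * 1e9 * (bits_per_weight / 8) * overhead / 1e9

    for params in (7, 13, 70):
        size = model_size_gb(params)
        print(f"{params}B @ ~4-bit: ~{size:.1f} GB "
              f"(fits in 24 GB VRAM: {size < 24}; in 96 GB unified memory: {size < 96})")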

chaostheory No.40052032
What Apple has is great on paper, but it fails to live up to expectations. What's the point of having the RAM to run an LLM locally when the performance is abysmal compared to running it on even a consumer Nvidia GPU? It's a missed opportunity that I hope either the M4 or M5 addresses.
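
For concreteness, this is roughly what "running it locally" looks like with llama-cpp-python; the model path and quantization are placeholders, and on Apple Silicon the package has to be built with Metal support for the GPU offload to do anything (a minimal sketch, not a tuned setup):

    from llama_cpp import Llama

    # Hypothetical GGUF file; any quantized Llama 2 checkpoint would do.
    llm = Llama(
        model_path="./llama-2-7b.Q4_K_M.gguf",
        n_gpu_layers=-1,  # offload all layers (Metal on Apple Silicon, CUDA on Nvidia)
        n_ctx=2048,
    )

    out = llm("Explain unified memory in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])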
instagib No.40054577
One thing I would consider is usage throttling on a MacBook Pro. Would repeated LLM usage run into throttling?

No idea what specifically everyone is pulling their performance data from, or for what task(s).
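
One crude way to answer the throttling question on a given machine: run the same generation in a loop and watch whether tokens/s drifts down as the laptop heats up. A sketch with llama-cpp-python (same hypothetical model file as the sketch above; the usage field is its OpenAI-style completion dict):

    import time

    from llama_cpp import Llama

    llm = Llama(model_path="./llama-2-7b.Q4_K_M.gguf", n_gpu_layers=-1, n_ctx=2048)

    # If sustained generation hits thermal throttling, tokens/s should decline
    # across iterations; on a well-cooled machine it should stay roughly flat.
    for i in range(20):
        start = time.perf_counter()
        out = llm("Write a short paragraph about memory bandwidth.", max_tokens=256)
        elapsed = time.perf_counter() - start
        generated = out["usage"]["completion_tokens"]
        print(f"run {i:2d}: {generated / elapsed:6.1f} tokens/s")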

Here is a video to help visualize the differences between a maxed-out M3 Max, a 16 GB M1 Pro, and a 4090 on Llama 2 7B/13B/70B: https://youtu.be/jaM02mb6JFM

Here’s a Reddit comparison of a 4090 vs an M2 Ultra (96 GB), with tokens/s figures:

https://old.reddit.com/r/LocalLLaMA/comments/14319ra/rtx_409...

M3 Pro: 150 GB/s memory bandwidth
M3 Max (10/30): 300 GB/s
M3 Max (12/40): 400 GB/s

“Llama models are mostly limited by memory bandwidth. The RTX 3090 has 935.8 GB/s, the RTX 4090 has 1008 GB/s, the M2 Ultra has 800 GB/s, and the M2 Max has 400 GB/s, so the 4090 is 10% faster for Llama inference than the 3090 and more than 2x faster than the Apple M2 Max.

Using exllama (https://github.com/turboderp/exllama) you can get 160 tokens/s on a 7B model and 97 tokens/s on a 13B model, while the M2 Max gets only 40 tokens/s on 7B and 24 tokens/s on 13B. The memory bandwidth cap is also the reason why llamas work so well on CPU. (…)

Buying a second GPU will increase memory capacity to 48 GB but has no effect on bandwidth, so 2x 4090 will have 48 GB VRAM and 1008 GB/s bandwidth at 50% utilization.”
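
The bandwidth-bound claim is easy to sanity-check: during decode, each new token has to stream essentially the whole weight set through memory once, so tokens/s is bounded above by bandwidth divided by model size. A rough sketch using the bandwidth figures quoted above (the ~4-bit model sizes are my assumption):

    # Upper bounds only; real throughput is lower due to compute, KV cache
    # traffic, and software overhead. Model sizes assume ~4-bit quantization.
    bandwidth_gb_s = {"RTX 3090": 935.8, "RTX 4090": 1008, "M2 Ultra": 800, "M2 Max": 400}
    model_size_gb = {"7B": 3.9, "13B": 7.2}

    for hw, bw in bandwidth_gb_s.items():
        bounds = ", ".join(f"{m}: <= {bw / s:.0f} tok/s" for m, s in model_size_gb.items())
        print(f"{hw:9s} {bounds}")

The quoted exllama numbers (160 and 97 tokens/s on a 4090) sit well within these bounds, while the M2 Max figures (40 and 24 tokens/s) fall far below its roughly 100 tokens/s ceiling for 7B, which suggests the Mac results are not purely bandwidth-limited; compute and software maturity likely play a role too.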