InTheArena No.40051885
While everyone has focused on the power efficiency of Apple's M-series chips, one thing that has been very interesting is how powerful the unified memory model (with the memory on-package with the CPU) and its large memory bandwidth actually are. Hence a lot of people in the local LLaMA community are really going after high-memory Macs.
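(To put a rough number on why the bandwidth matters: during token generation the weights get re-read for roughly every token, so decode speed is more or less capped at bandwidth divided by model size. The bandwidth figures below are published ballpark specs and the model size is just an assumed example, not a benchmark.)

    # Rough upper bound on decode speed: generating one token reads
    # (approximately) every active weight once, so tok/s <= bandwidth / model size.
    model_gb = 26.0  # assumed example: a ~4-bit quant of a mid-size model
    for name, bw_gb_s in [("dual-channel DDR5", 96),
                          ("M1 Max unified memory", 400),
                          ("RTX 4090 GDDR6X", 1008)]:
        print(f"{name}: <= ~{bw_gb_s / model_gb:.0f} tok/s")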

It's great to see NPUs here with the new Ryzen cores - but I wonder how effective they will be with off-die memory versus the Apple approach.

That said, it's nothing but great to see these capabilities in something other than an expensive NVIDIA card. Local NPUs may really help with deploying more inferencing capabilities at the edge.

Edited: sorry, meant on-package.

chaostheory No.40052032
What Apple has is theoretically great on paper, but it fails to live up to expectations. What's the point of having the RAM for running an LLM locally when the performance is abysmal compared to running it on even a consumer Nvidia GPU? It's a missed opportunity that I hope either the M4 or M5 addresses.
evilduck No.40053695
That completely depends on your expectations and uses.

I have a gaming rig with a 4080 with 16GB of RAM and it can't even run Mixtral (kind of the minimum bar of a useful generic LLM in my opinion) without being heavily quantized. Yeah, it's fast when something fits on it, but I don't see much point in very fast generation of bad output. A refurbished M1 Max with 32GB of RAM will let you generate better-quality LLM output than even a 4090 with 24GB of VRAM, for ~$300 less, and it's a whole computer instead of a single part that still needs a computer around it. As for my 4080: that GPU plus the machine around it gets you half the memory capacity of the Mac, at greater cost.
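To put rough numbers on "heavily quantized" (bits-per-weight values below are approximate GGUF figures, and this ignores KV cache and runtime overhead):

    # Approximate weight memory for Mixtral 8x7B (~46.7B total parameters).
    params_b = 46.7
    for quant, bits_per_weight in [("fp16", 16.0), ("q8_0", 8.5),
                                   ("q4_K_M", 4.8), ("q3_K_M", 3.9)]:
        print(f"{quant}: ~{params_b * bits_per_weight / 8:.0f} GB of weights")
    # Even at ~4 bits/weight the weights alone are ~28 GB, so on a 16 GB card
    # an aggressive quant still has to spill into system RAM.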

If you're building a rig with multiple GPUs to support many users or for internal private services, and are willing to drop more than $3k, then I think the equation swings back in favor of Nvidia, but not until then.

dragonwriter No.40059721
> I have a gaming rig with a 4080 with 16GB of RAM and it can’t even run Mixtral (kind of the minimum bar of a useful generic LLM in my opinion) without being heavily quantized.

Which Mixtral? What does “heavily quantized” mean?

> A refurbished M1 Max with 32GB of RAM will enable you to generate better quality LLM output than even a 4090 with 24GB of VRAM

Not if the 4090 is in a computer with non-trivial system RAM (which it kind of needs to be), since models can be split between GPU and CPU. The M1 Max might have better performance than the 4090 machine with 24GB of VRAM for models that take >24GB and <=32GB, but assuming the 4090 is in a machine with 16GB+ of system RAM, it will be able to run bigger models than the M1 Max, as well as outperform it for models requiring up to 24GB of RAM.

The system I actually use is a gaming laptop with a 16GB 3080Ti and 64GB of system RAM, and Mixtral 8x7B @ 4-bit (~30GB RAM) works.
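For the curious, a minimal sketch of that GPU/CPU split using llama-cpp-python (the model filename and layer count are placeholders, not a recommendation; you tune n_gpu_layers until VRAM is nearly full and the remaining layers run from system RAM on the CPU):

    # Partial offload: put as many transformer layers as fit on the GPU,
    # keep the remainder in system RAM. Assumes a CUDA-enabled build of
    # llama-cpp-python and a GGUF quant of the model on disk.
    from llama_cpp import Llama

    llm = Llama(
        model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # placeholder filename
        n_gpu_layers=15,  # illustrative; tune to your VRAM (Mixtral has 32 layers)
        n_ctx=4096,
    )
    out = llm("Summarize why memory bandwidth matters for local LLMs.", max_tokens=128)
    print(out["choices"][0]["text"])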