
InTheArena ◴[] No.40051885[source]
While everyone has focused on the power efficiency of Apple's M-series chips, one thing that has been very interesting is how powerful the unified memory model (memory on-package with the CPU) with its large memory bandwidth actually is. Hence a lot of people in the local LLaMA community are really going after high-memory Macs.

It's great to see NPUs here with the new Ryzen cores - but I wonder how effective they will be with off-package memory versus the Apple approach.

That said, it's nothing but great to see these capabilities in something other than an expensive NVIDIA card. Local NPUs may really help with deploying more inference capabilities at the edge.

Edited - sorry, meant on-package.

chaostheory ◴[] No.40052032[source]
What Apple has is great on paper, but it fails to live up to expectations. What's the point of having the RAM to run an LLM locally when the performance is abysmal compared to running it on even a consumer Nvidia GPU? It's a missed opportunity that I hope either the M4 or the M5 addresses.
evilduck ◴[] No.40053695[source]
That completely depends on your expectations and uses.

I have a gaming rig with a 4080 with 16GB of VRAM and it can't even run Mixtral (kind of the minimum bar for a useful general-purpose LLM, in my opinion) without heavy quantization. Yes, it's fast when something fits on it, but I don't see much point in very fast generation of bad output. A refurbished M1 Max with 32GB of RAM will let you generate better-quality LLM output than even a 4090 with 24GB of VRAM, for ~$300 less, and it's a whole computer instead of a single part that still needs a computer around it. My 4080 and the machine around it cost more than that Mac and give you half the memory capacity for models.

If you're building a rig with multiple GPUs to support many users or for internal private services and are willing to drop more than $3k then I think the equation swings back in favor of Nvidia, but not until then.
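
For a rough sense of why 16GB (or even 24GB) of VRAM is tight for Mixtral while a 32GB Mac is workable, here is a back-of-the-envelope sizing sketch. The bits-per-weight figures are approximate GGUF averages, the parameter count is the commonly cited ~46.7B total for Mixtral 8x7B, and KV cache plus runtime overhead are ignored:

  # Rough weight-only sizing for Mixtral 8x7B (~46.7B total parameters).
  # Bits-per-weight values are approximate GGUF averages, not exact specs.
  PARAMS_B = 46.7

  quant_bits = {
      "fp16":   16.0,
      "Q8_0":    8.5,
      "Q5_K_M":  5.5,
      "Q4_K_M":  4.8,
      "Q3_K_M":  3.9,
  }

  for name, bits in quant_bits.items():
      gib = PARAMS_B * 1e9 * bits / 8 / 2**30
      print(f"{name:7s} ~{gib:5.1f} GiB of weights")

  # Approximate output:
  #   fp16    ~ 87.0 GiB  -> far beyond any consumer card
  #   Q8_0    ~ 46.2 GiB
  #   Q5_K_M  ~ 29.9 GiB
  #   Q4_K_M  ~ 26.1 GiB  -> tight but feasible in 32 GB of unified memory
  #   Q3_K_M  ~ 21.2 GiB  -> still over a 4080's 16 GB of VRAM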

1. soupbowl ◴[] No.40053808[source]
Just buy another 16GB of RAM for $80...
2. elzbardico ◴[] No.40053856[source]
GPU RAM?
3. evilduck ◴[] No.40054038[source]
Running an LLM that is larger than your GPU's VRAM on regular DDR RAM will completely slaughter your tokens/s to the point that the Mac comes out ahead again.
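
To put numbers behind that: token generation is largely memory-bandwidth bound, since each generated token has to stream the active weights through the memory system once. A crude upper-bound sketch, using nominal peak bandwidth figures and the ~26 GiB Q4 Mixtral example from above (an MoE model reads less than this per token, so these bounds are pessimistic):

  # Crude upper bound: tokens/s <= memory bandwidth / bytes read per token.
  def max_tokens_per_sec(model_gib: float, bandwidth_gbs: float) -> float:
      return bandwidth_gbs * 1e9 / (model_gib * 2**30)

  MODEL_GIB = 26.0  # e.g. a ~26 GiB Q4 Mixtral GGUF (pessimistic for MoE)

  for name, bw in [
      ("dual-channel DDR4-3200", 51.2),
      ("dual-channel DDR5-5600", 89.6),
      ("M1 Max unified memory",  400.0),
      ("RTX 4090 GDDR6X",       1008.0),
  ]:
      print(f"{name:24s} <= {max_tokens_per_sec(MODEL_GIB, bw):5.1f} tok/s")

  # Roughly 1.8 tok/s on DDR4 and 3.2 on DDR5 vs ~14 on the M1 Max, which is
  # why spilling a big model out of VRAM into system RAM hurts so much.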
4. programd ◴[] No.40056550[source]
Depends on what you're doing. Just chatting with the AI?

I'm getting about 7 tokens per second for Mistral with the Q6_K quant on a bog-standard Intel i5-11400 desktop with 32GB of memory and no discrete GPU (the CPU has Intel UHD Graphics 730 built in). That's a two-year-old low-end CPU that goes for, what, $150 these days? As far as I'm concerned that's conversational speed. Pop in some modern 8-core CPU and I'm betting you can double that, without even involving a GPU.

People way overestimate what they need in order to play around with models these days. Use llama.cpp, buy that extra $80 worth of RAM, and you pay about half the price of a comparable Mac all in. Bigger models? Buy more RAM, which is very cheap these days.
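
If you want to try this route, here is a minimal CPU-only sketch using the llama-cpp-python bindings (the same llama.cpp engine); the GGUF filename is just an example path, and the thread/context settings are assumptions to tune for your own box:

  # Minimal CPU-only inference sketch (pip install llama-cpp-python).
  from llama_cpp import Llama

  llm = Llama(
      model_path="./mistral-7b-instruct-v0.2.Q6_K.gguf",  # example path to a Q6_K GGUF
      n_gpu_layers=0,  # pure CPU, no discrete GPU involved
      n_ctx=4096,      # context window; larger costs more RAM
      n_threads=6,     # match physical core count (e.g. 6 on an i5-11400)
  )

  out = llm("Q: Why does memory bandwidth matter for local LLMs? A:",
            max_tokens=128, stop=["Q:"])
  print(out["choices"][0]["text"])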

There's a $487 special on Newegg today with an i7-12700KF, a motherboard, and 32GB of RAM. Add another $300 worth of case, power supply, SSD, and more RAM and you're under the price of a MacBook Air. There's your LLM inference machine (not for training, obviously), which can run even 70B models at home at acceptable conversational speed.