
544 points tosh | 6 comments
simonw ◴[] No.43464243[source]
32B is one of my favourite model sizes at this point - large enough to be extremely capable (generally equivalent to GPT-4 March 2023 level performance, which is when LLMs first got really useful) but small enough that you can run them on a single GPU or a reasonably well-specced Mac laptop (32GB or more).
replies(9): >>43464289 #>>43464380 #>>43464443 #>>43464588 #>>43464688 #>>43467991 #>>43468940 #>>43469099 #>>43470619 #
faizshah ◴[] No.43464688[source]
I just started self-hosting as well on my local machine; I've been using https://lmstudio.ai/ locally for now.

I think the 32B models are actually good enough that I might stop paying for ChatGPT Plus and Claude.

I get around 20 tok/second on my M3, and I can get 100 tok/second on smaller or quantized models. 80-100 tok/second is the sweet spot for interactive usage; above that, you basically can't read as fast as it generates.

I also really like the QwQ reasoning model. I haven't gotten around to trying locally hosted models for agents and RAG yet; coding agents are what I'm most interested in. I feel like 20 tok/second is fine if it's just running in the background.

Anyway, I'd love to know others' experiences; that was mine this weekend. The way it's going, I really don't see a point in paying. I think on-device is the near future, and vendors should just charge a licensing fee, like DB providers do, for enterprise support and updates.

If you were paying $20/mo for ChatGPT a year ago, the 32B models are basically at that level: slightly slower and slightly lower quality, but useful enough to consider cancelling your subscription at this point.

replies(3): >>43464710 #>>43465059 #>>43470007 #
wetwater ◴[] No.43464710[source]
Are there any good sources I can read up on for estimating the hardware specs required to run 7B, 13B, 32B, etc. models locally? I'm a grad student on a budget, but I want to host one of these models and am trying to build a PC that could run it.
replies(6): >>43464785 #>>43464973 #>>43464999 #>>43465270 #>>43465970 #>>43468258 #
coder543 ◴[] No.43464973[source]
"B" just means "billion". A 7B model has 7 billion parameters. Most models are trained in fp16, so each parameter takes two bytes at full precision. Therefore, 7B = 14GB of memory. You can easily quantize models to 8 bits per parameter with very little quality loss, so then 7B = 7GB of memory. With more quality loss (making the model dumber), you can quantize to 4 bits per parameter, so 7B = 3.5GB of memory. There are ways to quantize at other levels too, anywhere from under 2 bits per parameter up to 6 bits per parameter are common.

There is additional memory used for context / KV cache. So, if you use a large context window for a model, you will need to factor in several additional gigabytes for that, but it is much harder to provide a rule of thumb for that overhead. Most of the time, the overhead is significantly less than the size of the model, so not 2x or anything. (The size of the context window is related to the amount of text/images that you can have in a conversation before the LLM begins forgetting the earlier parts of the conversation.)
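
If you do want a number rather than a rule of thumb, the cache grows linearly with context length. Here is a minimal sketch of the usual formula, using illustrative layer/head counts for a 32B-class model with grouped-query attention (assumed values; check the actual model config):

    def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                    head_dim: int, bytes_per_elem: int = 2) -> float:
        """Keys and values stored for every layer at every position (fp16 by default)."""
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
        return context_len * per_token / 1e9

    # Assumed config: 64 layers, 8 KV heads, head_dim 128
    print(f"{kv_cache_gb(8_192, 64, 8, 128):.1f} GB at 8k context")    # ~2.1 GB
    print(f"{kv_cache_gb(32_768, 64, 8, 128):.1f} GB at 32k context")  # ~8.6 GB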

The most important thing for local LLM performance is typically memory bandwidth. This is why GPUs are so much faster for LLM inference than CPUs, since GPU VRAM is many times the speed of CPU RAM. Apple Silicon offers rather decent memory bandwidth, which makes the performance fit somewhere between a typical Intel/AMD CPU and a typical GPU. Apple Silicon is definitely not as fast as a discrete GPU with the same amount of VRAM.
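
A useful back-of-the-envelope consequence: when generating, each new token has to stream roughly all of the weights from memory once, so decode speed is bounded by memory bandwidth divided by model size. A sketch with rough, assumed bandwidth figures (not measurements):

    def rough_decode_tok_s(weights_gb: float, bandwidth_gb_s: float) -> float:
        """Optimistic upper bound: one full pass over the weights per generated token."""
        return bandwidth_gb_s / weights_gb

    # ~20 GB of weights (e.g. a 32B model at 5-bit); bandwidths are rough, assumed figures
    for name, bw in [("dual-channel DDR5 desktop", 90),
                     ("Apple M3 Max",              400),
                     ("RTX 3090-class GPU",        935)]:
        print(f"{name:26s} ~{rough_decode_tok_s(20, bw):4.0f} tok/s")

Real speeds come in below this bound, but the ordering matches what people tend to see in practice.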

That's about all you need to know to get started. There are obviously nuances and exceptions that apply in certain situations.

A 32B model at 5 bits per parameter will comfortably fit onto a 24GB GPU and provide decent speed, as long as the context window isn't set to a huge value.

replies(2): >>43466151 #>>43467190 #
epolanski ◴[] No.43466151[source]
So, in essence, all AMD has to do to launch a successful GPU in the inference space is load it up with RAM?
replies(2): >>43466327 #>>43473969 #
1. TrueDuality ◴[] No.43466327[source]
AMD's limitation is more of a software problem than a hardware problem at this point.
replies(1): >>43467168 #
2. AuryGlenz ◴[] No.43467168[source]
But it's still surprising they haven't. People would be motivated as hell if they launched GPUs with twice the amount of VRAM. It's not as simple as just soldering some more in, but still.
replies(3): >>43467235 #>>43470016 #>>43471640 #
3. wruza ◴[] No.43467235[source]
AMD “just” has to write something like CUDA overnight. Imagine you’re in 1995 and have to ship Kubuntu 24.04 LTS this summer running on your S3 Virge.
replies(1): >>43468279 #
4. mirekrusin ◴[] No.43468279{3}[source]
They don't need to do anything software-wise; inference is a solved problem for AMD.
5. regularfry ◴[] No.43470016[source]
Funnily enough, you can buy GPUs where someone has done exactly that: soldered extra VRAM onto a stock model.
6. thomastjeffery ◴[] No.43471640[source]
They sort of have. I'm using a 7900 XTX, which has 24 GB of VRAM. The nearest competitor would be a 4090, which would cost more than double today; granted, it would also be much faster.

Technically there is also the 3090, which is more comparable price-wise. I don't know about its performance, though.

VRAM is supply-limited enough that going bigger isn't as easy as it sounds. AMD can probably sell as much VRAM as they can get their hands on, so they may as well sell more GPUs, too.