
221 points whitefables | 2 comments
varun_ch | No.41856480
I’m curious about how good the performance with local LLMs is on ‘outdated’ hardware like the author’s 2060. I have a desktop with a 2070 Super that could be fun to turn into an “AI server” if I had the time…
replies(7): >>41856521, >>41856558, >>41856559, >>41856609, >>41856875, >>41856894, >>41857543
1. magicalhippo | No.41856558
I've been playing with some LLMs like Llama 3 and Gemma on my 2080 Ti. If the model fits in GPU memory, the inference speed is quite decent.

However, I've found the quality of smaller models to be quite lacking. Llama 3.2 3B, for example, is much worse than Gemma 2 9B, which is the one I found performs best while still fitting comfortably.

The sentences themselves are fine, but it doesn't follow prompts as well, and it doesn't "understand" the context very well.

Quantization brings down the memory cost, but there seems to be a sharp quality decline below 5 bits for the ones I tried. So a larger but heavily quantized model usually performs worse than a smaller model at higher precision, at least with the models I've tried so far.
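
For a rough sense of the numbers: weight memory is roughly parameter count times bits per weight divided by 8, before KV cache and runtime overhead. A back-of-envelope sketch in Python (parameter counts are approximate, and real GGUF quants mix precisions, so actual file sizes differ a bit):

    def weights_gb(params_billion: float, bits: float) -> float:
        # GB of raw weights: parameters * bits per weight / 8 bits per byte
        return params_billion * bits / 8

    for name, params, bits in [
        ("Llama 3.2 3B @ 6-bit", 3.2, 6),
        ("Gemma 2 9B  @ 6-bit", 9.2, 6),
        ("Gemma 2 9B  @ 4-bit", 9.2, 4),
        ("Gemma 2 27B @ 4-bit", 27.2, 4),
    ]:
        print(f"{name}: ~{weights_gb(params, bits):.1f} GB of weights")

That's why a 6-bit 9B fits on an 11 GB 2080 Ti with room to spare, but not in 6 GB.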

So with only 6 GB of GPU memory, I think you either have to accept the hit on inference speed by only partially offloading to the GPU, or accept fairly low model quality.

Doesn't mean the smaller models can't be useful, but don't expect GPT-4o at home.

That said, if you have a beefy CPU, it can be reasonable to have it handle a few of the layers.
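
For anyone who hasn't done the CPU/GPU split before, here's a minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python, built with GPU support); the model path and layer count are illustrative, not recommendations. The equivalent knob in the llama.cpp CLI is -ngl / --n-gpu-layers.

    # Partial GPU offload: keep some layers on the GPU, run the rest on the CPU.
    from llama_cpp import Llama

    llm = Llama(
        model_path="gemma-2-9b-it-Q6_K_L.gguf",  # any local GGUF file
        n_gpu_layers=20,  # layers offloaded to the GPU; the remainder run on the CPU
        n_ctx=4096,       # context window size
    )

    out = llm("Explain the KV cache in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])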

Personally, I found Gemma 2 9B quantized to 6 bits (IIRC) to be quite useful. YMMV.

replies(1): >>41857653
2. magicalhippo | No.41857653
Yes, gemma-2-9b-it-Q6_K_L is the one that works well for me.

I tried gemma-2-27b-it-Q4_K_L but it's not as good, despite being larger.

Using llama.cpp and models from here[1].

[1]: https://huggingface.co/bartowski
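
If you want that setup scripted end to end, here's a sketch using huggingface_hub plus the llama-cpp-python bindings; the repo id and filename follow bartowski's usual naming for that quant, so treat them as assumptions.

    # Sketch: fetch the quant mentioned above from bartowski's HF page and run it.
    # Repo id and filename are assumed from the naming convention, not verified here.
    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama

    path = hf_hub_download(
        repo_id="bartowski/gemma-2-9b-it-GGUF",
        filename="gemma-2-9b-it-Q6_K_L.gguf",
    )

    llm = Llama(model_path=path, n_gpu_layers=-1)  # -1 offloads all layers if they fit
    resp = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
        max_tokens=64,
    )
    print(resp["choices"][0]["message"]["content"])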