602 points by emrah | 18 comments
1. holografix ◴[] No.43743631[source]
Could 16gb vram be enough for the 27b QAT version?
replies(5): >>43743634 #>>43743704 #>>43743825 #>>43744249 #>>43756253 #
2. halflings ◴[] No.43743634[source]
That's what the chart says, yes: 14.1GB VRAM usage for the 27B model.
replies(1): >>43743678 #
3. erichocean ◴[] No.43743678[source]
That's the VRAM required just to load the model weights.

To actually use a model, you need a context window. Realistically, you'll want a 20GB GPU or larger, depending on how many tokens you need.

replies(1): >>43743834 #
4. jffry ◴[] No.43743704[source]
With `ollama run gemma3:27b-it-qat "What is blue"`, GPU memory usage is just a hair over 20GB, so no, probably not without a nerfed context window
replies(1): >>43743804 #
5. woadwarrior01 ◴[] No.43743804[source]
Indeed, the default context length in ollama is a mere 2048 tokens.
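It can be raised per model; a minimal sketch, assuming ollama's `num_ctx` parameter and a value that still fits in your VRAM (the derived model name below is arbitrary):

    # interactively, inside `ollama run gemma3:27b-it-qat`
    /set parameter num_ctx 8192

    # or bake it into a derived model via a Modelfile
    FROM gemma3:27b-it-qat
    PARAMETER num_ctx 8192
    # then: ollama create gemma3-27b-8k -f Modelfile

Keep in mind that every extra token of context costs KV-cache memory on top of the weights.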
6. hskalin ◴[] No.43743825[source]
With ollama you can offload a few layers to the CPU if they don't fit in VRAM. This will cost some performance of course, but it's much better than the alternative (everything on the CPU).
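A rough sketch of what that looks like, assuming ollama's `num_gpu` parameter (the number of layers kept on the GPU; the right value depends on your card, so treat the number as illustrative):

    # inside `ollama run gemma3:27b-it-qat`
    /set parameter num_gpu 30   # keep ~30 layers on the GPU, the rest on the CPU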
replies(2): >>43744666 #>>43752342 #
7. oezi ◴[] No.43743834{3}[source]
I didn't realize that the context would require so much memory. Is this the KV cache? It would seem like a big advantage if this memory requirement could be reduced.
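(Largely the KV cache, yes. A back-of-the-envelope estimate for a plain transformer, ignoring Gemma 3's sliding-window layers:

    KV cache bytes ≈ 2 (K and V) × n_layers × n_kv_heads × head_dim × context_length × bytes_per_element

so it grows linearly with context length, which is why quantizing the cache or using sliding-window attention helps so much.)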
8. parched99 ◴[] No.43744249[source]
I am only able to get Gemma-3-27b-it-qat-Q4_0.gguf (15.6GB) to run with a 100-token context size on a 5070 Ti (16GB) using llama.cpp.
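For anyone trying to reproduce this, an illustrative llama.cpp invocation for that kind of run (all layers offloaded, context pinned to 100 tokens; exact flags may differ by build) would be roughly:

    llama-cli -m Gemma-3-27b-it-qat-Q4_0.gguf -ngl 99 -c 100 -p "Hello"

where -ngl is the number of layers to offload to the GPU and -c is the context size.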

Prompt Tokens: 10 | Time: 229.089 ms | Speed: 43.7 t/s
Generation Tokens: 41 | Time: 959.412 ms | Speed: 42.7 t/s

replies(3): >>43745881 #>>43746002 #>>43747323 #
9. senko ◴[] No.43744666[source]
I'm doing that with a 12GB card; ollama supports it out of the box.

For some reason it only uses around 7GB of VRAM, probably due to how the layers are scheduled. Maybe I could tweak something there, but I didn't bother just for testing.

Obviously, perf depends on CPU, GPU and RAM, but on my machine (3060 + i5-13500) it's around 2 t/s.

10. floridianfisher ◴[] No.43745881[source]
Try one of the smaller versions. 27B is too big for your GPU.
replies(1): >>43746177 #
11. tbocek ◴[] No.43746002[source]
This is probably due to this: https://github.com/ggml-org/llama.cpp/issues/12637. The GitHub issue is about interleaved sliding window attention (iSWA) not being available in llama.cpp for Gemma 3. Supporting it could reduce the memory requirements a lot; they mention one scenario going from 62GB to 10GB.
replies(2): >>43746296 #>>43749521 #
12. parched99 ◴[] No.43746177{3}[source]
I'm aware. I was addressing the question being asked.
13. parched99 ◴[] No.43746296{3}[source]
Resolving that issue would help reduce (not eliminate) the size of the context. The model will still only just barely fit in 16GB, which is what the parent comment asked about.

Best to have two or more low-end 16GB GPUs, for a total of 32GB+ VRAM, to run most of the better local models.
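llama.cpp can split a model across cards out of the box; a minimal sketch, assuming two identical 16GB GPUs (see --split-mode / --tensor-split in the help output):

    llama-cli -m Gemma-3-27b-it-qat-Q4_0.gguf -ngl 99 -c 4096 -sm layer -ts 1,1 -p "Hello"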

14. idonotknowwhy ◴[] No.43747323[source]
I didn't realise the 5070 is slower than the 3090. Thanks.

If you want a bit more context, try -ctv q8 -ctk q8 (from memory, so look it up) to quantize the KV cache.

Also, an imatrix GGUF like IQ4_XS might be smaller with better quality.
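For reference, a sketch of how those flags are spelled on a recent llama.cpp build (worth double-checking against --help; quantizing the V cache needs flash attention enabled):

    llama-cli -m Gemma-3-27b-it-qat-Q4_0.gguf -ngl 99 -c 4096 -fa -ctk q8_0 -ctv q8_0 -p "Hello"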

replies(1): >>43747892 #
15. parched99 ◴[] No.43747892{3}[source]
I answered the question directly. IQ4_XS is smaller, but slower and less accurate than Q4_0. The parent comment specifically asked about the QAT version. That's literally what this thread is about. The context-length mention was relevant to show how it's only barely usable.
16. nolist_policy ◴[] No.43749521{3}[source]
Ollama supports iSWA.
17. dockerd ◴[] No.43752342[source]
Does it work on LM Studio? Loading 27b-it-qat takes up more than 22GB on a 24GB Mac.
18. abawany ◴[] No.43756253[source]
I tried the 27b-it-qat model on a 4090m with 16GB VRAM, with mostly default args via llama.cpp, and it didn't fit: it used up the VRAM and tried to use about 2GB of system RAM. Performance in this setup was < 5 t/s.