602 points | emrah | 1 comment
holografix No.43743631
Could 16GB of VRAM be enough for the 27B QAT version?
replies(5): >>43743634 #>>43743704 #>>43743825 #>>43744249 #>>43756253 #
parched99 No.43744249
I am only able to get the Gemma-3-27b-it-qat-Q4_0.gguf (15.6 GB) to run with a 100-token context size on a 5070 Ti (16 GB) using llama.cpp.

Prompt Tokens: 10
Time: 229.089 ms
Speed: 43.7 t/s

Generation Tokens: 41
Time: 959.412 ms
Speed: 42.7 t/s
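
(For anyone trying to reproduce this, a minimal llama.cpp run along these lines should match the setup above; the flag names assume a current llama-cli build, and the prompt is just a placeholder:)

  # offload all layers to the GPU (-ngl 99) and cap the context at 100 tokens (-c 100)
  llama-cli -m gemma-3-27b-it-qat-Q4_0.gguf -ngl 99 -c 100 -n 64 -p "Why is the sky blue?"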

replies(3): >>43745881 #>>43746002 #>>43747323 #
idonotknowwhy No.43747323
I didn't realise the 5070 is slower than the 3090. Thanks.

If you want a bit more context, try -ctk q8_0 -ctv q8_0 (llama.cpp's --cache-type-k and --cache-type-v flags) to quantize the KV cache.

Also, an imatrix GGUF like IQ4_XS might be smaller with better quality.
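
Putting the cache flags together (a sketch assuming current llama-cli flag names; -c 2048 is only an example of the larger context the q8_0 cache frees up, and quantizing the V cache requires flash attention, hence -fa):

  # a q8_0 K/V cache roughly halves the KV-cache memory footprint vs f16
  llama-cli -m gemma-3-27b-it-qat-Q4_0.gguf -ngl 99 -c 2048 -fa \
      -ctk q8_0 -ctv q8_0 -p "Why is the sky blue?"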

replies(1): >>43747892 #
parched99 No.43747892
I answered the question directly. IQ4_XS is smaller, but slower and less accurate than Q4_0. The parent comment specifically asked about the QAT version, which is literally what this thread is about. The context-length mention was relevant to show that it's only barely usable.