
602 points by emrah | 1 comment
holografix No.43743631
Could 16gb vram be enough for the 27b QAT version?
replies(5): >>43743634 #>>43743704 #>>43743825 #>>43744249 #>>43756253 #
parched99 No.43744249
I am only able to get Gemma-3-27b-it-qat-Q4_0.gguf (15.6 GB) to run with a 100-token context size on a 5070 Ti (16 GB) using llama.cpp.

Prompt Tokens: 10 | Time: 229.089 ms | Speed: 43.7 t/s

Generation Tokens: 41 | Time: 959.412 ms | Speed: 42.7 t/s
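
For reference, a minimal sketch of this kind of tight-fit setup through the llama-cpp-python bindings; the filename mirrors the one above, and the full GPU offload and 100-token context are assumptions about the poster's configuration, not a confirmed reproduction:

```python
# Sketch only: load the Q4_0 QAT GGUF with a tiny context so the ~15.6 GB of
# weights plus a minimal KV cache fit inside 16 GB of VRAM.
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="Gemma-3-27b-it-qat-Q4_0.gguf",  # assumed local path; use your own copy
    n_ctx=100,         # 100-token context, matching the numbers above
    n_gpu_layers=-1,   # offload every layer to the GPU
)

out = llm("Summarize quantization-aware training in one sentence.", max_tokens=41)
print(out["choices"][0]["text"])
```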

replies(3): >>43745881 #>>43746002 #>>43747323 #
tbocek No.43746002
This is probably due to this: https://github.com/ggml-org/llama.cpp/issues/12637. That issue is about interleaved sliding window attention (iSWA) not being implemented in llama.cpp for Gemma 3. Adding it could reduce memory requirements a lot; for one scenario mentioned there, memory usage dropped from 62 GB to 10 GB.
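
A back-of-the-envelope sketch of why iSWA makes such a difference: sliding-window layers only cache keys/values for the last W tokens instead of the whole context. The layer count, head configuration, window size, and global/local ratio below are illustrative assumptions, not exact Gemma 3 specifications:

```python
# Rough KV-cache arithmetic. Dimensions are illustrative assumptions, not
# the exact Gemma 3 27B configuration.

def kv_cache_gib(ctx_tokens, n_layers, n_kv_heads=16, head_dim=128, bytes_per_elem=2):
    # 2x for keys and values; fp16 (2 bytes) per element
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem / 2**30

ctx = 128_000                # a long context
layers = 60                  # assumed total layer count
window = 1_024               # assumed sliding-window size
global_layers = layers // 6  # assume ~1 global layer per 6 (illustrative)
local_layers = layers - global_layers

full = kv_cache_gib(ctx, layers)
iswa = kv_cache_gib(ctx, global_layers) + kv_cache_gib(window, local_layers)
print(f"full-attention KV cache: {full:5.1f} GiB")  # ~59 GiB
print(f"iSWA KV cache:           {iswa:5.1f} GiB")  # ~10 GiB
```

With those (assumed) numbers the cache drops from roughly 59 GiB to about 10 GiB, the same order of magnitude as the 62 GB to 10 GB figure in the issue.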
replies(2): >>43746296 #>>43749521 #
parched99 No.43746296
Resolving that issue would help reduce (not eliminate) the memory taken by the context. The model itself will still only just barely fit in 16 GB, which is what the parent comment asked about.

Best to have two or more low-end 16 GB GPUs, for a total of 32 GB of VRAM, to run most of the better local models.
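
If you do go the multi-GPU route, here is a sketch of splitting the weights across two cards with the llama-cpp-python bindings; the even 50/50 tensor_split is an assumption and should be tuned to each card's free VRAM:

```python
# Sketch: spread one model across two 16 GB GPUs so a usable context fits too.
from llama_cpp import Llama

llm = Llama(
    model_path="Gemma-3-27b-it-qat-Q4_0.gguf",  # assumed local path
    n_ctx=8192,               # a more practical context once ~32 GB is available
    n_gpu_layers=-1,          # offload all layers
    tensor_split=[0.5, 0.5],  # fraction of the model on GPU 0 / GPU 1 (assumed even split)
)
```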