simonw No.43743896
I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (via MLX), leaving plenty of memory for running other apps.

Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/

Last night I had it write me a complete plugin for my LLM tool like this:

  llm install llm-mlx
  llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

  llm -m mlx-community/gemma-3-27b-it-qat-4bit \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers
    issue:org/repo/123 which fetches that issue
        number from the specified github repo and uses the same
        markdown logic as the HTML page to turn that into a
        fragment'
It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
replies(11): >>43743949, >>43744205, >>43744215, >>43745256, >>43745751, >>43746252, >>43746789, >>43747326, >>43747968, >>43752580, >>43752951
nico No.43744205
Been super impressed with local models on Mac. Love that the Gemma models have a 128k-token input context size. However, outputs are usually pretty short.

Any tips on generating long output? Like multiple pages of a document, a story, a play or even a book?

replies(3): >>43744252, >>43744469, >>43747471
simonw No.43744252
The tool you are using may set a default maximum output size without you realizing it. Ollama has a num_ctx option that defaults to 2048, for example: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-c...
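
As an illustration of raising that limit, here's a minimal sketch using the ollama Python package (the model tag and token counts are placeholder assumptions, not values from this thread):

  # Minimal sketch using the ollama Python package (pip install ollama).
  # num_ctx is the context window (prompt + output together); num_predict
  # caps how many new tokens may be generated (-1 removes the fixed cap).
  import ollama

  response = ollama.generate(
      model="gemma3:27b-it-qat",  # placeholder tag
      prompt="Write a multi-page story about a lighthouse keeper.",
      options={
          "num_ctx": 16384,       # raise the 2048 default
          "num_predict": 8192,    # allow a long completion
      },
  )
  print(response["response"])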
replies(1): >>43744320
nico No.43744320
Been playing with that, but it doesn't seem to have much effect. It works very well for limiting output to smaller sizes, like setting it to 100-200, but above 2-4k the output never seems to get longer than about one page.

Might try using the models with MLX instead of Ollama to see if that makes a difference.

Any tips on prompting to get longer outputs?

Also, does the model context size determine max output size? Are the two related or are they independent characteristics of the model?
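
For the MLX experiment mentioned above, a rough sketch using the mlx-lm Python package (the package choice and max_tokens value are assumptions, not from this thread; the cap on generated tokens is separate from, though still bounded by, the context window):

  # Rough sketch with mlx-lm (pip install mlx-lm) on Apple Silicon.
  # max_tokens is an explicit cap on generated tokens; it is independent of
  # num_ctx-style settings but still bounded by the model's context window.
  from mlx_lm import load, generate

  model, tokenizer = load("mlx-community/gemma-3-27b-it-qat-4bit")
  prompt = tokenizer.apply_chat_template(
      [{"role": "user", "content": "Write a three-act play, in full."}],
      tokenize=False,
      add_generation_prompt=True,
  )
  text = generate(model, tokenizer, prompt=prompt, max_tokens=8192, verbose=True)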

replies(1): >>43744532
simonw No.43744532
Interestingly the Gemma 3 docs say: https://ai.google.dev/gemma/docs/core/model_card_3#:~:text=T...

> Total output context up to 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size per request, subtracting the request input tokens

I don't know how to get it to output anything that length though.
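
Read literally, that budget is just the total context minus whatever the input used; a toy calculation (numbers are made up):

  # Toy illustration of the quoted Gemma 3 budget: output tokens are
  # whatever remains of the total context after the request input.
  total_context = 128_000   # 4B/12B/27B sizes, per the model card
  input_tokens = 3_500      # hypothetical prompt plus attached files
  max_output = total_context - input_tokens
  print(max_output)         # 124500 tokens theoretically available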

replies(1): >>43744700
nico No.43744700
Thank you for the insights and useful links

Will keep experimenting, will also try mistral3.1

edit: just tried mistral3.1 and the quality of the output is very good, at least compared to the other models I tried (llama2:7b-chat, llama2:latest, gemma3:12b, qwq and deepseek-r1:14b)

From some research it seems that, because of their training sets, most models are not trained to produce long outputs, so even if they technically could, they won't. It might require developing my own training dataset and then doing some fine-tuning. Apparently the models and Ollama have some safeguards against rambling and repetition.

replies(1): >>43746033
Gracana No.43746033
You can probably find some long-form tuned models on HF. I've had decent results with QwQ-32B (which I can run on my desktop) and Mistral Large (which I have to run on my server). Generating and refining an outline before writing the whole piece can help, and you can also split the piece up into multiple outputs (working a paragraph or two at a time, for instance). So far I've found it to be a tough process, with mixed results.
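
For illustration, a sketch of that outline-then-expand loop (the ollama Python package, model tag, and prompts are placeholders rather than anything Gracana specified):

  # Sketch of outline-first, section-by-section long-form generation.
  # Model tag, prompts, and option values are placeholders.
  import ollama

  MODEL = "gemma3:27b-it-qat"
  OPTS = {"num_ctx": 16384, "num_predict": 2048}

  def ask(prompt):
      reply = ollama.chat(model=MODEL,
                          messages=[{"role": "user", "content": prompt}],
                          options=OPTS)
      return reply["message"]["content"]

  # 1. Draft (and optionally refine) an outline in one short call.
  outline = ask("Write a numbered chapter outline for a novella about a lighthouse keeper.")

  # 2. Expand one outline item at a time, feeding the outline back in so
  #    each chapter stays consistent with the overall plan.
  items = [l.strip() for l in outline.splitlines() if l.strip()]
  chapters = [
      ask(f"Outline:\n{outline}\n\nWrite the full text of item {i + 1} "
          f"({item}), several paragraphs long.")
      for i, item in enumerate(items)
  ]
  print("\n\n".join(chapters))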
replies(1): >>43746653
nico No.43746653
Thank you, will try out your suggestions

Have you used something like a director model to supervise the output? If so, could you comment on the effectiveness of it and potentially any tips?

replies(1): >>43747588
Gracana No.43747588
Nope, sounds neat though. There's so much to keep up with in this space.