
602 points emrah | 6 comments
simonw ◴[] No.43743896[source]
I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

I've been trying it on an M2 with 64GB of RAM via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.
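
If you want to drive the MLX version from Python directly rather than through the llm plugin, a minimal sketch looks something like this (assuming mlx-lm is installed; the prompt is just an example):

  from mlx_lm import load, generate

  # Downloads (if needed) and loads the 4-bit QAT model from the MLX community repo
  model, tokenizer = load("mlx-community/gemma-3-27b-it-qat-4bit")

  # Example prompt; max_tokens caps the response length
  print(generate(model, tokenizer, prompt="Explain QAT in one sentence.", max_tokens=100))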

Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/

Last night I had it write me a complete plugin for my LLM tool like this:

  llm install llm-mlx
  llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

  llm -m mlx-community/gemma-3-27b-it-qat-4bit \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers
    issue:org/repo/123 which fetches that issue
        number from the specified github repo and uses the same
        markdown logic as the HTML page to turn that into a
        fragment'
It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
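
For reference, a rough skeleton of what a fragments plugin like that looks like - this is my own sketch of llm's fragment-loader hook and the GitHub issues API, not the code from the gist:

  import json
  import urllib.request

  import llm

  @llm.hookimpl
  def register_fragment_loaders(register):
      # Makes `-f issue:org/repo/123` resolve through issue_loader below
      register("issue", issue_loader)

  def issue_loader(argument: str) -> llm.Fragment:
      # argument is e.g. "org/repo/123"
      org_repo, number = argument.rsplit("/", 1)
      url = f"https://api.github.com/repos/{org_repo}/issues/{number}"
      with urllib.request.urlopen(url) as response:
          data = json.load(response)
      markdown = f"# {data['title']}\n\n{data.get('body') or ''}"
      return llm.Fragment(markdown, source=url)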
replies(11): >>43743949 #>>43744205 #>>43744215 #>>43745256 #>>43745751 #>>43746252 #>>43746789 #>>43747326 #>>43747968 #>>43752580 #>>43752951 #
1. paprots ◴[] No.43745751[source]
The original gemma3:27b also took only 22GB using Ollama on my 64GB MacBook. I'm quite confused that the QAT version takes the same amount. Do you know why? Which model is better, `gemma3:27b` or `gemma3:27b-qat`?
replies(4): >>43745787 #>>43746372 #>>43746439 #>>43747386 #
2. nolist_policy ◴[] No.43745787[source]
I suspect your "original gemma3:27b" was already a quantized model, since the non-quantized (16-bit) version needs around 54GB.
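
Rough arithmetic (mine, ignoring KV cache and runtime overhead, which is why Ollama reports ~22GB rather than the bare weight size):

  params = 27e9
  print(f"bf16  : {params * 2 / 1e9:.0f} GB")    # 2 bytes/weight   -> ~54 GB
  print(f"4-bit : {params * 0.5 / 1e9:.0f} GB")  # 0.5 bytes/weight -> ~14 GB of weights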
3. kgwgk ◴[] No.43746372[source]
Look up 27b in https://ollama.com/library/gemma3/tags

You'll find the ID a418f5838eaf, which also corresponds to 27b-it-q4_K_M.
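
You can also check this programmatically; a sketch using the ollama Python client (the exact shape of the response is an assumption on my part):

  import ollama

  # Prints model metadata, including the quantization level (e.g. Q4_K_M)
  print(ollama.show("gemma3:27b"))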

replies(1): >>43838949 #
4. superkuh ◴[] No.43746439[source]
Quantization-aware training just means having the model deal with quantized values during training, so that it handles quantization better when it is actually quantized after training. It doesn't change the model size itself.
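
A toy illustration of the "fake quantization" idea (my sketch, not Google's actual recipe): the forward pass sees 4-bit-rounded weights while gradients still flow to the full-precision copy.

  import torch

  def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
      qmax = 2 ** (bits - 1) - 1
      scale = w.abs().max() / qmax
      w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
      # Straight-through estimator: quantized values in the forward pass,
      # identity gradient back to the full-precision weights
      return w + (w_q - w).detach()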
5. zorgmonkey ◴[] No.43747386[source]
Both versions are quantized and should use the same amount of RAM. The difference with QAT is that the quantization happens during training, which should result in slightly better output (closer to the bf16 weights).
6. carbocation ◴[] No.43838949[source]
Following this comment up as a note-to-self: as kgwgk noted, the default gemma3:27b model has ID a418f5838eaf, which corresponds to 27b-it-q4_K_M. The new gemma3:27b quantization-aware training (QAT) model being discussed is gemma3:27b-it-qat, with ID 29eb0b9aeda3.