
602 points by emrah | 4 comments
simonw No.43743896
I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.

Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/
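
If you'd rather drive the same MLX 4-bit checkpoint from Python instead of going through the llm CLI, here's a minimal sketch using the mlx-lm package (the calls follow mlx-lm's documented API, but treat the exact arguments as an assumption):

  # Minimal sketch: load the 4-bit MLX checkpoint with mlx-lm and generate a
  # short completion. Assumes `pip install mlx-lm` on Apple silicon.
  from mlx_lm import load, generate

  model, tokenizer = load("mlx-community/gemma-3-27b-it-qat-4bit")
  print(generate(model, tokenizer, prompt="Explain QAT in one sentence.",
                 max_tokens=100))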

Last night I had it write me a complete plugin for my LLM tool like this:

  llm install llm-mlx
  llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

  llm -m mlx-community/gemma-3-27b-it-qat-4bit \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers
    issue:org/repo/123 which fetches that issue
        number from the specified github repo and uses the same
        markdown logic as the HTML page to turn that into a
        fragment'
It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
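
For a sense of the shape such a fragments plugin takes, here is a minimal sketch, assuming llm 0.24+'s register_fragment_loaders hook and llm.Fragment class. It is illustrative only; the actual generated code is in the gist.

  # Illustrative sketch, not the generated plugin (that's in the gist above).
  # Assumes llm 0.24+'s register_fragment_loaders hook and llm.Fragment.
  import json
  import urllib.request

  import llm


  @llm.hookimpl
  def register_fragment_loaders(register):
      register("issue", issue_loader)


  def issue_loader(argument):
      # issue:org/repo/123 -> fetch that GitHub issue and return it as Markdown
      org, repo, number = argument.split("/")
      url = f"https://api.github.com/repos/{org}/{repo}/issues/{number}"
      with urllib.request.urlopen(url) as response:
          issue = json.load(response)
      markdown = f"# {issue['title']}\n\n{issue['body'] or ''}"
      return llm.Fragment(markdown, source=url)
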
1. prvc No.43746252
> ~15GB (MLX) leaving plenty of memory for running other apps.

Is that small enough to run well (without thrashing) on a system with only 16GiB RAM?

2. simonw No.43746578
I expect not. On my Mac at least I've found I need a bunch of GB free to have anything else running at all.
3. mnoronha No.43746663
Any idea why MLX and ollama use such different amounts of RAM?
4. jychang No.43749263
I don't think ollama is quantizing the embeddings table, which is still full FP16.
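
Back-of-envelope, using rough Gemma 3 27B dimensions (about a 262k-token vocabulary and a 5376-wide hidden dimension; treat both as approximate), an unquantized embeddings table accounts for a couple of GB of that gap:

  # Rough size of the Gemma 3 27B embeddings table (approximate dimensions).
  vocab_size = 262_144
  hidden_size = 5_376
  params = vocab_size * hidden_size          # ~1.41B weights

  fp16_gb = params * 2 / 1024**3             # 2 bytes per weight
  int4_gb = params * 0.5 / 1024**3           # ~0.5 bytes per weight at 4-bit

  print(f"FP16: {fp16_gb:.2f} GB, 4-bit: {int4_gb:.2f} GB")
  # -> roughly 2.6 GB vs 0.7 GB: keeping embeddings at FP16 costs ~2 GB extra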

If you're using MLX, that means you're on a Mac, in which case ollama actually isn't your best option. Either use llama.cpp directly if you're a power user, or use LM Studio if you want something a bit better than ollama but more user friendly than llama.cpp. (LM Studio has a GUI and is more user friendly than ollama, but the downside is that it's not as scriptable. You win some, you lose some.)

Don't use MLX: it's currently not as fast or as small as the best GGUFs, and it also tends to be buggier (it currently has some known bugs with Japanese). Download the LM Studio version of the Gemma 3 QAT GGUF quants, which are made by Bartowski. Google directly mentions Bartowski in the blog post linked above (ctrl-F his name), and his quants are currently the best ones to use.

https://huggingface.co/bartowski/google_gemma-3-27b-it-qat-G...
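
If you want the Bartowski quants scriptable from Python rather than through LM Studio's GUI, a minimal sketch with huggingface_hub and the llama-cpp-python bindings; the repo id and filename below are illustrative placeholders, so take the exact names from the link above:

  # Illustrative only: download a Q4_0 QAT GGUF and load it with
  # llama-cpp-python. Check the Hugging Face link above for the exact
  # repo id and filename; these are placeholders.
  from huggingface_hub import hf_hub_download
  from llama_cpp import Llama

  model_path = hf_hub_download(
      repo_id="bartowski/google_gemma-3-27b-it-qat-GGUF",   # see link above
      filename="google_gemma-3-27b-it-qat-Q4_0.gguf",       # illustrative
  )

  llm = Llama(model_path=model_path, n_ctx=8192, n_gpu_layers=-1)
  reply = llm.create_chat_completion(
      messages=[{"role": "user", "content": "Say hello in five words."}]
  )
  print(reply["choices"][0]["message"]["content"])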

The "best Gemma 3 27b model to download" crown has taken a very roundabout path. After the initial Google release, it went from Unsloth Q4_K_M, to Google QAT Q4_0, to stduhpf Q4_0_S, to Bartowski Q4_0 now.