Gemma 3 QAT Models: Bringing AI to Consumer GPUs

(developers.googleblog.com)

602 points emrah | 1 comments | 20 Apr 25 12:22 UTC


simonw ◴[20 Apr 25 14:14 UTC] No.43743896[source]▶

I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

I've been trying it on an M2 with 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (via MLX), leaving plenty of memory for running other apps.

Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/

Last night I had it write me a complete plugin for my LLM tool like this:

  llm install llm-mlx
  llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

  llm -m mlx-community/gemma-3-27b-it-qat-4bit \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers
    issue:org/repo/123 which fetches that issue
        number from the specified github repo and uses the same
        markdown logic as the HTML page to turn that into a
        fragment'

It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
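
As a rough illustration of what the prompt above asks for (this is not the model's actual output, which is in the linked gist), the argument-parsing half of such a plugin might look like the sketch below. The names `IssueRef`, `parse_issue_argument` and `issue_api_url` are hypothetical, and the sketch assumes GitHub's REST endpoint `/repos/{owner}/{repo}/issues/{number}`. It deliberately leaves out the network fetch, the markdown conversion and the plugin registration.

```python
from typing import NamedTuple


class IssueRef(NamedTuple):
    """Hypothetical container for the three parts of 'org/repo/123'."""
    org: str
    repo: str
    number: int


def parse_issue_argument(argument: str) -> IssueRef:
    """Split 'org/repo/123' into org, repo and an integer issue number."""
    parts = argument.strip().split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected org/repo/number, got {argument!r}")
    org, repo, number = parts
    if not (number.isascii() and number.isdigit()):
        raise ValueError(f"issue number must be an integer, got {number!r}")
    return IssueRef(org, repo, int(number))


def issue_api_url(ref: IssueRef) -> str:
    """Build the GitHub REST API URL for the issue (assumed endpoint shape)."""
    return f"https://api.github.com/repos/{ref.org}/{ref.repo}/issues/{ref.number}"
```

A real plugin would then fetch that URL, convert the issue and its comments to markdown, and hand the result back to LLM as a fragment.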


tomrod ◴[20 Apr 25 15:00 UTC] No.43744215[source]▶

>>43743896 #

Simon, what is your local GPU setup? (No doubt you've covered this, but I'm not sure where to dig it up.)


simonw ◴[20 Apr 25 15:07 UTC] No.43744258[source]▶

>>43744215 #

MacBook Pro M2 with 64GB of RAM. That's why I tend to be limited to Ollama and MLX - stuff that requires NVIDIA doesn't work for me locally.


jychang ◴[21 Apr 25 07:44 UTC] No.43749324[source]▶

>>43744258 #

MLX is slower than GGUFs on Macs.

On my M1 Max MacBook Pro, the GGUF version bartowski/google_gemma-3-27b-it-qat-GGUF is 15.6GB and runs at 17 tok/sec, whereas mlx-community/gemma-3-27b-it-qat-4bit is 16.8GB and runs at 15 tok/sec. Note that both of these are the new QAT 4-bit quants.
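
To put those figures side by side, here is a quick back-of-the-envelope calculation using only the numbers quoted above. It reflects one machine and one measurement, so treat it as illustrative rather than as a benchmark.

```python
# Figures as quoted in the comment above (one M1 Max, one measurement).
gguf_tps, mlx_tps = 17.0, 15.0   # tokens per second
gguf_gb, mlx_gb = 15.6, 16.8     # model size on disk, in GB

# Relative throughput advantage of GGUF over MLX.
speedup_pct = (gguf_tps / mlx_tps - 1) * 100
# Relative size reduction of GGUF compared with MLX.
size_saving_pct = (1 - gguf_gb / mlx_gb) * 100

print(f"GGUF is {speedup_pct:.1f}% faster and {size_saving_pct:.1f}% smaller")
# prints "GGUF is 13.3% faster and 7.1% smaller"
```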


phaedrix ◴[21 Apr 25 15:12 UTC] No.43752864[source]▶

>>43749324 #

No, in general MLX versions are always faster; I've tested most of them.


85392_school ◴[21 Apr 25 15:44 UTC] No.43753241[source]▶

>>43752864 #

What TPS difference are you getting?
