
602 points by emrah | 7 comments
simonw No.43743896
I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (via MLX), leaving plenty of memory for running other apps.
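
If you want to try the Ollama route it's just a pull and a run - I'm going from memory on the exact tag name, so check the Ollama model library if it doesn't resolve:

  ollama pull gemma3:27b-it-qat
  ollama run gemma3:27b-it-qat "Explain QAT quantization in two sentences"

The MLX route via my llm tool is shown below.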

Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/

Last night I had it write me a complete plugin for my LLM tool like this:

  llm install llm-mlx
  llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

  llm -m mlx-community/gemma-3-27b-it-qat-4bit \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers
       issue:org/repo/123 which fetches that issue number from
       the specified github repo and uses the same markdown
       logic as the HTML page to turn that into a fragment'
It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
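
Assuming the plugin it wrote works as specified, using it should look something like this (the repo and issue number here are made-up placeholders):

  llm -m mlx-community/gemma-3-27b-it-qat-4bit \
    -f issue:simonw/llm/123 \
    'Summarize this issue and suggest next steps'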
tomrod No.43744215
Simon, what is your local GPU setup? (No doubt you've covered this, but I'm not sure where to dig it up.)
simonw No.43744258
MacBook Pro M2 with 64GB of RAM. That's why I tend to be limited to Ollama and MLX - stuff that requires NVIDIA doesn't work for me locally.
1. Elucalidavah No.43744971
> MacBook Pro M2 with 64GB of RAM

Are there non-mac options with similar capabilities?

2. simonw No.43745043
Yes, but I don't really know anything about those. https://www.reddit.com/r/LocalLLaMA/ is full of people running models on PCs with NVIDIA cards.

The unique benefit of an Apple Silicon Mac at the moment is that the 64GB of RAM is available to both the GPU and the CPU at once. With other hardware you usually need separate, dedicated VRAM for the GPU.
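
You can see that in practice on macOS: the GPU is allowed to wire up most of system RAM, and on recent macOS versions there's a sysctl to raise that ceiling. The exact knob name and default limit vary by OS version, so treat this as illustrative:

  # total unified memory installed
  sysctl hw.memsize

  # allow the GPU to wire up to ~56GB of it (value in MB, resets on reboot)
  sudo sysctl iogpu.wired_limit_mb=57344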

3. _neil No.43745718
It’s not out yet, but the upcoming Framework desktop [0] is supposed to have a similar unified memory setup.

[0] https://frame.work/desktop

4. dwood_dev No.43746431
Anything with the Radeon 8060S/Ryzen AI Max+ 395. One of the popular Chinese mini PC brands has them up for preorder[0], with shipping starting May 7th. Framework has them too, but won't ship until Q3.

0: https://www.gmktec.com/products/prepaid-deposit-amd-ryzen™-a...

5. danans No.43747338
Nvidia Jetson AGX Orin, if a desktop form factor works for you.
6. chpatrick No.43747789
I remember seeing a post about someone running the full-size DeepSeek model on a dual-Xeon server with a ton of RAM.
7. chpatrick No.43747791
Personally, I've never been able to get ROCm working reliably.