Gemma 3 QAT Models: Bringing AI to Consumer GPUs

(developers.googleblog.com)

602 points emrah | 1 comments | 20 Apr 25 12:22 UTC | HN request time: 0s | source

Show context

simonw ◴[20 Apr 25 14:14 UTC] No.43743896[source]▶

I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22Gb (via Ollama) or ~15GB (MLX) leaving plenty of memory for running other apps.

Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/

Last night I had it write me a complete plugin for my LLM tool like this:

  llm install llm-mlx
  llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

  llm -m mlx-community/gemma-3-27b-it-qat-4bit \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers
    issue:org/repo/123 which fetches that issue
        number from the specified github repo and uses the same
        markdown logic as the HTML page to turn that into a
        fragment'

It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/

replies(11): >>43743949 #>>43744205 #>>43744215 #>>43745256 #>>43745751 #>>43746252 #>>43746789 #>>43747326 #>>43747968 #>>43752580 #>>43752951 #

rs186 ◴[20 Apr 25 14:22 UTC] No.43743949[source]▶

>>43743896 #

Can you quote tps?

More and more I start to realize that cost saving is a small problem for local LLMs. If it is too slow, it becomes unusable, so much that you might as well use public LLM endpoints. Unless you really care about getting things done locally without sending information to another server.

With OpenAI API/ChatGPT, I get response much faster than I can read, and for simple question, it means I just need a glimpse of the response, copy & paste and get things done. Whereas on local LLM, I watch it painstakingly prints preambles that I don't care about, and get what I actually need after 20 seconds (on a fast GPU).

And I am not yet talking about context window etc.

I have been researching about how people integrate local LLMs in their workflows. My finding is that most people play with it for a short time and that's about it, and most people are much better off spending money on OpenAI credits (which can last a very long time with typical usage) than getting a beefed up Mac Studio or building a machine with 4090.

replies(11): >>43744051 #>>43744387 #>>43744850 #>>43745587 #>>43745615 #>>43746287 #>>43746724 #>>43747164 #>>43748620 #>>43750648 #>>43758570 #

ein0p ◴[20 Apr 25 18:40 UTC] No.43745615[source]▶

>>43743949 #

Sometimes TPS doesn't matter. I've generated textual descriptions for 100K or so images in my photo archive, some of which I have absolutely no interest in uploading to someone else's computer. This works pretty well with Gemma. I use local LLMs all the time for things where privacy is even remotely important. I estimate this constitutes easily a quarter of my LLM usage.

replies(2): >>43745771 #>>43747204 #

lodovic ◴[20 Apr 25 19:03 UTC] No.43745771[source]▶

>>43745615 #

This is a really cool idea. Do you pretrain the model so it can tag people? I have so many photo's that it seems impossible to ever categorize them,using a workflow like yours might help a lot

replies(1): >>43745820 #

ein0p ◴[20 Apr 25 19:10 UTC] No.43745820[source]▶

>>43745771 #

No, tagging of people is already handled by another model. Gemma just describes what's in the image, and produces a comma separated list of keywords. No additional training is required besides a few tweaks to the prompt so that it outputs just the description, without any "fluff". E.g. it normally prepends such outputs with "Here's a description of the image:" unless you really insist that it should output only the description. I suppose I could use constrained decoding into JSON or something to achieve the same, but I didn't mess with that.

On some images where Gemma3 struggles Mistral Small produces better descriptions, BTW. But it seems harder to make it follow my instructions exactly.

I'm looking forward to the day when I can also do this with videos, a lot of which I also have no interest in uploading to someone else's computer.

replies(3): >>43746325 #>>43748621 #>>43748921 #

1. mentalgear ◴[21 Apr 25 06:18 UTC] No.43748921[source]▶

>>43745820 #

Since you already seem to have done some impressive work on this for your personal use, would you mind open sourcing it?

↑