Gemma 3 QAT Models: Bringing AI to Consumer GPUs

(developers.googleblog.com)

602 points emrah | 2 comments | 20 Apr 25 12:22 UTC | HN request time: 0.653s | source

Show context

simonw ◴[20 Apr 25 14:14 UTC] No.43743896[source]▶

I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22Gb (via Ollama) or ~15GB (MLX) leaving plenty of memory for running other apps.

Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/

Last night I had it write me a complete plugin for my LLM tool like this:

  llm install llm-mlx
  llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

  llm -m mlx-community/gemma-3-27b-it-qat-4bit \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers
    issue:org/repo/123 which fetches that issue
        number from the specified github repo and uses the same
        markdown logic as the HTML page to turn that into a
        fragment'

It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/

replies(11): >>43743949 #>>43744205 #>>43744215 #>>43745256 #>>43745751 #>>43746252 #>>43746789 #>>43747326 #>>43747968 #>>43752580 #>>43752951 #

rs186 ◴[20 Apr 25 14:22 UTC] No.43743949[source]▶

>>43743896 #

Can you quote tps?

More and more I start to realize that cost saving is a small problem for local LLMs. If it is too slow, it becomes unusable, so much that you might as well use public LLM endpoints. Unless you really care about getting things done locally without sending information to another server.

With OpenAI API/ChatGPT, I get response much faster than I can read, and for simple question, it means I just need a glimpse of the response, copy & paste and get things done. Whereas on local LLM, I watch it painstakingly prints preambles that I don't care about, and get what I actually need after 20 seconds (on a fast GPU).

And I am not yet talking about context window etc.

I have been researching about how people integrate local LLMs in their workflows. My finding is that most people play with it for a short time and that's about it, and most people are much better off spending money on OpenAI credits (which can last a very long time with typical usage) than getting a beefed up Mac Studio or building a machine with 4090.

replies(11): >>43744051 #>>43744387 #>>43744850 #>>43745587 #>>43745615 #>>43746287 #>>43746724 #>>43747164 #>>43748620 #>>43750648 #>>43758570 #

1. DJHenk ◴[20 Apr 25 18:36 UTC] No.43745587[source]▶

>>43743949 #

> More and more I start to realize that cost saving is a small problem for local LLMs. If it is too slow, it becomes unusable, so much that you might as well use public LLM endpoints. Unless you really care about getting things done locally without sending information to another server.

There is another aspect to consider, aside from privacy.

These models are trained by downloading every scrap of information from the internet, including the works of many, many authors who have never consented to that. And they for sure are not going to get a share of the profits, if there is every going to be any. If you use a cloud provider, you are basically saying that is all fine. You are happy to pay them, and make yourself dependent on their service, based on work that wasn't theirs to use.

However, if you use a local model, the authors still did not give consent, but one could argue that the company that made the model is at least giving back to the community. They don't get any money out of it, and you are not becoming dependent on their hyper capitalist service. No rent-seeking. The benefits of the work are free to use for everyone. This makes using AI a little more acceptable from a moral standpoint.

replies(1): >>43748924 #

2. ◴[21 Apr 25 06:19 UTC] No.43748924[source]▶

>>43745587 (TP) #

↑