602 points by emrah

simonw
I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

I've been trying it on an M2 with 64GB of RAM via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.
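
(If you want to try the Ollama side yourself, something like this should work - I believe the 4-bit QAT build is published under the gemma3:27b-it-qat tag, but check the Ollama model library if that doesn't resolve:)

  # tag below is my best guess for the 4-bit QAT build - verify it in the Ollama library
  ollama pull gemma3:27b-it-qat
  # then chat with it interactively
  ollama run gemma3:27b-it-qat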

Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/

Last night I had it write me a complete plugin for my LLM tool like this:

  llm install llm-mlx
  llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

  llm -m mlx-community/gemma-3-27b-it-qat-4bit \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers
        issue:org/repo/123 which fetches that issue
        number from the specified github repo and uses the same
        markdown logic as the HTML page to turn that into a
        fragment'

It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
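
(Assuming the generated plugin works as specified, using it should look roughly like this - the repo and issue number below are placeholders, and the install line assumes the plugin ends up packaged and published as llm-fragments-github:)

  llm install llm-fragments-github
  llm -f issue:simonw/llm/123 'summarize this issue'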
rs186
Can you quote TPS (tokens per second)?

More and more I'm starting to realize that cost saving is a minor consideration for local LLMs. If a model is too slow it becomes unusable, so much so that you might as well use public LLM endpoints - unless you really care about getting things done locally without sending information to another server.

With the OpenAI API/ChatGPT, I get responses much faster than I can read, and for a simple question that means I only need a glance at the response, a copy & paste, and I'm done. Whereas with a local LLM, I watch it painstakingly print preambles I don't care about, and only get what I actually need after 20 seconds (on a fast GPU).

And I'm not even talking about context window size yet.

I have been researching how people integrate local LLMs into their workflows. My finding is that most people play with them for a short time and that's about it; most would be much better off spending money on OpenAI credits (which can last a very long time with typical usage) than buying a beefed-up Mac Studio or building a machine around a 4090.

simonw
My tooling doesn't measure TPS yet. It feels snappy to me on MLX.
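
(If you want a rough number, Ollama can report one - this assumes the same gemma3:27b-it-qat tag as above and that the --verbose flag still prints timing stats, including an eval rate in tokens per second, after each response:)

  # --verbose makes Ollama print timing stats (prompt eval rate, eval rate, etc.)
  ollama run gemma3:27b-it-qat --verbose 'Explain quantization-aware training in one paragraph'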

I agree that hosted models are usually a better option for most people - much faster, higher quality, able to handle longer inputs, and really cheap.

I enjoy local models for research and for the occasional offline scenario.

I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.

freeamz
>I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.

I think it is NOT just you. Most companies with decent management would also not want their data going anywhere outside the physical servers they control. But yeah, for most people, just use an app and a hosted server. But this is HN; there are people here hosting their own email servers, so it shouldn't be too hard to run an LLM locally.

simonw
"Most company with decent management also would not want their data going to anything outside the physical server they have in control of."

I don't think that's been true for over a decade: AWS wouldn't be a trillion-dollar business if most companies still wanted to stay on-premises.

terhechte
Or GitHub. I’m always amused when people don’t want to send fractions of their code to an LLM but happily host it on GitHub. All the big LLM providers offer no-training-on-your-data business plans.
tarruda
> I’m always amused when people don’t want to send fractions of their code to an LLM but happily host it on GitHub

What amuses me even more is people thinking their code is too unique and precious, and that GitHub/Microsoft wants to steal it.

vikarti
Regulations sometimes matter. Stupid "security" rules sometimes matter too.