602 points by emrah | 58 comments

simonw ◴[] No.43743896[source]
I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.

Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/

Last night I had it write me a complete plugin for my LLM tool like this:

  llm install llm-mlx
  llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

  llm -m mlx-community/gemma-3-27b-it-qat-4bit \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers
    issue:org/repo/123 which fetches that issue
        number from the specified github repo and uses the same
        markdown logic as the HTML page to turn that into a
        fragment'
It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
replies(11): >>43743949 #>>43744205 #>>43744215 #>>43745256 #>>43745751 #>>43746252 #>>43746789 #>>43747326 #>>43747968 #>>43752580 #>>43752951 #
1. rs186 ◴[] No.43743949[source]
Can you quote tps?

More and more I'm starting to realize that cost savings are a minor consideration for local LLMs. If it's too slow, it becomes unusable, so much so that you might as well use public LLM endpoints. Unless you really care about getting things done locally without sending information to another server.

With the OpenAI API/ChatGPT, I get responses much faster than I can read, and for simple questions it means I just need a glimpse of the response, copy & paste, and get things done. Whereas with a local LLM, I watch it painstakingly print preambles I don't care about, and get what I actually need after 20 seconds (on a fast GPU).

And I'm not even talking about context windows etc.

I have been researching how people integrate local LLMs into their workflows. My finding is that most people play with them for a short time and that's about it; most people are much better off spending money on OpenAI credits (which can last a very long time with typical usage) than getting a beefed-up Mac Studio or building a machine with a 4090.

replies(11): >>43744051 #>>43744387 #>>43744850 #>>43745587 #>>43745615 #>>43746287 #>>43746724 #>>43747164 #>>43748620 #>>43750648 #>>43758570 #
2. simonw ◴[] No.43744051[source]
My tooling doesn't measure TPS yet. It feels snappy to me on MLX.

I agree that hosted models are usually a better option for most people - much faster, higher quality, handle longer inputs, really cheap.

I enjoy local models for research and for the occasional offline scenario.

I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.

replies(2): >>43744385 #>>43748537 #
3. freeamz ◴[] No.43744385[source]
>I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.

Think it is NOT just you. Most companies with decent management also would not want their data going anywhere outside the physical servers they have control of. But yeah, for most people, just use an app and a hosted server. But this is HN, there are people here hosting their own email servers, so it shouldn't be too hard to run an LLM locally.

replies(2): >>43744513 #>>43751797 #
4. overfeed ◴[] No.43744387[source]
> Whereas with a local LLM, I watch it painstakingly print preambles I don't care about, and get what I actually need after 20 seconds.

You may need to "right-size" the models you use to match your hardware and TPS expectations, which may involve using a smaller version of the model with faster TPS, upgrading your hardware, or paying for hosted models.

Alternatively, if you can use agentic workflows or tools like Aider, you don't have to watch the model work slowly with large models locally. Instead you queue work for it, go to sleep, eat, or do other work, and then much later look over the pull requests whenever it completes them.
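
As a rough illustration of right-sizing (a sketch with the ollama Python client; the smaller QAT tags are my assumption of what's available - substitute whatever you've actually pulled):

  # pip install ollama -- and `ollama pull` each tag first
  import ollama

  PROMPT = "Summarize the trade-offs of quantizing a 27B model to 4 bits."

  for tag in ["gemma3:4b-it-qat", "gemma3:12b-it-qat", "gemma3:27b-it-qat"]:
      resp = ollama.generate(model=tag, prompt=PROMPT)
      # eval_count is decode tokens; eval_duration is in nanoseconds
      tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
      print(f"{tag}: {tps:.1f} tok/s")

Whichever size clears your personal tokens-per-second bar is the one worth keeping around.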

replies(2): >>43746346 #>>43748697 #
5. simonw ◴[] No.43744513{3}[source]
"Most company with decent management also would not want their data going to anything outside the physical server they have in control of."

I don't think that's been true for over a decade: AWS wouldn't be a trillion dollar business if most companies still wanted to stay on-premise.

replies(5): >>43744600 #>>43744716 #>>43747248 #>>43748353 #>>43748456 #
6. terhechte ◴[] No.43744600{4}[source]
Or GitHub. I’m always amused when people don’t want to send fractions of their code to an LLM but happily host it on GitHub. All the big LLM providers offer no-training-on-your-data business plans.
replies(2): >>43744807 #>>43751939 #
7. __float ◴[] No.43744716{4}[source]
While none of that is false, I think there's a big difference between shipping your data to an external LLM API and using AWS.

Using AWS is basically a "physical server they have control of".

replies(2): >>43745030 #>>43745490 #
8. tarruda ◴[] No.43744807{5}[source]
> I’m always amused when people don’t want to send fractions of their code to an LLM but happily host it on GitHub

What amuses me even more is people thinking their code is too unique and precious, and that GitHub/Microsoft wants to steal it.

replies(3): >>43744838 #>>43744843 #>>43746078 #
9. AlexCoventry ◴[] No.43744838{6}[source]
Concern about platform risk in regard to Microsoft is historically justified.
10. Terretta ◴[] No.43744843{6}[source]
Unlikely they think Microsoft or GitHub wants to steal it.

With LLMs, they're thinking of examples that regurgitated proprietary code, and contrary to everyday general observation, valuable proprietary code does exist.

But with GitHub, the thinking is generally the opposite: the worry is that the code is terrible, and seeing it would be like giant blinkenlights* indicating the way in.

* https://en.wikipedia.org/wiki/Blinkenlights

11. otabdeveloper4 ◴[] No.43744850[source]
The only actually useful application of LLMs is processing large amounts of data for classification and/or summarization purposes.

That's not the stuff you want to send to a public API; it's something you want running as a 24/7 local batch job.

("AI assistant" is an evolutionary dead end, and Star Trek be damned.)

12. simonw ◴[] No.43745030{5}[source]
That's why AWS Bedrock and Google Vertex AI and Azure AI model inference exist - they're all hosted LLM services that offer the same compliance guarantees that you get from regular AWS-style hosting agreements.
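
For example, a minimal Bedrock call is just another AWS API request under your existing agreement (a sketch - the model ID and region are placeholders that depend on what your account has enabled):

  # Sketch: prompt a hosted model via AWS Bedrock's Converse API
  import boto3

  bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
  response = bedrock.converse(
      modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
      messages=[{"role": "user", "content": [{"text": "Summarize this clause: ..."}]}],
  )
  print(response["output"]["message"]["content"][0]["text"])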
13. IanCal ◴[] No.43745490{5}[source]
As in AWS is a much bigger security concern?
14. DJHenk ◴[] No.43745587[source]
> More and more I'm starting to realize that cost savings are a minor consideration for local LLMs. If it's too slow, it becomes unusable, so much so that you might as well use public LLM endpoints. Unless you really care about getting things done locally without sending information to another server.

There is another aspect to consider, aside from privacy.

These models are trained by downloading every scrap of information from the internet, including the works of many, many authors who have never consented to that. And they are for sure not going to get a share of the profits, if there are ever going to be any. If you use a cloud provider, you are basically saying that is all fine. You are happy to pay them, and make yourself dependent on their service, based on work that wasn't theirs to use.

However, if you use a local model, the authors still did not give consent, but one could argue that the company that made the model is at least giving back to the community. They don't get any money out of it, and you are not becoming dependent on their hyper capitalist service. No rent-seeking. The benefits of the work are free to use for everyone. This makes using AI a little more acceptable from a moral standpoint.

replies(1): >>43748924 #
15. ein0p ◴[] No.43745615[source]
Sometimes TPS doesn't matter. I've generated textual descriptions for 100K or so images in my photo archive, some of which I have absolutely no interest in uploading to someone else's computer. This works pretty well with Gemma. I use local LLMs all the time for things where privacy is even remotely important. I estimate this constitutes easily a quarter of my LLM usage.
replies(2): >>43745771 #>>43747204 #
16. lodovic ◴[] No.43745771[source]
This is a really cool idea. Do you pretrain the model so it can tag people? I have so many photos that it seems impossible to ever categorize them; using a workflow like yours might help a lot.
replies(1): >>43745820 #
17. ein0p ◴[] No.43745820{3}[source]
No, tagging of people is already handled by another model. Gemma just describes what's in the image, and produces a comma separated list of keywords. No additional training is required besides a few tweaks to the prompt so that it outputs just the description, without any "fluff". E.g. it normally prepends such outputs with "Here's a description of the image:" unless you really insist that it should output only the description. I suppose I could use constrained decoding into JSON or something to achieve the same, but I didn't mess with that.

On some images where Gemma3 struggles Mistral Small produces better descriptions, BTW. But it seems harder to make it follow my instructions exactly.

I'm looking forward to the day when I can also do this with videos, a lot of which I also have no interest in uploading to someone else's computer.
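
If anyone wants to try the same thing, the core of the loop is roughly this (a simplified sketch using the ollama Python client; the prompt, folder, and model tag are approximations, not my exact setup):

  # Sketch: describe every JPEG in a folder with a local Gemma 3 QAT model
  import pathlib
  import ollama

  PROMPT = (
      "Describe this photo in two or three sentences, then give a comma-separated "
      "list of keywords. Output only the description and keywords, no preamble."
  )

  for path in sorted(pathlib.Path("photos").glob("*.jpg")):
      resp = ollama.chat(
          model="gemma3:27b-it-qat",
          # ollama also accepts format="json" if you'd rather constrain the output
          messages=[{"role": "user", "content": PROMPT, "images": [str(path)]}],
      )
      text = resp["message"]["content"].strip()
      path.with_suffix(".txt").write_text(text)  # plain-text sidecar next to the image
      print(path.name, "->", text[:80])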

replies(3): >>43746325 #>>43748621 #>>43748921 #
18. vikarti ◴[] No.43746078{6}[source]
Regulations sometimes matter. Stupid "security" rules sometimes matter too.
19. k__ ◴[] No.43746287[source]
The local LLM is your project manager, the big remote ones are the engineers and designers :D
20. fer ◴[] No.43746325{4}[source]
How do you use the keywords afterwards? I have Immich running, which does some analysis, but the querying is a bit hit and miss.
replies(1): >>43746405 #
21. rs186 ◴[] No.43746346[source]
I have a 4070 Super for gaming, and have used it to play with LLMs a few times. It is by no means a bad card, but I realize that unless I want to get a 4090 or a new Mac that I don't have any other use for, I can only use it to run smaller models. However, most smaller models aren't satisfactory and are still slower than hosted LLMs. I haven't found a model that I am happy with for my hardware.

Regarding agentic workflows -- they sound nice, but I am too scared to try them out, based on my experience with standard LLMs like GPT or Claude for writing code. Small snippets or filling in missing unit tests, fine; anything more complicated has been a disaster for me.

replies(1): >>43748998 #
22. ein0p ◴[] No.43746405{5}[source]
Search is indeed hit and miss. Immich, for instance, currently does absolutely nothing with the EXIF "description" field, so I store textual descriptions on the side as well. I have found Immich's search by image embeddings to be pretty weak at recall, and even weaker at ranking. IIRC Lightroom Classic (which I also use, but haven't found a way to automate this for without writing an extension) does search that field, but ranking is a bit of a dumpster fire, so your best bet is searching uncommon terms or constraining search by metadata (e.g. not just "black kitten" but "black kitten AND 2025"). I expect this to improve significantly over time - it's a fairly obvious thing to add given the available tech.
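
For what it's worth, pushing those sidecar descriptions into the files themselves is easy if you have exiftool installed (a sketch; which tags Immich or Lightroom will actually index is my guess, not a guarantee):

  # Sketch: copy each .txt sidecar into the image's EXIF/XMP description fields
  import pathlib
  import subprocess

  for txt in pathlib.Path("photos").glob("*.txt"):
      image = txt.with_suffix(".jpg")
      if not image.exists():
          continue
      caption = txt.read_text().strip()
      subprocess.run(
          ["exiftool", "-overwrite_original",
           f"-ImageDescription={caption}",    # EXIF description
           f"-XMP-dc:Description={caption}",  # XMP Dublin Core description
           str(image)],
          check=True,
      )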
23. trees101 ◴[] No.43746724[source]
Not sure how accurate my stats are. I used ollama with the --verbose flag. Using a 4090 and all default settings, I get about 40 TPS for the Gemma 3 27B model.

`ollama run gemma3:27b --verbose` gives me 42.5 TPS +-0.3TPS

`ollama run gemma3:27b-it-qat --verbose` gives me 41.5 TPS +-0.3TPS

Strange results; the full model gives me slightly more TPS.

replies(1): >>43747364 #
24. starik36 ◴[] No.43747164[source]
On an A5000 with 24GB, this model typically gets between 20 and 25 tps.
25. starik36 ◴[] No.43747204[source]
I was thinking of doing the same, but I would like to include people's names in the description. For example, "Jennifer looking out in the desert sky."

As it stands, Gemma will just say "Woman looking out in the desert sky."

replies(1): >>43749446 #
26. ipdashc ◴[] No.43747248{4}[source]
Yeah, this has been confusing me a bit. I'm not complaining by ANY means, but why does it suddenly feel like everyone cares about data privacy in LLM contexts, way more than previous attitudes to allowing data to sit on a bunch of random SaaS products?

I assume it's because of the assumption that the AI companies will train on your data, causing it to leak? But I thought all these services had enterprise tiers where they promise not to do that?

Again, I'm not complaining, it's good to see people caring about where their data goes. Just interesting that they care now, but not before. (In some ways LLMs should be one of the safer services, since they don't even really need to store any data, they can delete it after the query or conversation is over.)

replies(2): >>43747566 #>>43749913 #
27. orangecat ◴[] No.43747364[source]
ollama's `gemma3:27b` is also 4-bit quantized; you need `27b-it-q8_0` for 8-bit or `27b-it-fp16` for FP16. See https://ollama.com/library/gemma3/tags.
28. pornel ◴[] No.43747566{5}[source]
It is due to the risk of a leak.

Laundering of data through training makes it a more complicated case than a simple data theft or copyright infringement.

Leaks could be accidental, e.g. due to an employee logging in to their free-as-in-labor personal account instead of a no-training Enterprise account. It's safer to have a complete ban on providers that may collect data for training.

replies(2): >>43747740 #>>43748126 #
29. 6510 ◴[] No.43747740{6}[source]
Their entire business model is based on taking other people's stuff. I can't imagine someone would willingly drown with the sinking ship if the entire cargo is filled with lifeboats - just because they promised they would.
30. vbezhenar ◴[] No.43748126{6}[source]
How can you be sure that AWS will not use your data to train their models? They have enormous amounts of data, probably the most data in the world.
replies(1): >>43748498 #
31. mjlee ◴[] No.43748353{4}[source]
AWS has a strong track record, a clear business model that isn’t predicated on gathering as much data as possible, and an awful lot to lose if they break their promises.

Lots of AI companies have some of these, but not to the same extent.

32. Tepix ◴[] No.43748456{4}[source]
on-premises

https://twominenglish.com/premise-vs-premises/

33. simonw ◴[] No.43748498{7}[source]
Being caught doing that would be wildly harmful to their business - billions of dollars harmful, especially given the contracts they sign with their customers. The brand damage would be unimaginably expensive too.

There is no world in which training on customer data without permission would be worth it for AWS.

Your data really isn't that useful anyway.

replies(1): >>43756131 #
34. triyambakam ◴[] No.43748537[source]
> specifically for dealing with extremely sensitive data like leaked information from confidential sources.

Can you explain this further? It seems in contrast to your previous comment about trusting Anthropic with your data

replies(1): >>43748676 #
35. a_e_k ◴[] No.43748620[source]
I'm seeing ~38-42 tps on a 4090 in a fresh build of llama.cpp under Fedora 42 on my personal machine.

(-t 32 -ngl 100 -c 8192 -fa -ctk q8_0 -ctv q8_0 -m models/gemma-3-27b-it-qat-q4_0.gguf)

36. ethersteeds ◴[] No.43748621{4}[source]
> No, tagging of people is already handled by another model.

As an aside, what model/tools do you prefer for tagging people?

37. simonw ◴[] No.43748676{3}[source]
I trust Anthropic not to train on my data.

If they get hit by a government subpoena because a journalist has been using them to analyze leaked corporate or government secret files, I also trust them to honor that subpoena.

Sometimes journalists deal with material that they cannot risk leaving their own machine.

"News is what somebody somewhere wants to suppress"

38. adastra22 ◴[] No.43748697[source]
I have never found any agent able to put together sensible pull requests without constant hand holding. I shudder to think of what those repositories must look like.
39. mentalgear ◴[] No.43748921{4}[source]
Since you already seem to have done some impressive work on this for your personal use, would you mind open sourcing it?
40. ◴[] No.43748924[source]
41. taneq ◴[] No.43748998{3}[source]
As I understand it, these models are limited by GPU memory far more than by GPU compute. You'd be better off with dual 4070s than with a single 4090, unless the 4090 has more RAM than the other two combined.
42. ein0p ◴[] No.43749446{3}[source]
Most search rankers do not consider word order, so if you could also append the person's name at the end of the text description, it'd probably work well enough for retrieval and ranking at least.

If you want natural language to resolve the names, that'd at a minimum require bounding boxes of the faces and their corresponding names. It'd also require either preprocessing, or specialized training, or both. To my knowledge no locally-hostable model as of today has that. I don't know if any proprietary models can do this either, but it's certainly worth a try - they might just do it. The vast majority of the things they can do is emergent, meaning they were never specifically trained to do them.

43. freeamz ◴[] No.43749913{5}[source]
In Scandinavia, finance-related servers must be in the country! That always sounded like a sane approach. The whole "put your data on SaaS or AWS" thing just seems like the same "let's shift the responsibility to a big player".

Any important data should NOT be on devices that are NOT physically within our jurisdiction.

44. pantulis ◴[] No.43750648[source]
> Can you quote tps?

With LM Studio running on a Mac Studio M4 Max with 128GB, using gemma-3-27B-it-QAT-Q4_0.gguf with a 4096-token context, I get 8.89 tps.

replies(3): >>43750755 #>>43756621 #>>43805414 #
45. jychang ◴[] No.43750755[source]
That's pretty terrible. I'm getting 18 tok/sec with Gemma 3 27B QAT on an M1 Max 32GB MacBook.
replies(2): >>43750823 #>>43805417 #
46. pantulis ◴[] No.43750823{3}[source]
Yeah, I know. Not sure if this is due to something in LM Studio or whatever.
47. belter ◴[] No.43751797{3}[source]
> "Most company with decent management also would not want their data going to anything outside the physical server they have in control of."

Most companies' physical and digital security controls are so much worse than anything from AWS or Google. Note I don't include Azure... but "a physical server they have control of" is a phrase that screams vulnerability.

48. mdp2021 ◴[] No.43756131{8}[source]
> Your data really isn't that useful anyway

? One single random document, maybe, but as an aggregate, I understood some parties were trying to scrape indiscriminately - the "big data" way. And if some of that input is sensitive, and is stored somewhere in the NN, it may come out in an output - in theory...

Actually I never researched the details of the potential phenomenon - that anything personal may be stored (not just George III but Random Randy) - but it seems possible.

replies(1): >>43756371 #
49. simonw ◴[] No.43756371{9}[source]
There's a pretty common misconception that training LLMs is about loading in as much data as possible no matter the source.

That might have been true a few years ago but today the top AI labs are all focusing on quality: they're trying to find the best possible sources of high quality tokens, not randomly dumping in anything they can obtain.

Andrej Karpathy said this last year: https://twitter.com/karpathy/status/1797313173449764933

> Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.

replies(1): >>43757227 #
50. kristianp ◴[] No.43756621[source]
Is QAT a different quantisation format to Q4_0? Can you try "gemma-3-27b-it-qat" for a model: https://lmstudio.ai/model/gemma-3-27b-it-qat
replies(1): >>43761457 #
51. mdp2021 ◴[] No.43757227{10}[source]
Obviously the training data should preferably be high quality - but there you have a (pseudo-)problem with "copyright" (I have insisted elsewhere, citing the right to have read whatever is in any public library).

If there is some advantage to quantity though, then achieving high quality imposes questions about trade-offs and workflows - sources where authors are "free participants" could let odd data seep in.

And the matter of whether such data may be reflected in outputs remains a question (probably tackled by work I have not read... Ars longa, vita brevis).

52. jonaustin ◴[] No.43758570[source]
On a M4 Max 128GB via LM Studio:

query: "make me a snake game in python with pygame"

(mlx 4-bit quant) mlx-community/gemma-3-27b-it-qat@4bit: 26.39 tok/sec • 1681 tokens • 0.63s to first token

(gguf 4-bit quant) lmstudio-community/gemma-3-27b-it-qat: 22.72 tok/sec • 1866 tokens • 0.49s to first token

using Unsloth's settings: https://docs.unsloth.ai/basics/tutorial-how-to-run-and-fine-...

replies(1): >>43765225 #
53. pantulis ◴[] No.43761457{3}[source]
Thanks for your suggestion!

I think it was just the filename. I tried the model you suggested by opening it in LM Studio and I keep getting 8.3 tps.

replies(1): >>43761485 #
54. pantulis ◴[] No.43761485{4}[source]
Now this is raising my curiosity: is there anything else I could try to tweak to achieve better tps? Could it be related to using GGUF instead of MLX?
55. yencabulator ◴[] No.43765225[source]
I genuinely would have expected a $3,500+ setup to do better than just 10x pure-CPU on an AMD Ryzen 9 8945HS.
replies(1): >>43801505 #
56. jonaustin ◴[] No.43801505{3}[source]
Find another laptop that does that well.
57. pantulis ◴[] No.43805414[source]
Gah, turns out I was running the Mac in low power mode!

I get 24 tps in LM Studio now with gemma-3-27b-it-qat.

58. pantulis ◴[] No.43805417{3}[source]
I was running the Mac in low power mode!!! Getting 24 tps now.