602 points by emrah | 58 comments

simonw ◴[] No.43743896[source]
I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.

Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/

Last night I had it write me a complete plugin for my LLM tool like this:

  llm install llm-mlx
  llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

  llm -m mlx-community/gemma-3-27b-it-qat-4bit \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers
    issue:org/repo/123 which fetches that issue
        number from the specified github repo and uses the same
        markdown logic as the HTML page to turn that into a
        fragment'
It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
replies(11): >>43743949 #>>43744205 #>>43744215 #>>43745256 #>>43745751 #>>43746252 #>>43746789 #>>43747326 #>>43747968 #>>43752580 #>>43752951 #
1. rs186 ◴[] No.43743949[source]
Can you quote tps?

More and more I'm starting to realize that cost savings are a minor consideration for local LLMs. If it's too slow, it becomes unusable, so much so that you might as well use public LLM endpoints. Unless you really care about getting things done locally without sending information to another server.

With the OpenAI API/ChatGPT, I get responses much faster than I can read, and for simple questions it means I just need a glimpse of the response, copy & paste, and get things done. Whereas with a local LLM, I watch it painstakingly print preambles I don't care about, and get what I actually need after 20 seconds (on a fast GPU).

And I'm not even talking about context windows etc.

I have been researching how people integrate local LLMs into their workflows. My finding is that most people play with them for a short time and that's about it; most people are much better off spending money on OpenAI credits (which can last a very long time with typical usage) than getting a beefed-up Mac Studio or building a machine with a 4090.

replies(11): >>43744051 #>>43744387 #>>43744850 #>>43745587 #>>43745615 #>>43746287 #>>43746724 #>>43747164 #>>43748620 #>>43750648 #>>43758570 #
2. simonw ◴[] No.43744051[source]
My tooling doesn't measure TPS yet. It feels snappy to me on MLX.

I agree that hosted models are usually a better option for most people - much faster, higher quality, handle longer inputs, really cheap.

I enjoy local models for research and for the occasional offline scenario.

I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.

replies(2): >>43744385 #>>43748537 #
3. freeamz ◴[] No.43744385[source]
>I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.

Think it is NOT just you. Most companies with decent management also would not want their data going anywhere outside the physical servers they have control of. But yeah, for most people, just use an app and a hosted server. But this is HN, there are people here hosting their own email servers, so it shouldn't be too hard to run an LLM locally.

replies(2): >>43744513 #>>43751797 #
4. overfeed ◴[] No.43744387[source]
> Whereas with a local LLM, I watch it painstakingly print preambles I don't care about, and get what I actually need after 20 seconds.

You may need to "right-size" the models you use to match your hardware and TPS expectations, which may involve using a smaller version of the model with faster TPS, upgrading your hardware, or paying for hosted models.

Alternatively, if you can use agentic workflows or tools like Aider, you don't have to watch the model work slowly with large models locally. Instead you queue work for it, go to sleep, eat, or do other work, and then much later look over the pull requests whenever it completes them.
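
As a rough illustration of right-sizing (a sketch with the ollama Python client; the smaller QAT tags are my assumption of what's available - substitute whatever you've actually pulled):

  # pip install ollama -- and `ollama pull` each tag first
  import ollama

  PROMPT = "Summarize the trade-offs of quantizing a 27B model to 4 bits."

  for tag in ["gemma3:4b-it-qat", "gemma3:12b-it-qat", "gemma3:27b-it-qat"]:
      resp = ollama.generate(model=tag, prompt=PROMPT)
      # eval_count is decode tokens; eval_duration is in nanoseconds
      tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
      print(f"{tag}: {tps:.1f} tok/s")

Whichever size clears your personal tokens-per-second bar is the one worth keeping around.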

replies(2): >>43746346 #>>43748697 #
5. simonw ◴[] No.43744513{3}[source]
"Most company with decent management also would not want their data going to anything outside the physical server they have in control of."

I don't think that's been true for over a decade: AWS wouldn't be a trillion dollar business if most companies still wanted to stay on-premise.

replies(5): >>43744600 #>>43744716 #>>43747248 #>>43748353 #>>43748456 #
6. terhechte ◴[] No.43744600{4}[source]
Or GitHub. I’m always amused when people don’t want to send fractions of their code to an LLM but happily host it on GitHub. All the big LLM providers offer no-training-on-your-data business plans.
replies(2): >>43744807 #>>43751939 #
7. __float ◴[] No.43744716{4}[source]
While none of that is false, I think there's a big difference between shipping your data to an external LLM API and using AWS.

Using AWS is basically a "physical server they have control of".

replies(2): >>43745030 #>>43745490 #
8. tarruda ◴[] No.43744807{5}[source]
> I’m always amused when people don’t want to send fractions of their code to an LLM but happily host it on GitHub

What amuses me even more is people thinking their code is too unique and precious, and that GitHub/Microsoft wants to steal it.

replies(3): >>43744838 #>>43744843 #>>43746078 #
9. AlexCoventry ◴[] No.43744838{6}[source]
Concern about platform risk in regard to Microsoft is historically justified.
10. Terretta ◴[] No.43744843{6}[source]
Unlikely they think Microsoft or GitHub wants to steal it.

With LLMs, they're thinking of examples that regurgitated proprietary code, and contrary to everyday general observation, valuable proprietary code does exist.

But with GitHub, the thinking is generally the opposite: the worry is that the code is terrible, and seeing it would be like giant blinkenlights* indicating the way in.

* https://en.wikipedia.org/wiki/Blinkenlights

11. otabdeveloper4 ◴[] No.43744850[source]
The only actually useful application of LLMs is processing large amounts of data for classification and/or summarization purposes.

That's not the stuff you want to send to a public API; it's something you want running as a 24/7 local batch job.

("AI assistant" is an evolutionary dead end, and Star Trek be damned.)

12. simonw ◴[] No.43745030{5}[source]
That's why AWS Bedrock and Google Vertex AI and Azure AI model inference exist - they're all hosted LLM services that offer the same compliance guarantees that you get from regular AWS-style hosting agreements.
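
For example, a minimal Bedrock call is just another AWS API request under your existing agreement (a sketch - the model ID and region are placeholders that depend on what your account has enabled):

  # Sketch: prompt a hosted model via AWS Bedrock's Converse API
  import boto3

  bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
  response = bedrock.converse(
      modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
      messages=[{"role": "user", "content": [{"text": "Summarize this clause: ..."}]}],
  )
  print(response["output"]["message"]["content"][0]["text"])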
13. IanCal ◴[] No.43745490{5}[source]
As in AWS is a much bigger security concern?
14. DJHenk ◴[] No.43745587[source]
> More and more I'm starting to realize that cost savings are a minor consideration for local LLMs. If it's too slow, it becomes unusable, so much so that you might as well use public LLM endpoints. Unless you really care about getting things done locally without sending information to another server.

There is another aspect to consider, aside from privacy.

These models are trained by downloading every scrap of information from the internet, including the works of many, many authors who have never consented to that. And they are for sure not going to get a share of the profits, if there are ever going to be any. If you use a cloud provider, you are basically saying that is all fine. You are happy to pay them, and make yourself dependent on their service, based on work that wasn't theirs to use.

However, if you use a local model, the authors still did not give consent, but one could argue that the company that made the model is at least giving back to the community. They don't get any money out of it, and you are not becoming dependent on their hyper capitalist service. No rent-seeking. The benefits of the work are free to use for everyone. This makes using AI a little more acceptable from a moral standpoint.

replies(1): >>43748924 #
15. ein0p ◴[] No.43745615[source]
Sometimes TPS doesn't matter. I've generated textual descriptions for 100K or so images in my photo archive, some of which I have absolutely no interest in uploading to someone else's computer. This works pretty well with Gemma. I use local LLMs all the time for things where privacy is even remotely important. I estimate this constitutes easily a quarter of my LLM usage.
replies(2): >>43745771 #>>43747204 #
16. lodovic ◴[] No.43745771[source]
This is a really cool idea. Do you pretrain the model so it can tag people? I have so many photos that it seems impossible to ever categorize them; using a workflow like yours might help a lot.
replies(1): >>43745820 #
17. ein0p ◴[] No.43745820{3}[source]
No, tagging of people is already handled by another model. Gemma just describes what's in the image, and produces a comma separated list of keywords. No additional training is required besides a few tweaks to the prompt so that it outputs just the description, without any "fluff". E.g. it normally prepends such outputs with "Here's a description of the image:" unless you really insist that it should output only the description. I suppose I could use constrained decoding into JSON or something to achieve the same, but I didn't mess with that.

On some images where Gemma3 struggles Mistral Small produces better descriptions, BTW. But it seems harder to make it follow my instructions exactly.

I'm looking forward to the day when I can also do this with videos, a lot of which I also have no interest in uploading to someone else's computer.
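
If anyone wants to try the same thing, the core of the loop is roughly this (a simplified sketch using the ollama Python client; the prompt, folder, and model tag are approximations, not my exact setup):

  # Sketch: describe every JPEG in a folder with a local Gemma 3 QAT model
  import pathlib
  import ollama

  PROMPT = (
      "Describe this photo in two or three sentences, then give a comma-separated "
      "list of keywords. Output only the description and keywords, no preamble."
  )

  for path in sorted(pathlib.Path("photos").glob("*.jpg")):
      resp = ollama.chat(
          model="gemma3:27b-it-qat",
          # ollama also accepts format="json" if you'd rather constrain the output
          messages=[{"role": "user", "content": PROMPT, "images": [str(path)]}],
      )
      text = resp["message"]["content"].strip()
      path.with_suffix(".txt").write_text(text)  # plain-text sidecar next to the image
      print(path.name, "->", text[:80])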

replies(3): >>43746325 #>>43748621 #>>43748921 #
18. vikarti ◴[] No.43746078{6}[source]
Regulations sometimes matter. Stupid "security" rules sometimes matter too.
19. k__ ◴[] No.43746287[source]
The local LLM is your project manager, the big remote ones are the engineers and designers :D
20. fer ◴[] No.43746325{4}[source]
How do you use the keywords afterwards? I have Immich running, which does some analysis, but the querying is a bit hit and miss.
replies(1): >>43746405 #
21. rs186 ◴[] No.43746346[source]
I have a 4070 Super for gaming, and have used it to play with LLMs a few times. It is by no means a bad card, but I realize that unless I want to get a 4090 or a new Mac that I don't have any other use for, I can only use it to run smaller models. However, most smaller models aren't satisfactory and are still slower than hosted LLMs. I haven't found a model that I am happy with for my hardware.

Regarding agentic workflows -- they sound nice, but I am too scared to try them out, based on my experience with standard LLMs like GPT or Claude for writing code. Small snippets or filling in missing unit tests, fine; anything more complicated has been a disaster for me.

replies(1): >>43748998 #
22. ein0p ◴[] No.43746405{5}[source]
Search is indeed hit and miss. Immich, for instance, currently does absolutely nothing with the EXIF "description" field, so I store textual descriptions on the side as well. I have found Immich's search by image embeddings to be pretty weak at recall, and even weaker at ranking. IIRC Lightroom Classic (which I also use, but haven't found a way to automate this for without writing an extension) does search that field, but ranking is a bit of a dumpster fire, so your best bet is searching uncommon terms or constraining search by metadata (e.g. not just "black kitten" but "black kitten AND 2025"). I expect this to improve significantly over time - it's a fairly obvious thing to add given the available tech.
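
For what it's worth, pushing those sidecar descriptions into the files themselves is easy if you have exiftool installed (a sketch; which tags Immich or Lightroom will actually index is my guess, not a guarantee):

  # Sketch: copy each .txt sidecar into the image's EXIF/XMP description fields
  import pathlib
  import subprocess

  for txt in pathlib.Path("photos").glob("*.txt"):
      image = txt.with_suffix(".jpg")
      if not image.exists():
          continue
      caption = txt.read_text().strip()
      subprocess.run(
          ["exiftool", "-overwrite_original",
           f"-ImageDescription={caption}",    # EXIF description
           f"-XMP-dc:Description={caption}",  # XMP Dublin Core description
           str(image)],
          check=True,
      )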
23. trees101 ◴[] No.43746724[source]
Not sure how accurate my stats are. I used ollama with the --verbose flag. Using a 4090 and all default settings, I get about 40 TPS for the Gemma 3 27B model.

`ollama run gemma3:27b --verbose` gives me 42.5 TPS +-0.3TPS

`ollama run gemma3:27b-it-qat --verbose` gives me 41.5 TPS +-0.3TPS

Strange results; the full model gives me slightly more TPS.

replies(1): >>43747364 #
24. starik36 ◴[] No.43747164[source]
On an A5000 with 24GB, this model typically gets between 20 and 25 tps.
25. starik36 ◴[] No.43747204[source]
I was thinking of doing the same, but I would like to include people's names in the description. For example, "Jennifer looking out in the desert sky."

As it stands, Gemma will just say "Woman looking out in the desert sky."

replies(1): >>43749446 #
26. ipdashc ◴[] No.43747248{4}[source]
Yeah, this has been confusing me a bit. I'm not complaining by ANY means, but why does it suddenly feel like everyone cares about data privacy in LLM contexts, way more than previous attitudes to allowing data to sit on a bunch of random SaaS products?

I assume it's because of the assumption that the AI companies will train on your data, causing it to leak? But I thought all these services had enterprise tiers where they promise not to do that?

Again, I'm not complaining, it's good to see people caring about where their data goes. Just interesting that they care now, but not before. (In some ways LLMs should be one of the safer services, since they don't even really need to store any data, they can delete it after the query or conversation is over.)

replies(2): >>43747566 #>>43749913 #
27. orangecat ◴[] No.43747364[source]
ollama's `gemma3:27b` is also 4-bit quantized; you need `27b-it-q8_0` for 8-bit or `27b-it-fp16` for FP16. See https://ollama.com/library/gemma3/tags.
28. pornel ◴[] No.43747566{5}[source]
It is due to the risk of a leak.

Laundering of data through training makes it a more complicated case than a simple data theft or copyright infringement.

Leaks could be accidental, e.g. due to an employee logging in to their free-as-in-labor personal account instead of a no-training Enterprise account. It's safer to have a complete ban on providers that may collect data for training.

replies(2): >>43747740 #>>43748126 #
29. 6510 ◴[] No.43747740{6}[source]
Their entire business model is based on taking other people's stuff. I can't imagine someone would willingly drown with the sinking ship if the entire cargo is filled with lifeboats - just because they promised they would.
30. vbezhenar ◴[] No.43748126{6}[source]
How can you be sure that AWS will not use your data to train their models? They have enormous amounts of data, probably the most data in the world.
replies(1): >>43748498 #
31. mjlee ◴[] No.43748353{4}[source]
AWS has a strong track record, a clear business model that isn’t predicated on gathering as much data as possible, and an awful lot to lose if they break their promises.

Lots of AI companies have some of these, but not to the same extent.

32. Tepix ◴[] No.43748456{4}[source]
on-premises

https://twominenglish.com/premise-vs-premises/

33. simonw ◴[] No.43748498{7}[source]
Being caught doing that would be wildly harmful to their business - billions of dollars harmful, especially given the contracts they sign with their customers. The brand damage would be unimaginably expensive too.

There is no world in which training on customer data without permission would be worth it for AWS.

Your data really isn't that useful anyway.

replies(1): >>43756131 #
34. triyambakam ◴[] No.43748537[source]
> specifically for dealing with extremely sensitive data like leaked information from confidential sources.

Can you explain this further? It seems in contrast to your previous comment about trusting Anthropic with your data

replies(1): >>43748676 #
35. a_e_k ◴[] No.43748620[source]
I'm seeing ~38-42 tps on a 4090 in a fresh build of llama.cpp under Fedora 42 on my personal machine.

(-t 32 -ngl 100 -c 8192 -fa -ctk q8_0 -ctv q8_0 -m models/gemma-3-27b-it-qat-q4_0.gguf)

36. ethersteeds ◴[] No.43748621{4}[source]
> No, tagging of people is already handled by another model.

As an aside, what model/tools do you prefer for tagging people?

37. simonw ◴[] No.43748676{3}[source]
I trust Anthropic not to train on my data.

If they get hit by a government subpoena because a journalist has been using them to analyze leaked corporate or government secret files, I also trust them to honor that subpoena.

Sometimes journalists deal with material that they cannot risk leaving their own machine.

"News is what somebody somewhere wants to suppress"

38. adastra22 ◴[] No.43748697[source]
I have never found any agent able to put together sensible pull requests without constant hand holding. I shudder to think of what those repositories must look like.
39. mentalgear ◴[] No.43748921{4}[source]
Since you already seem to have done some impressive work on this for your personal use, would you mind open sourcing it?
40. ◴[] No.43748924[source]
41. taneq ◴[] No.43748998{3}[source]
As I understand it, these models are limited by GPU memory far more than by GPU compute. You'd be better off with dual 4070s than with a single 4090, unless the 4090 has more RAM than the other two combined.
42. ein0p ◴[] No.43749446{3}[source]
Most search rankers do not consider word order, so if you could also append the person's name at the end of the text description, it'd probably work well enough for retrieval and ranking at least.

If you want natural language to resolve the names, that'd at a minimum require bounding boxes of the faces and their corresponding names. It'd also require either preprocessing, or specialized training, or both. To my knowledge no locally-hostable model as of today has that. I don't know if any proprietary models can do this either, but it's certainly worth a try - they might just do it. The vast majority of the things they can do is emergent, meaning they were never specifically trained to do them.

43. freeamz ◴[] No.43749913{5}[source]
In Scandinavia, finance-related servers must be in the country! That always sounded like a sane approach. The whole "put your data on SaaS or AWS" thing just seems like the same "let's shift the responsibility to a big player".

Any important data should NOT be on devices that are NOT physically within our jurisdiction.

44. pantulis ◴[] No.43750648[source]
> Can you quote tps?

With LM Studio running on a Mac Studio M4 Max with 128GB, using gemma-3-27B-it-QAT-Q4_0.gguf with a 4096-token context, I get 8.89 tps.

replies(3): >>43750755 #>>43756621 #>>43805414 #
45. jychang ◴[] No.43750755[source]
That's pretty terrible. I'm getting 18 tok/sec with Gemma 3 27B QAT on an M1 Max 32GB MacBook.
replies(2): >>43750823 #>>43805417 #
46. pantulis ◴[] No.43750823{3}[source]
Yeah, I know. Not sure if this is due to something in LM Studio or whatever.
47. belter ◴[] No.43751797{3}[source]
> "Most company with decent management also would not want their data going to anything outside the physical server they have in control of."

Most companies' physical and digital security controls are so much worse than anything from AWS or Google. Note I don't include Azure... but "a physical server they have control of" is a phrase that screams vulnerability.

48. mdp2021 ◴[] No.43756131{8}[source]
> Your data really isn't that useful anyway

? One single random document, maybe, but as an aggregate, I understood some parties were trying to scrape indiscriminately - the "big data" way. And if some of that input is sensitive, and is stored somewhere in the NN, it may come out in an output - in theory...

Actually I never researched the details of the potential phenomenon - that anything personal may be stored (not just George III but Random Randy) - but it seems possible.

replies(1): >>43756371 #
49. simonw ◴[] No.43756371{9}[source]
There's a pretty common misconception that training LLMs is about loading in as much data as possible no matter the source.

That might have been true a few years ago but today the top AI labs are all focusing on quality: they're trying to find the best possible sources of high quality tokens, not randomly dumping in anything they can obtain.

Andrej Karpathy said this last year: https://twitter.com/karpathy/status/1797313173449764933

> Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.

replies(1): >>43757227 #
50. kristianp ◴[] No.43756621[source]
Is QAT a different quantisation format to Q4_0? Can you try "gemma-3-27b-it-qat" for a model: https://lmstudio.ai/model/gemma-3-27b-it-qat
replies(1): >>43761457 #
51. mdp2021 ◴[] No.43757227{10}[source]
Obviously the training data should preferably be high quality - but there you have a (pseudo-)problem with "copyright" (I have insisted elsewhere, citing the right to have read whatever is in any public library).

If there is some advantage to quantity though, then achieving high quality imposes questions about trade-offs and workflows - sources where authors are "free participants" could let odd data seep in.

And the matter of whether such data may be reflected in outputs remains a question (probably tackled by work I have not read... Ars longa, vita brevis).

52. jonaustin ◴[] No.43758570[source]
On a M4 Max 128GB via LM Studio:

query: "make me a snake game in python with pygame"

(mlx 4-bit quant) mlx-community/gemma-3-27b-it-qat@4bit: 26.39 tok/sec • 1681 tokens • 0.63s to first token

(gguf 4-bit quant) lmstudio-community/gemma-3-27b-it-qat: 22.72 tok/sec • 1866 tokens • 0.49s to first token

using Unsloth's settings: https://docs.unsloth.ai/basics/tutorial-how-to-run-and-fine-...

replies(1): >>43765225 #
53. pantulis ◴[] No.43761457{3}[source]
Thanks for your suggestion!

I think it was just the filename. I tried the model you suggested by opening it in LM Studio and I keep getting 8.3 tps.

replies(1): >>43761485 #
54. pantulis ◴[] No.43761485{4}[source]
Now this is raising my curiosity: is there anything else I could try to tweak to achieve better tps? Could it be related to using GGUF instead of MLX?
55. yencabulator ◴[] No.43765225[source]
I genuinely would have expected a $3,500+ setup to do better than just 10x pure-CPU on an AMD Ryzen 9 8945HS.
replies(1): >>43801505 #
56. jonaustin ◴[] No.43801505{3}[source]
Find another laptop that does that well.
57. pantulis ◴[] No.43805414[source]
Gah, turns out I was running the Mac in low power mode!

I get 24 tps in LM Studio now with gemma-3-27b-it-qat.

58. pantulis ◴[] No.43805417{3}[source]
I was running the Mac in low power mode!!! Getting 24 tps now.