
602 points emrah | 3 comments
simonw ◴[] No.43743896[source]
I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.

Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/

Last night I had it write me a complete plugin for my LLM tool like this:

  llm install llm-mlx
  llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

  llm -m mlx-community/gemma-3-27b-it-qat-4bit \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers
        issue:org/repo/123 which fetches that issue
        number from the specified github repo and uses the same
        markdown logic as the HTML page to turn that into a
        fragment'

It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
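
For a sense of what the model had to produce, here's a minimal sketch of a fragments plugin (not the code from the gist), assuming the register_fragment_loaders plugin hook and llm.Fragment class described in the llm plugin docs; a real plugin would also pull in the issue comments:

  # Rough sketch of an llm fragments plugin - assumes the
  # register_fragment_loaders hook and llm.Fragment class from the
  # llm plugin docs; comments on the issue are deliberately skipped.
  import json
  import urllib.request

  import llm


  @llm.hookimpl
  def register_fragment_loaders(register):
      # Makes -f issue:org/repo/123 available to the llm CLI
      register("issue", github_issue_loader)


  def github_issue_loader(argument: str) -> llm.Fragment:
      # argument is everything after "issue:", e.g. "org/repo/123"
      org, repo, number = argument.split("/")
      url = f"https://api.github.com/repos/{org}/{repo}/issues/{number}"
      with urllib.request.urlopen(url) as response:
          issue = json.load(response)
      markdown = (
          f"# {issue['title']}\n\n"
          f"*Opened by @{issue['user']['login']}*\n\n"
          f"{issue['body'] or ''}"
      )
      return llm.Fragment(markdown, source=issue["html_url"])

(A real plugin also needs the usual pyproject.toml entry-point packaging before llm install can pick it up.)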
replies(11): >>43743949 #>>43744205 #>>43744215 #>>43745256 #>>43745751 #>>43746252 #>>43746789 #>>43747326 #>>43747968 #>>43752580 #>>43752951 #
rs186 ◴[] No.43743949[source]
Can you quote TPS (tokens per second)?

More and more I'm starting to realize that cost saving is only a small part of the local LLM question. If it is too slow it becomes unusable, so much so that you might as well use public LLM endpoints - unless you really care about getting things done locally without sending information to another server.

With the OpenAI API/ChatGPT, I get responses much faster than I can read, and for simple questions that means I just need a glimpse of the response, then copy & paste and get things done. Whereas with a local LLM, I watch it painstakingly print preambles that I don't care about, and only get what I actually need after 20 seconds (on a fast GPU).

And I am not even talking about context window size, etc.

I have been researching how people integrate local LLMs into their workflows. My finding is that most people play with them for a short time and that's about it; most people are much better off spending money on OpenAI credits (which can last a very long time with typical usage) than buying a beefed-up Mac Studio or building a machine with a 4090.

replies(11): >>43744051 #>>43744387 #>>43744850 #>>43745587 #>>43745615 #>>43746287 #>>43746724 #>>43747164 #>>43748620 #>>43750648 #>>43758570 #
simonw ◴[] No.43744051[source]
My tooling doesn't measure TPS yet. It feels snappy to me on MLX.
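
For anyone who wants a rough number from the Ollama side, the final /api/generate response includes eval_count and eval_duration fields, which is enough to estimate it - a quick sketch, with the model tag and prompt as placeholders:

  # Rough tokens/second estimate against a local Ollama server; assumes
  # the /api/generate endpoint and its eval_count / eval_duration fields.
  import json
  import urllib.request

  payload = json.dumps({
      "model": "gemma3:27b-it-qat",  # placeholder tag - use whatever you pulled
      "prompt": "Explain what a fragments plugin is in two sentences.",
      "stream": False,
  }).encode()

  req = urllib.request.Request(
      "http://localhost:11434/api/generate",
      data=payload,
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as response:
      result = json.load(response)

  # eval_count = generated tokens, eval_duration = generation time in nanoseconds
  tps = result["eval_count"] / (result["eval_duration"] / 1e9)
  print(f"{tps:.1f} tokens/second")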

I agree that hosted models are usually a better option for most people - much faster, higher quality, handle longer inputs, really cheap.

I enjoy local models for research and for the occasional offline scenario.

I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.

replies(2): >>43744385 #>>43748537 #
freeamz ◴[] No.43744385[source]
>I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.

I think it is NOT just you. Most companies with decent management also would not want their data going anywhere outside the physical servers they control. But yeah, for most people, just use an app and a hosted server. Then again, this is HN - there are people here hosting their own email servers, so it shouldn't be too hard to run an LLM locally.

replies(2): >>43744513 #>>43751797 #
simonw ◴[] No.43744513[source]
"Most company with decent management also would not want their data going to anything outside the physical server they have in control of."

I don't think that's been true for over a decade: AWS wouldn't be a trillion-dollar business if most companies still wanted to stay on-premises.

replies(5): >>43744600 #>>43744716 #>>43747248 #>>43748353 #>>43748456 #
ipdashc ◴[] No.43747248[source]
Yeah, this has been confusing me a bit. I'm not complaining by ANY means, but why does it suddenly feel like everyone cares about data privacy in LLM contexts, way more than they ever did about letting data sit on a bunch of random SaaS products?

I assume it's because of the expectation that the AI companies will train on your data, causing it to leak? But I thought all these services had enterprise tiers where they promise not to do that?

Again, I'm not complaining - it's good to see people caring about where their data goes. Just interesting that they care now, but didn't before. (In some ways LLMs should be one of the safer services, since they don't really need to store any data; they can delete it after the query or conversation is over.)

replies(2): >>43747566 #>>43749913 #
pornel ◴[] No.43747566[source]
It is due to the risk of a leak.

Laundering data through training makes it a more complicated case than simple data theft or copyright infringement.

Leaks could be accidental, e.g. due to an employee logging in to their free-as-in-labor personal account instead of a no-training Enterprise account. It's safer to have a complete ban on providers that may collect data for training.

replies(2): >>43747740 #>>43748126 #
vbezhenar ◴[] No.43748126[source]
How can you be sure that AWS will not use your data to train their models? They have enormous amounts of data, probably the most data in the world.
replies(1): >>43748498 #
simonw ◴[] No.43748498[source]
Being caught doing that would be wildly harmful to their business - billions of dollars harmful, especially given the contracts they sign with their customers. The brand damage would be unimaginably expensive too.

There is no world in which training on customer data without permission would be worth it for AWS.

Your data really isn't that useful anyway.

replies(1): >>43756131 #
mdp2021 ◴[] No.43756131[source]
> Your data really isn't that useful anyway

One single random document, maybe, but as an aggregate? I understood some parties were trying to scrape indiscriminately - the "big data" way. And if some of that input is sensitive, and is stored somewhere in the NN, it may come out in an output - in theory...

Actually I never researched the details of the potential phenomenon - that anything personal may be stored (not just George III but Random Randy) - but it seems possible.

replies(1): >>43756371 #
simonw ◴[] No.43756371[source]
There's a pretty common misconception that training LLMs is about loading in as much data as possible no matter the source.

That might have been true a few years ago but today the top AI labs are all focusing on quality: they're trying to find the best possible sources of high quality tokens, not randomly dumping in anything they can obtain.

Andrej Karpathy said this last year: https://twitter.com/karpathy/status/1797313173449764933

> Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.

replies(1): >>43757227 #
mdp2021 ◴[] No.43757227[source]
Obviously the training data should preferably be high quality - but there you run into a (pseudo-) problem with "copyright" (pseudo- because, as I have insisted elsewhere, we have the right to have read whatever sits in any public library).

If there is some advantage to quantity though, then achieving high quality raises questions about tradeoffs and workflows - sources where authors are "free participants" could let odd data seep in.

And the question of whether such data may be reflected in outputs remains open (probably tackled by work I have not read... Ars longa, vita brevis).