The community getting obsessed with Ollama has done huge damage to the field, as it's inefficient compared to vLLM. Many people could get far more tok/s than they realize if only they knew about the right tools.
Am I missing something?
These have been out for a while; if you follow the HF link you can see, for example, the 27b quant has been downloaded from HF 64,000 times over the last 10 days.
Is there something more to this, or is it just a follow-up blog post?
(is it just that ollama finally has partial (no images right?) support? Or something else?)
To actually use a model, you need a context window. Realistically, you'll want a 20GB GPU or larger, depending on how many tokens you need.
How is this more significant now than when they were uploaded 2 weeks ago?
Are we expecting new models? I don’t understand the timing. This post feels like it’s two weeks late.
[1] - https://huggingface.co/collections/google/gemma-3-qat-67ee61...
That said, the first graph is misleading about the number of H100s required to run DeepSeek r1 at FP16. The model is FP8.
It is important to know about both to decide between the two for your use case though.
> 17 days ago
Anywaaay...
I'm asking, quite honestly, whether this is just an 'after the fact' update, weeks later, about the models they already uploaded, or whether there is something more significant here that I'm missing.
Since this article announces the optimized Q4 quantized version, it would be great if it included more comparisons (such as benchmark scores) between the new version and the unoptimized Q4 version I currently use.
(I deliberately wrote this reply in Chinese and had gemma-3-27b-it Q4_K_M translate it into English.)
The partnership with Ollama and MLX and LM Studio and llama.cpp was revealed in that announcement, which made the models a lot easier for people to use.
Unfortunately Ollama and vLLM are therefore incomparable at the moment, because vLLM does not support these models yet.
I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.
Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/
Last night I had it write me a complete plugin for my LLM tool like this:
llm install llm-mlx
llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit
llm -m mlx-community/gemma-3-27b-it-qat-4bit \
-f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
-f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
-s 'Write a new fragments plugin in Python that registers
issue:org/repo/123 which fetches that issue
number from the specified github repo and uses the same
markdown logic as the HTML page to turn that into a
fragment'
It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/

More and more I start to realize that cost saving is a small problem for local LLMs. If it is too slow, it becomes unusable, so much so that you might as well use public LLM endpoints. Unless you really care about getting things done locally without sending information to another server.
With the OpenAI API/ChatGPT, I get responses much faster than I can read, and for simple questions, it means I just need a glimpse of the response, copy & paste, and get things done. Whereas on a local LLM, I watch it painstakingly print preambles that I don't care about, and get what I actually need after 20 seconds (on a fast GPU).
And I am not yet talking about context window etc.
I have been researching how people integrate local LLMs into their workflows. My finding is that most people play with them for a short time and that's about it, and most people are much better off spending money on OpenAI credits (which can last a very long time with typical usage) than getting a beefed-up Mac Studio or building a machine with a 4090.
I agree that hosted models are usually a better option for most people - much faster, higher quality, handle longer inputs, really cheap.
I enjoy local models for research and for the occasional offline scenario.
I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.
./build/bin/llama-gemma3-cli -m /discs/fast/ai/gemma-3-27b-it-q4_0.gguf --mmproj /discs/fast/ai/mmproj-model-f16-27B.gguf -p "Describe this image." --image ~/Downloads/surprise.png
Note the 2nd gguf in there - I'm not sure, but I think that's for encoding the image.
Any tips on generating long output? Like multiple pages of a document, a story, a play or even a book?
Prompt: 10 tokens, 229.089 ms (43.7 t/s)
Generation: 41 tokens, 959.412 ms (42.7 t/s)
That said, if you really care, it generates faster than reading speed (on an A18-based device, at least).
For example, if I ask mistral small who I am by name, it will say there is no known notable figure by that name before the knowledge cutoff. Gemma 3 will say I am a well known <random profession> and make up facts. On the other hand, I have asked both about local organization in my area that I am involved with, and Gemma 3 could produce useful and factual information, where Mistral Small said it did not know.
Last time we only released the quantized GGUFs, so only llama.cpp users could use them (+ Ollama, but without vision).
Now we've released the unquantized checkpoints, so anyone can quantize them and use them in their favorite tools, including Ollama with vision, MLX, LM Studio, etc. The MLX folks also found that the QAT model quantized down to 3 bits held up decently compared to a naive 3-bit quantization, so by releasing the unquantized checkpoints we allow further experimentation and research.
TL;DR: one was a release in a specific format/tool; we followed up with a full release of artifacts that enables the community to do much more.
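For anyone who wants to do their own quantization from those checkpoints, a rough sketch with llama.cpp's tooling (paths and output filenames here are placeholders; double-check the script name against your checkout):

python convert_hf_to_gguf.py /path/to/gemma-3-27b-it-unquantized \
  --outfile gemma-3-27b-it-bf16.gguf --outtype bf16
./build/bin/llama-quantize gemma-3-27b-it-bf16.gguf gemma-3-27b-it-q4_0.gguf Q4_0

The first step converts the HF checkpoint to a bf16 GGUF; the second quantizes it down to Q4_0 (or whatever scheme you want to experiment with).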
Might try using the models with mlx instead of ollama to see if that makes a difference
Any tips on prompting to get longer outputs?
Also, does the model context size determine max output size? Are the two related or are they independent characteristics of the model?
I was both impressed and disappointed. Setup was piss easy, and the models are great conversationalists. I have a 12 gig card available, and the 12B model ran very nicely and swiftly.
However, they're seemingly terrible at actually assisting with stuff. I tried something very basic: asked for a PowerShell one-liner to get the native block size of my disks. It ended up hallucinating fields, then sending me off into the deep end: first elevating to admin, then using WMI, then bringing up IOCTL. Pretty unfortunate. Not sure I'll be able to put it to actual meaningful use as a result.
Think it is NOT just you. Most companies with decent management also would not want their data going anywhere outside the physical servers they control. But yeah, for most people, just use an app and a hosted server. But this is HN; there are people here hosting their own email servers, so it shouldn't be too hard to run an LLM locally.
You may need to "right-size" the models you use to match your hardware, model, and TPS expectations, which may involve using a smaller version of the model with faster TPS, upgrading your hardware, or paying for hosted models.
Alternatively, if you can use agentic workflows or tools like Aider, you don't have to watch the model work slowly with large models locally. Instead you queue work for it, go to sleep, or eat, or do other work, and then much later look over the Pull Requests whenever it completes them.
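If you want to try the Aider route against a local model, something like this should work (this assumes Aider's Ollama support via LiteLLM and a model you've already pulled; treat the exact model name as a placeholder):

export OLLAMA_API_BASE=http://127.0.0.1:11434
aider --model ollama_chat/gemma3:27b-it-qat

Then you can queue up changes, walk away, and review the diffs it proposes later.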
Whatever those keyword things are, they certainly don't seem to be doing any form of RAG.
By comparison, Gemma3's output (both 12b and 27b) seems to typically be more long/verbose, but not problematically so.
I don't think that's been true for over a decade: AWS wouldn't be trillion dollar business if most companies still wanted to stay on-premise.
> Total output context up to 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size per request, subtracting the request input tokens
I don't know how to get it to output anything that length though.
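If you're running it through Ollama, the output cap is worth checking: the defaults are fairly low, and you can raise them from the REPL. A sketch (the specific values are just a starting point, not a recommendation):

ollama run gemma3:27b-it-qat
>>> /set parameter num_ctx 32768
>>> /set parameter num_predict -1

num_ctx is the total context (prompt + output) and num_predict caps the generated tokens (-1 removes the cap). With llama.cpp the rough equivalents are -c and -n.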
My best guess is that there's not enough discussion/development related to PowerShell in the training data.
(That's a massive simplification of how any of this works, but it's how I think about it at a high level.)
For some reason, it only uses around 7GB of VRAM, probably due to how the layers are scheduled, maybe I could tweak something there, but didn't bother just for testing.
Obviously, perf depends on CPU, GPU and RAM, but on my machine (3060 + i5-13500) it's around 2 t/s.
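If you do want to tweak it, the knob is how many layers get offloaded to the GPU: Ollama exposes it as the num_gpu parameter (Modelfile/API option), and llama.cpp as -ngl. A sketch with llama.cpp, where 30 is just a number to tune for a 12GB card, not a known-good value:

./build/bin/llama-cli -m gemma-3-27b-it-q4_0.gguf -ngl 30 -p "hello"

Raise the layer count until VRAM is full; whatever doesn't fit stays on the CPU.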
If this is your main use case, you can always try to fine-tune a model. I maintain a small LLM bench across different programming languages, and the performance difference between, say, Python and Rust on some smaller models is up to 70%.
We're working with a GPU-poor organization with very strict data residency requirements, and these models might be exactly what we need.
I would normally say VLLM, but the blog post notably does not mention VLLM support.
Will keep experimenting, will also try mistral3.1
edit: just tried mistral3.1 and the quality of the output is very good, at least compared to the other models I tried (llama2:7b-chat, llama2:latest, gemma3:12b, qwq and deepseek-r1:14b)
From doing some research: because of their training sets, it seems most models are not trained to produce long outputs, so even if they technically could, they won't. It might require developing my own training dataset and then doing some fine-tuning. Apparently the models and Ollama also have some safeguards against rambling and repetition.
I can imagine this will serve to drive prices for hosted LLMs lower.
At this level, any company that produces even a nominal amount of code should be running LLMs on-prem (or in AWS if you're on the cloud).
What amuses me even more is people thinking their code is too unique and precious, and that GitHub/Microsoft wants to steal it.
With LLMs, they're thinking of examples that regurgitated proprietary code, and contrary to everyday general observation, valuable proprietary code does exist.
But with GitHub, the thinking is generally the opposite: the worry is that the code is terrible, and seeing it would be like giant blinkenlights* indicating the way in.
That's not the stuff you want to send to a public API, this is something you want as a 24/7 locally running batch job.
("AI assistant" is an evolutionary dead end, and Star Trek be damned.)
Are there non-mac options with similar capabilities?
The unique benefit of an Apple Silicon Mac at the moment is that the 64GB of RAM is available to both the GPU and the CPU at once. With other hardware you usually need dedicated separate VRAM for the GPU.
Why is the memory use different? Are you using different context size in both set-ups?
Wish they showed benchmarks / added quantized versions to the arena! :>
https://ollama.com/library/gemma3:27b-it-qat says it's Q4_0. https://huggingface.co/mlx-community/gemma-3-27b-it-qat-4bit says it's 4bit. I think those are the same quantization?
> Last month, we launched Gemma 3, our latest generation of open models. Delivering state-of-the-art performance, Gemma 3 quickly established itself as a leading model capable of running on a single high-end GPU like the NVIDIA H100 using its native BFloat16 (BF16) precision.
> To make Gemma 3 even more accessible, we are announcing new versions optimized with Quantization-Aware Training (QAT) that dramatically reduces memory requirements while maintaining high quality.
The thing that's new, and that is clearly resonating with people, is the "To make Gemma 3 even more accessible..." bit.
There is another aspect to consider, aside from privacy.
These models are trained by downloading every scrap of information from the internet, including the works of many, many authors who have never consented to that. And they are for sure not going to get a share of the profits, if there are ever going to be any. If you use a cloud provider, you are basically saying that is all fine. You are happy to pay them, and make yourself dependent on their service, based on work that wasn't theirs to use.
However, if you use a local model, the authors still did not give consent, but one could argue that the company that made the model is at least giving back to the community. They don't get any money out of it, and you are not becoming dependent on their hyper-capitalist service. No rent-seeking. The benefits of the work are free for everyone to use. This makes using AI a little more acceptable from a moral standpoint.
"An iteration on a theme".
Once the network design is proven to work, yes, it's an impressive technical achievement. But as I've said, I've known people in multiple research institutes and companies using Gemma3 for a month, mostly saying they're surprised it's not getting noticed... This is just enabling more users, but the non-QAT version will almost always perform better...
gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful, not only for Python dev, but for Haskell and Common Lisp also.
I still like Gemini 2.5 Pro and o3 for brainstorming or working on difficult problems, but for routine work it (simply) makes me feel good to have everything open source/weights running on my own system.
When I bought my 32GB Mac a year ago, I didn't expect to be as happy as I am running gemma3:27b-it-qat with open-codex locally.
On some images where Gemma3 struggles Mistral Small produces better descriptions, BTW. But it seems harder to make it follow my instructions exactly.
I'm looking forward to the day when I can also do this with videos, a lot of which I also have no interest in uploading to someone else's computer.
Best to have two or more low-end, 16GB GPUs for a total of 32GB VRAM to run most of the better local models.
HN works best when people engage in good faith, stay curious, and try to move the conversation forward. That kind of tone — even when technically accurate — discourages others from participating and derails meaningful discussion.
If you’re getting downvotes regularly, maybe it's worth considering how your comments are landing with others, not just whether they’re “right.”
Regarding agentic workflows -- sounds nice but I am too scared to try it out, based on my experience with standard LLMs like GPT or Claude for writing code. Small snippets or filling in missing unit tests, fine, anything more complicated? Has been a disaster for me.
You'll find the id a418f5838eaf which also corresponds to 27b-it-q4_K_M
0: https://www.gmktec.com/products/prepaid-deposit-amd-ryzen™-a...
The high end consumer card from Nvidia is the RTX 5090, and the professional version of the card is the RTX PRO 6000.
I think these consumer GPUs are way too expensive for the amount of memory they pack - and that's intentional price discrimination. Also the builds are gimmicky. They're just not set up for AI models, and the versions that are cost $20k.
AMD has that 128GB RAM Strix Halo chip, but even with soldered RAM the bandwidth there is very limited - half that of an M4 Max, which is half that of a 4090.
I think this generation of hardware and local models is not there yet - would wait for M5/M6 release.
`ollama run gemma3:27b --verbose` gives me 42.5 TPS +-0.3TPS
`ollama run gemma3:27b-it-qat --verbose` gives me 41.5 TPS +-0.3TPS
Strange results; the full model gives me slightly more TPS.
Mapping from one JSON document with a lot of plain text into a new structure fails every time.
Ask it to generate SVG, and the output is very simple and almost too dumb.
Nice that it doesn't need that huge amount of RAM, and it performs OK on smaller languages from my initial tests.
It's definitely not extremely high-end any more; the price is (at least here) the same as the new mid-range consumer cards.
I guess the price can vary by location, but €1800 for a 3090 is crazy, that's more than the new price in 2020.
As it stands, Gemma will just say "Woman looking out in the desert sky."
I assume because of the assumption that the AI companies will train off of your data, causing it to leak? But I thought all these services had enterprise tiers where they'll promise not to do that?
Again, I'm not complaining, it's good to see people caring about where their data goes. Just interesting that they care now, but not before. (In some ways LLMs should be one of the safer services, since they don't even really need to store any data, they can delete it after the query or conversation is over.)
If you want a bit more context, try -ctv q8 -ctk q8 (from memory so look it up) to quant the kv cache.
Also, an imatrix GGUF like IQ4_XS might be smaller with better quality.
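For reference, I believe the current llama.cpp spellings are -ctk/-ctv (--cache-type-k/--cache-type-v) with values like q8_0, and quantizing the V cache also needs flash attention enabled. Assuming the gemma3 CLI accepts the common llama.cpp flags, roughly:

./build/bin/llama-gemma3-cli -m /discs/fast/ai/gemma-3-27b-it-q4_0.gguf \
  --mmproj /discs/fast/ai/mmproj-model-f16-27B.gguf \
  -c 16384 -fa -ctk q8_0 -ctv q8_0 \
  -p "Describe this image." --image ~/Downloads/surprise.png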
Laundering of data through training makes it a more complicated case than a simple data theft or copyright infringement.
Leaks could be accidental, e.g. due to an employee logging in to their free-as-in-labor personal account instead of a no-training Enterprise account. It's safer to have a complete ban on providers that may collect data for training.
Here is the JSON schema: https://pastebin.com/SiEJ6LEz System prompt: https://pastebin.com/R68QkfQu
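For anyone trying to reproduce this kind of setup with Ollama: its API accepts a JSON schema in the format field plus base64-encoded images, so the general shape is something like the below (this schema is a made-up stand-in, not the one from the pastebin):

curl http://localhost:11434/api/chat -d '{
  "model": "gemma3:27b-it-qat",
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "description": {"type": "string"},
      "tags": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["description", "tags"]
  },
  "messages": [
    {"role": "user", "content": "Describe and tag this image.", "images": ["<base64 image>"]}
  ]
}'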
Tbh I give up writing that in response to this rant. My polite poke holds and it's non insulting so I'm not going to capitulate to those childish enough to not look inwards.
Lots of AI companies have some of these, but not to the same extent.
There is no world in which training on customer data without permission would be worth it for AWS.
Your data really isn't that useful anyway.
Can you explain this further? It seems in contrast to your previous comment about trusting Anthropic with your data
As an aside, what model/tools do you prefer for tagging people?
If they get hit by a government subpoena because a journalist has been using them to analyze leaked corporate or government secret files I also trust them to honor that subpoena.
Sometimes journalists deal with material that they cannot risk leaving their own machine.
"News is what somebody somewhere wants to suppress"
It runs at around 26 tokens/sec at FP16; FP8 is not supported by the Radeon 7900 GRE.
I just love it.
For coding QwQ 32b is still king. But with a 16GB VRAM card it gives me ~3 tokens/sec, which is unusable.
I tried to make Gemma 3 write a powershell script with Terminal gui interface and it ran into dead-ends and finally gave up. QwQ 32B performed a lot better.
But for most general purposes it is great. My kid's been feeding it his school textbooks and asking it questions. It is better than anything else currently.
Somehow it is more "uptight" than Llama or the Chinese models like Qwen. Can't put my finger on it; the Chinese models seem nicer and more talkative.
> Be kind. Don't be snarky.
> Please don't post shallow dismissals, especially of other people's work.
In my opinion, your comment is not in line with the guidelines. Especially the part about sillytavern being the only LLM frontend that matters. Telling the devs of any LLM frontend except sillytavern that their app doesn't matter seems exactly like a shallow dismissal of other people's work to me.
If you're using MLX, that means you're on a mac, in which case ollama actually isn't your best option. Either directly use llama.cpp if you're a power user, or use LM Studio if you want something a bit better than ollama but more user friendly than llama.cpp. (LM Studio has a GUI and is also more user friendly than ollama, but has the downsides of not being as scriptable. You win some, you lose some.)
Don't use MLX, it's not as fast/small as the best GGUFs currently (and also tends to be more buggy, it currently has some known bugs with japanese). Download the LM Studio version of the Gemma 3 QAT GGUF quants, which are made by Bartowski. Google actually directly mentions Bartowski in blog post linked above (ctrl-f his name), and his models are currently the best ones to use.
https://huggingface.co/bartowski/google_gemma-3-27b-it-qat-G...
The "best Gemma 3 27b model to download" crown has taken a very roundabout path. After the initial Google release, it went from Unsloth Q4_K_M, to Google QAT Q4_0, to stduhpf Q4_0_S, to Bartowski Q4_0 now.
On my M1 Max MacBook Pro, the GGUF version bartowski/google_gemma-3-27b-it-qat-GGUF is 15.6GB and runs at 17 tok/sec, whereas mlx-community/gemma-3-27b-it-qat-4bit is 16.8GB and runs at 15 tok/sec. Note that both of these are the new QAT 4-bit quants.
The Gemma 3 27b QAT GGUF should be taking up ~15GB, not 22GB.
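If you want to grab the Bartowski Q4_0 from the command line, something like this should work (the --include pattern is an assumption; check the repo's file list for the exact .gguf name):

huggingface-cli download bartowski/google_gemma-3-27b-it-qat-GGUF \
  --include "*Q4_0*" --local-dir ./gemma-3-27b-it-qat

Then point llama.cpp or LM Studio at the downloaded .gguf.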
If you want natural language to resolve the names, that'd at a minimum require bounding boxes of the faces and their corresponding names. It'd also require either preprocessing, or specialized training, or both. To my knowledge no locally-hostable model as of today has that. I don't know if any proprietary models can do this either, but it's certainly worth a try - they might just do it. The vast majority of the things they can do is emergent, meaning they were never specifically trained to do them.
So for devices that have lots of memory but weaker processing power, it can get you similar output quality but faster. It tends to do better on CPU and APU-like setups.
Any important data should NOT be on devices that are NOT physically within our jurisdiction.
24b would be too small to run on device and I'm trying to keep my cloud costs low (meaning I can't afford to host a small 27b 24/7).
To be clear, I sometimes toggle open-codex to use the Gemini 2.5 Pro API also, but I enjoy running locally for simpler routine work.
Most companies' physical and digital security controls are so much worse than anything from AWS or Google. Note I don't include Azure... but "a physical server they have control of" is a phrase that screams vulnerability.
The privacy concerns are honestly mostly imaginary at this point, too. Plenty of hosted LLM vendors will promise not to train on your data. The bigger threat is if they themselves log data and then have a security incident, but honestly the risk that your own personal machine gets stolen or hacked is a lot higher than that.
? One single random document, maybe, but as an aggregate, I understood some parties were trying to scrape indiscriminately - the "big data" way. And if some of that input is sensitive, and is stored somewhere in the NN, it may come out in an output - in theory...
Actually I never researched the details of the potential phenomenon - that anything personal may be stored (not just George III but Random Randy) - but it seems possible.
That might have been true a few years ago but today the top AI labs are all focusing on quality: they're trying to find the best possible sources of high quality tokens, not randomly dumping in anything they can obtain.
Andrej Karpathy said this last year: https://twitter.com/karpathy/status/1797313173449764933
> Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.
If there exists some advantage to quantity, though, then achieving high quality imposes questions about tradeoffs and workflows - sources where authors are "free participants" could let odd data slip in.
And whether such data may be reflected in outputs remains a question (probably tackled by work I have not read... Ars longa, vita brevis).
There is really no technological path towards supercomputers that fast on a human timescale, even in 100 years.
The thing that makes LLMs useful is their ability to translate concepts from one domain to another. Overfitting on choice benchmarks, even a spread of them, will lower their usefulness in every general task by destroying information that is encoded in the weights.
Ask Gemma to write a 5-paragraph essay on any niche topic and you will get plenty of statements that have an extremely small likelihood of existing in relation to that topic, but a high likelihood of existing in related, more popular topics. ChatGPT less so, but still at least one per paragraph. I'm not talking about factual errors or common oversimplifications; I'm talking about completely unrelated statements. What you're asking about is largely outside its training data, of which a 27GB model retains what - a few hundred gigs' worth? Seems like a lot, but you have to remember that there is a lot of stuff that you probably don't care about that many people do. Stainless steel and Kubernetes are going to be well represented; your favorite media? Probably not. Relatively current? Definitely not. Which sounds fine, until you realize that people who care about stainless steel and Kubernetes likely care about some much more specific aspect which isn't going to be represented, and you are back to the same problem of low usability.
This is why I believe that scale is king and that both data and compute are the big walls. Google has Youtube data but they are only using it in Gemini.
query: "make me a snake game in python with pygame"
(mlx 4 bit quant) mlx-community/gemma-3-27b-it-qat@4bit: 26.39 tok/sec • 1681 tokens 0.63s to first token
(gguf 4 bit quant) lmstudio-community/gemma-3-27b-it-qat: 22.72 tok/sec • 1866 tokens 0.49s to first token
using Unsloth's settings: https://docs.unsloth.ai/basics/tutorial-how-to-run-and-fine-...