577 points simonw | 46 comments
1. NitpickLawyer ◴[] No.44723522[source]
> Two years ago when I first tried LLaMA I never dreamed that the same laptop I was using then would one day be able to run models with capabilities as strong as what I’m seeing from GLM 4.5 Air—and Mistral 3.2 Small, and Gemma 3, and Qwen 3, and a host of other high quality models that have emerged over the past six months.

Yes, the open models have surpassed my expectations in both quality and speed of release. For a bit of context, when ChatGPT launched in Dec '22, the "best" open models were GPT-J (6B) and GPT-NeoX (20B). I actually had an app running live, with users, using GPT-J for ~1 month. It was a pain. The quality was abysmal, there was no instruction following (you had to start your prompt like a story, or come up with a bunch of examples and hope the model would follow along), and so on.

And then something happened: the LLaMA models got "leaked" (I still think it was an on-purpose leak - don't sue us, we never meant to release it, etc.), and the rest is history. With L1 we got lots of optimisations like quantised models and fine-tuning, and Alpaca (and Alpaca-LoRA) showed off how cheap instruction tuning could be; with L2 fine-tuning really took off (most of the community fine-tunes were better than what Meta released), and then a bunch of really strong models came out (Mistrals, Mixtrals, L3, Gemmas, Qwens, DeepSeeks, GLMs, Granites, etc.)

By some estimates the open models are ~6mo behind what the SotA labs have released. (Note that doesn't mean the labs are releasing their best models; they likely keep those in house for data curation on the next runs, synthetic datasets, distilling, etc.) Being 6mo behind is NUTS! I never in my wildest dreams believed we'd be here. In fact I thought it would take ~2 years to reach GPT-3.5 levels. It's really something insane that we get to play with these models "locally", fine-tune them and so on.

replies(4): >>44723679 #>>44724534 #>>44726611 #>>44734796 #
2. tonyhart7 ◴[] No.44723679[source]
Is GLM 4.5 better than Qwen3 Coder?
replies(2): >>44723712 #>>44723745 #
3. diggan ◴[] No.44723712[source]
For what? It's really hard to say that one model is "generally" better than another, as they're all better/worse at specific things.

My own benchmark has a bunch of different tasks I use various local models for, and I run it when I want to see if a new model is better than the existing ones I use. The output is basically a markdown table with a description of which model is best for which task.

They're being sold as general-purpose things that are uniformly better or worse at everything, but reality doesn't reflect this: they all have very specific tasks they're better or worse at, and the only way to find that out is to have a private benchmark you run yourself.
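
A minimal sketch of what such a harness could look like, assuming an OpenAI-compatible local server (Ollama's default port shown) and placeholder model/task names:

```python
# private_bench.py - run your own tasks against several local models and
# emit a markdown table for manual grading.
from openai import OpenAI

# Assumption: a local OpenAI-compatible endpoint (Ollama default shown).
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

MODELS = ["glm-4.5-air", "qwen3-coder", "gemma3"]  # placeholder model names
TASKS = {
    "sql-fix": "Fix this SQL so it runs on SQLite: SELEC * FORM users;",
    "regex": "Write a Python regex that matches ISO-8601 dates.",
    "summarise": "Summarise this changelog entry in one sentence: ...",
}

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# One row per task, one column per model.
print("| task | " + " | ".join(MODELS) + " |")
print("|---" * (len(MODELS) + 1) + "|")
for name, prompt in TASKS.items():
    answers = [ask(m, prompt).replace("\n", " ")[:80] for m in MODELS]
    print(f"| {name} | " + " | ".join(answers) + " |")
```

The grading is still on you (does the regex compile, does the SQL run), but it turns "feels better" into something you can re-run every time a new model drops.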

replies(1): >>44724438 #
4. NitpickLawyer ◴[] No.44723745[source]
I haven't tried them (released yesterday, I think?). The benchmarks look good (similar, I'd say) but that's not saying much these days. The best test you can do is take a couple of cases that match your needs and run them yourself with the harness you're using (aider, cline, roo, any of the CLI tools, etc.). OpenRouter usually has new models up soon after launch, and you can run a quick test really cheaply (and only deal with one provider for billing & stuff).
5. kelvinjps10 ◴[] No.44724438{3}[source]
Coding? They are coding models. What specific tasks does one perform better than the other?
replies(2): >>44724873 #>>44724912 #
6. genewitch ◴[] No.44724534[source]
I'll bite. How do I train/make and/or use a LoRA, or, separately, how do I fine-tune? I've been asking this for months, and no one has a decent answer. Web search on my end is all SEO/GEO spam, with no real instructions.

I know how to make an SD LoRA, and use it. I've known how to do that for 2 years. So what's the big secret about LLM LoRA?

replies(9): >>44724589 #>>44724702 #>>44724887 #>>44725233 #>>44725409 #>>44727383 #>>44727527 #>>44729225 #>>44731516 #
7. minimaxir ◴[] No.44724589[source]
If you're using Hugging Face transformers, the library you want to use is peft: https://huggingface.co/docs/peft/en/quicktour

There are Colab Notebook tutorials around training models with it as well.
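
A minimal LoRA setup with peft looks roughly like this (the base model name is just a small placeholder; the quicktour above covers the full training loop):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA: train small low-rank adapter matrices instead of the full weights.
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                      # adapter rank
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of the params

# From here, training is the usual transformers Trainer / TRL SFTTrainer
# loop; afterwards model.save_pretrained("my-adapter") writes only the small
# adapter weights, which can be loaded on top of (or merged into) the base.
```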

8. notpublic ◴[] No.44724702[source]
https://github.com/unslothai/unsloth

I'm not sure if it contains exactly what you're looking for, but it includes several resources and notebooks related to fine-tuning LLMs (including LoRA) that I found useful.
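
For what it's worth, the Unsloth notebooks follow roughly this shape (the model name is a placeholder and exact arguments can differ between versions):

```python
from unsloth import FastLanguageModel

# Load a 4-bit quantised base model (placeholder name).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small matrices get trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Training then proceeds with TRL's SFTTrainer on your dataset, as in the
# linked notebooks, and fits on a single consumer GPU / free Colab.
```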

9. diggan ◴[] No.44724873{4}[source]
They may be, but there are lots of languages, lots of approaches, lots of methodologies, and just a ton of different ways to "code"; coding isn't one homogeneous activity that one model beats all the other models at.

> what specific tasks is one performing better than the other?

That's exactly why you create your own benchmark: so you can figure that out from a list of models and tasks, instead of testing each one ad hoc and going on what "feels better".

replies(1): >>44731614 #
10. techwizrd ◴[] No.44724887[source]
We have been fine-tuning models using Axolotl and Unsloth, with a slight preference for Axolotl. Check out the docs [0] and fine-tune or quantize your first model. There is a lot to be learned in this space, but it's exciting.

0: https://axolotl.ai/ and https://docs.axolotl.ai/

replies(2): >>44725288 #>>44725749 #
11. whimsicalism ◴[] No.44724912{4}[source]
GLM 4.5 is not a coding model.
replies(1): >>44724961 #
12. simonw ◴[] No.44724961{5}[source]
It may not be code-only, but it was trained extensively for coding:

> Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model's performance on key downstream domains.

From my notes here: https://simonwillison.net/2025/Jul/28/glm-45/

replies(1): >>44724989 #
13. whimsicalism ◴[] No.44724989{6}[source]
Yes, all reasoning models currently are, but it's not like DeepSeek Coder or Qwen Coder.
replies(1): >>44725024 #
14. simonw ◴[] No.44725024{7}[source]
I don't see how the training process for GLM-4.5 is materially different from that used for Qwen3-235B-A22B-Instruct-2507 - they both did a ton of extra reinforcement learning training related to code.

Am I missing something?

replies(1): >>44725295 #
15. svachalek ◴[] No.44725233[source]
For completeness, for Apple hardware MLX is the way to go.
replies(1): >>44726845 #
16. syntaxing ◴[] No.44725288{3}[source]
What hardware do you train on with Axolotl? I use Unsloth with Google Colab Pro.
17. whimsicalism ◴[] No.44725295{8}[source]
I think the primary thing you're missing is that Qwen3-235B-A22B-Instruct-2507 != Qwen3-Coder-480B-A35B-Instruct. The difference is that while both do tons of code RL, the Coder variant focuses entirely on the code post-training pipeline, doesn't monitor performance on anything else for forgetting/regression, and isn't meant for other tasks.
18. qcnguy ◴[] No.44725409[source]
LLM fine-tuning tends to destroy the model's capabilities if you aren't very careful. It's not as easy or effective as with image generation.
replies(2): >>44729336 #>>44732817 #
19. arkmm ◴[] No.44725749{3}[source]
When do you think fine tuning is worth it over prompt engineering a base model?

I imagine with the finetunes you have to worry about self-hosting, model utilization, and then also retraining the model as new base models come out. I'm curious under what circumstances you've found that the benefits outweigh the downsides.

replies(4): >>44726121 #>>44726652 #>>44726785 #>>44862452 #
20. whimsicalism ◴[] No.44726121{4}[source]
Fine-tuning rarely makes sense unless you're an enterprise, and even then it generally doesn't in most cases.
21. Nesco ◴[] No.44726611[source]
Zuck wouldn't have leaked it on 4chan, of all places.
replies(3): >>44726646 #>>44726851 #>>44743160 #
22. tough ◴[] No.44726646[source]
Probably just told an employee to get it done, no?
23. tough ◴[] No.44726652{4}[source]
Only for narrow applications where a fine-tune lets you use a smaller model locally, specialised and trained for your specific use case.
24. reissbaker ◴[] No.44726785{4}[source]
For self-hosting, there are a few companies that offer per-token pricing for LoRA finetunes (LoRAs are basically efficient-to-train, efficient-to-host finetunes) of certain base models:

- (shameless plug) My company, Synthetic, supports LoRAs for Llama 3.1 8b and 70b: https://synthetic.new All you need to do is give us the Hugging Face repo and we take care of the rest. If you want other people to try your model, we charge usage to them rather than to you. (We can also host full finetunes of anything vLLM supports, although we charge by GPU-minute for full finetunes rather than the cheaper per-token pricing for supported base model LoRAs.)

- Together.ai supports a slightly wider number of base models than we do, with a bit more config required, and any usage is charged to you.

- Fireworks does the same as Together, although they quantize the models more heavily (FP4 for the higher-end models). However, they support Llama 4, which is pretty nice although fairly resource-intensive to train.

If you have reasonably good data for your task, and your task is relatively "narrow" (i.e. find a specific kind of bug, rather than general-purpose coding; extract a specific kind of data from legal documents rather than general-purpose reasoning about social and legal matters; etc), finetunes of even a very small model like an 8b will typically outperform — by a pretty wide margin — even very large SOTA models while being a lot cheaper to run. For example, if you find yourself hand-coding heuristics to fix some problem you're seeing with an LLM's responses, it's probably more robust to just train a small model finetune on the data and have the finetuned model fix the issues rather than writing hardcoded heuristics. On the other hand, no amount of finetuning will make an 8b model a better general-purpose coding agent than Claude 4 Sonnet.

replies(1): >>44729170 #
25. w10-1 ◴[] No.44726845{3}[source]
MLX github: https://github.com/ml-explore/mlx

get started: https://developer.apple.com/videos/play/wwdc2025/315/

details: https://developer.apple.com/videos/play/wwdc2025/298/
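
To get a feel for the workflow, a tiny mlx-lm sketch (the model name is a placeholder from the mlx-community conversions, and exact arguments can shift between versions):

```python
# Assumes `pip install mlx-lm` on Apple silicon.
from mlx_lm import load, generate

# Placeholder: any of the 4-bit conversions under mlx-community should work.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
print(generate(model, tokenizer,
               prompt="Explain LoRA in one sentence.",
               max_tokens=100))

# LoRA fine-tuning itself is driven by the mlx_lm.lora entry point; see the
# MLX docs linked above for the exact flags and data format.
```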

26. vaenaes ◴[] No.44726851[source]
Why not?
27. electroglyph ◴[] No.44727383[source]
Unsloth is the easiest way to fine-tune due to its lower memory requirements.
28. pdntspa ◴[] No.44727527[source]
Have you tried asking an LLM?
29. delijati ◴[] No.44729170{5}[source]
Do you maybe know if there is a company in the EU that hosts models (DeepSeek, Qwen3, Kimi)?
replies(1): >>44730893 #
30. jasonjmcghee ◴[] No.44729225[source]
brev.dev made an easy-to-follow guide a while ago, but apparently Nvidia took it down or something when they acquired them?

So here's the original

https://web.archive.org/web/20231127123701/https://brev.dev/...

31. israrkhan ◴[] No.44729336{3}[source]
Do you have a suggestion or a way to measure whether model capabilities are getting destroyed? How does one measure it objectively?
replies(2): >>44729673 #>>44733183 #
32. RALaBarge ◴[] No.44729673{4}[source]
Ask it the same series of questions after training that you posed before training started. Is the quality lower?
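
A rough sketch of that kind of before/after spot-check (placeholder model paths; it only catches regressions in the areas you actually probe):

```python
from transformers import pipeline

# Fixed probe set written *before* training: mix in general questions,
# not just your fine-tuning domain, to spot collateral damage.
PROBES = [
    "Explain the difference between TCP and UDP in two sentences.",
    "Write a Python function that reverses a linked list.",
    "What year did the Apollo 11 landing happen?",
]

def sample(model_path: str) -> dict[str, str]:
    gen = pipeline("text-generation", model=model_path, device_map="auto")
    return {p: gen(p, max_new_tokens=200, do_sample=False)[0]["generated_text"]
            for p in PROBES}

before = sample("Qwen/Qwen2.5-0.5B-Instruct")  # base model (placeholder)
after = sample("./my-finetune")                # your fine-tuned checkpoint

# Dump side by side for manual review (or feed both to a judge model).
for p in PROBES:
    print(f"## {p}\n--- before ---\n{before[p]}\n--- after ---\n{after[p]}\n")
```
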
replies(1): >>44731429 #
33. reissbaker ◴[] No.44730893{6}[source]
Most inference companies (Synthetic included) host in a mix of the U.S. and EU — I don't know of any that promise EU-only hosting, though. Even Mistral doesn't promise EU-only AFAIK, despite being a French company. I think at that point you're probably looking at on-prem hosting, or buying a maxed-out Mac Studio and running the big models quantized to Q4 (although even that couldn't run Kimi: you might be able to get it working over ethernet with two Mac Studios, but the tokens/sec will be pretty rough).
34. israrkhan ◴[] No.44731429{5}[source]
That series of questions will only measure a particular area. I'm concerned about destroying model capabilities in some other area that I don't pay attention to, and have no way of knowing about.
replies(1): >>44731676 #
35. otabdeveloper4 ◴[] No.44731516[source]
> So what's the big secret about LLM LoRA?

No clear use case for LLM LoRAs yet. ("Spicy" aka pornography fine-tunes are the only ones with broad adoption, but we don't talk about that in polite society here.)

replies(1): >>44732896 #
36. reverius42 ◴[] No.44731614{5}[source]
> coding isn't one homogeneous activity that one model beats all the other models at

If you can't even replace one coding model with another, it's hard to imagine you can replace human coders with coding models.

replies(2): >>44732619 #>>44733032 #
37. simonh ◴[] No.44731676{6}[source]
Isn’t that a general problem with LLMs? The only way to know how good it is at something is to test it.
38. diggan ◴[] No.44732619{6}[source]
What do you mean, "can't even replace"? You can; nothing in my comment says you cannot.
39. nxobject ◴[] No.44732817{3}[source]
My very cursory understanding -- at least from Unsloth's recommendations -- is that you have to work very hard to preserve reasoning/instruct capabilities [1]: for example to "preserve" Qwen3's reasoning capabilities (however that's operationalized), they suggest a fine-tuning corpus that's 75% chain of thought to 25% non-reasoning. Is that a significant issue for orgs/projects that currently rely on fine-tuning?

[1] https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tun...
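
If you do go that route, the mixing itself is straightforward with the datasets library; a sketch with placeholder dataset names, using the 75/25 ratio from the Unsloth docs:

```python
from datasets import load_dataset, interleave_datasets

# Placeholder names: one reasoning/CoT corpus, one plain-chat corpus.
reasoning = load_dataset("my-org/cot-traces", split="train")
chat = load_dataset("my-org/plain-chat", split="train")

# Sample ~75% chain-of-thought and ~25% non-reasoning examples, per the
# ratio Unsloth suggests for preserving Qwen3's reasoning behaviour.
mixed = interleave_datasets(
    [reasoning, chat],
    probabilities=[0.75, 0.25],
    seed=42,
    stopping_strategy="all_exhausted",
)
mixed = mixed.shuffle(seed=42)
```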

40. AlecSchueler ◴[] No.44732896{3}[source]
Where do we speak about it? It feels like the biggest use for these models right now is deep fakes and other harassment, but few people in the industry want to talk about it while continuing to enable it.
41. Philpax ◴[] No.44733032{6}[source]
You probably can't replace a seasoned COBOL programmer with a seasoned Haskell programmer. Does that mean that either person is bad at programming as a whole?
replies(1): >>44733713 #
42. mensetmanusman ◴[] No.44733183{4}[source]
These are now questions at the cutting edge of academic research. It might be computationally unknowable until checked.
43. reverius42 ◴[] No.44733713{7}[source]
This was my point -- if programmers are not fungible, how can companies claim to be replacing them by the thousands with AI?
replies(1): >>44737233 #
44. Philpax ◴[] No.44737233{8}[source]
You don't need to use the same model/system for every task. "AI" isn't a monolith; there's a spectrum of solutions for a spectrum of problems, and figuring out what's applicable to your problem today is one of the larger problems of deployment.
45. eckelhesten ◴[] No.44743160[source]
It got leaked as a PR with a URL to a magnet link (torrent), AFAIK.
46. seunosewa ◴[] No.44862452{4}[source]
When prompt engineering isn't giving you reliable results.