My 2.5 year old laptop can write Space Invaders in JavaScript now (GLM-4.5 Air)

(simonwillison.net)

577 points simonw | 1 comments | 29 Jul 25 13:45 UTC

NitpickLawyer ◴[29 Jul 25 14:03 UTC] No.44723522[source]▶

> Two years ago when I first tried LLaMA I never dreamed that the same laptop I was using then would one day be able to run models with capabilities as strong as what I’m seeing from GLM 4.5 Air—and Mistral 3.2 Small, and Gemma 3, and Qwen 3, and a host of other high quality models that have emerged over the past six months.

Yes, the open models have surpassed my expectations in both quality and speed of release. For a bit of context, when ChatGPT launched at the end of Nov 2022, the "best" open models were GPT-J (6B) and GPT-NeoX (20B). I actually had an app running live, with users, using GPT-J for ~1 month. It was a pain. The quality was abysmal, and there was no instruction following (you had to start your prompt like a story, or come up with a bunch of examples and hope the model would follow along).

And then something happened: the LLaMA models got "leaked" (I still think it was an on-purpose leak: don't sue us, we never meant to release, etc.), and the rest is history. With L1 we got lots of optimisations like quantised models, fine-tuning and so on; L2 really saw fine-tuning go off (most of the fine-tunes were better than what Meta released); we got Alpaca showing off LoRA; and then a bunch of really strong models came out (Mistrals, Mixtrals, L3, Gemmas, Qwens, DeepSeeks, GLMs, Granites, etc.).

By some estimates the open models are ~6 months behind what the SotA labs have released. (Note that this doesn't mean the labs are releasing their best models; it's likely they keep those in house to use for the next runs' data curation, synthetic datasets, distillation, etc.) Being 6 months behind is NUTS! I never in my wildest dreams believed we'd be here. In fact I thought it would take ~2 years to reach GPT-3.5 levels. It's really something insane that we get to play with these models "locally", fine-tune them and so on.

replies(4): >>44723679 #>>44724534 #>>44726611 #>>44734796 #

tonyhart7 ◴[29 Jul 25 14:16 UTC] No.44723679[source]▶

>>44723522 #

Is GLM 4.5 better than Qwen3 Coder?

replies(2): >>44723712 #>>44723745 #

diggan ◴[29 Jul 25 14:19 UTC] No.44723712[source]▶

>>44723679 #

For what? It's really hard to say which model is "generally" better than another, as they're all better/worse at specific things.

My own benchmark has a bunch of different tasks I use various local models for, and I run it when I want to see if a new model is better than the existing ones I use. The output is basically a markdown table with a description of which model is best for what task.

They're being sold as general-purpose things that are better/worse at everything, but reality doesn't reflect this. They all have very specific tasks they're worse/better at, and the only way to find that out is by having a private benchmark you run yourself.
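
A minimal sketch of what such a private benchmark could look like, not the commenter's actual setup. Everything here is an assumption for illustration: the two "models" are toy stand-in functions (in practice each would call your own local inference setup), and the task names, prompts, and scorers are made up.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-ins for local models: each maps a prompt to a reply.
Model = Callable[[str], str]
Scorer = Callable[[str], float]


def model_a(prompt: str) -> str:
    return prompt.upper()


def model_b(prompt: str) -> str:
    return prompt[::-1]


# Each task is a prompt plus a scorer that returns a number in [0, 1].
TASKS: Dict[str, Tuple[str, Scorer]] = {
    "uppercase": ("hello", lambda out: 1.0 if out == "HELLO" else 0.0),
    "reverse": ("abc", lambda out: 1.0 if out == "cba" else 0.0),
}

MODELS: Dict[str, Model] = {"model-a": model_a, "model-b": model_b}


def run_benchmark(
    models: Dict[str, Model], tasks: Dict[str, Tuple[str, Scorer]]
) -> Dict[str, Dict[str, float]]:
    """Score every model on every task; returns scores[task][model]."""
    scores: Dict[str, Dict[str, float]] = {}
    for task_name, (prompt, scorer) in tasks.items():
        scores[task_name] = {
            name: scorer(model(prompt)) for name, model in models.items()
        }
    return scores


def best_per_task_table(scores: Dict[str, Dict[str, float]]) -> str:
    """Render a markdown table naming the top-scoring model per task."""
    lines = ["| Task | Best model | Score |", "| --- | --- | --- |"]
    for task_name, by_model in scores.items():
        # On a tie, max() keeps the first model in insertion order.
        best = max(by_model, key=by_model.get)
        lines.append(f"| {task_name} | {best} | {by_model[best]:.2f} |")
    return "\n".join(lines)


scores = run_benchmark(MODELS, TASKS)
table = best_per_task_table(scores)
print(table)
```

Adding a new model then means adding one entry to `MODELS` and re-running, which is what makes the comparison repeatable instead of a "feels better" judgement.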

replies(1): >>44724438 #

kelvinjps10 ◴[29 Jul 25 15:16 UTC] No.44724438[source]▶

>>44723712 #

coding? they are coding models? what specific tasks is one performing better than the other?

replies(2): >>44724873 #>>44724912 #

diggan ◴[29 Jul 25 15:51 UTC] No.44724873[source]▶

>>44724438 #

They may be, but there are lots of languages, lots of approaches, lots of methodologies and just a ton of different ways to "code"; coding isn't one homogeneous activity that one model beats all the other models at.

> what specific tasks is one performing better than the other?

That's exactly why you create your own benchmark, so you can figure that out by just having a list of models, instead of testing each individually and basing it on "feels better".

replies(1): >>44731614 #

reverius42 ◴[30 Jul 25 07:15 UTC] No.44731614{3}[source]▶

>>44724873 #

> coding isn't one homogeneous activity that one model beats all the other models at

If you can't even replace one coding model with another, it's hard to imagine you can replace human coders with coding models.

replies(2): >>44732619 #>>44733032 #

Philpax ◴[30 Jul 25 11:52 UTC] No.44733032{4}[source]▶

>>44731614 #

You probably can't replace a seasoned COBOL programmer with a seasoned Haskell programmer. Does that mean that either person is bad at programming as a whole?

replies(1): >>44733713 #

reverius42 ◴[30 Jul 25 13:06 UTC] No.44733713{5}[source]▶

>>44733032 #

This was my point -- if programmers are not fungible, how can companies claim to be replacing them by the thousands with AI?

replies(1): >>44737233 #

1. Philpax ◴[30 Jul 25 17:42 UTC] No.44737233{6}[source]▶

>>44733713 #

You don't need to use the same model/system for every task. "AI" isn't a monolith; there's a spectrum of solutions for a spectrum of problems, and figuring out what's applicable to your problem today is one of the larger problems of deployment.
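
One way to picture "not the same model for every task" is a small routing table. This is only a sketch under stated assumptions: the task names, model names, and scores below are hypothetical placeholders, not real measurements, and the fallback choice is arbitrary.

```python
from typing import Dict

# Hypothetical per-task scores, e.g. the output of a private benchmark.
SCORES: Dict[str, Dict[str, float]] = {
    "sql-migrations": {"model-a": 0.9, "model-b": 0.6},
    "react-components": {"model-a": 0.5, "model-b": 0.8},
}

# Assumed fallback for tasks that were never benchmarked.
DEFAULT_MODEL = "model-a"


def build_routes(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each task to its highest-scoring model."""
    return {task: max(by_model, key=by_model.get) for task, by_model in scores.items()}


def pick_model(task: str, routes: Dict[str, str], default: str = DEFAULT_MODEL) -> str:
    """Return the routed model for a task, or the default if unknown."""
    return routes.get(task, default)


routes = build_routes(SCORES)
print(pick_model("react-components", routes))
print(pick_model("unknown-task", routes))
```

The hard part in practice is producing trustworthy scores for your own tasks; the routing itself is just a lookup.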
