My 2.5 year old laptop can write Space Invaders in JavaScript now (GLM-4.5 Air)

(simonwillison.net)

1. AlexeyBrin ◴[29 Jul 25 14:02 UTC] No.44723521[source]▶

>>44723316 (OP) #

Most likely its training data included countless Space Invaders in various programming languages.

replies(6): >>44723664 #>>44723707 #>>44723945 #>>44724116 #>>44724439 #>>44724690 #

2. quantumHazer ◴[29 Jul 25 14:15 UTC] No.44723664[source]▶

>>44723521 (TP) #

And probably some of the synthetic data is generated copies of games already in the dataset?

I have this feeling with LLM-generated React frontends: they all look the same.

replies(4): >>44723867 #>>44724566 #>>44724902 #>>44731430 #

3. NitpickLawyer ◴[29 Jul 25 14:19 UTC] No.44723707[source]▶

>>44723521 (TP) #

This comment is ~3 years late. Every model since gpt3 has had the entirety of available code in their training data. That's not a gotcha anymore.

We went from chatgpt's "oh, look, it looks like python code but everything is wrong" to "here's a full stack boilerplate app that does what you asked and works in 0-shot" inside 2 years. That's the kicker. And the sauce isn't just in the training set, models now do post-training and RL and a bunch of other stuff to get to where we are. Not to mention the insane abilities with extended context (first models were 2/4k max), agentic stuff, and so on.

These kinds of comments are really missing the point.

replies(7): >>44723808 #>>44723897 #>>44724175 #>>44724204 #>>44724397 #>>44724433 #>>44729201 #

4. haar ◴[29 Jul 25 14:26 UTC] No.44723808[source]▶

>>44723707 #

I've had little success with Agentic coding, and what success I have had has been paired with hours of frustration, where I'd have been better off doing it myself for anything but the most basic tasks.

Even then, when you start to build up complexity within a codebase - the results have often been worse than "I'll start generating it all from scratch again, and include this as an addition to the initial longtail specification prompt as well", and even then... it's been a crapshoot.

I _want_ to like it. The times where it initially "just worked" felt magical and inspired me with the possibilities. That's what prompted me to get more engaged and use it more. The reality of doing so is just frustrating and wishing things _actually worked_ anywhere close to expectations.

replies(1): >>44724064 #

5. bayindirh ◴[29 Jul 25 14:29 UTC] No.44723867[source]▶

>>44723664 #

Last time somebody asked for a "premium camera app for iOS", and the model (re)generated Halide.

Models don't emit something they don't know. They remix and rewrite what they know. There's no invention, just recall...

replies(4): >>44724102 #>>44724181 #>>44724845 #>>44726775 #

6. MyOutfitIsVague ◴[29 Jul 25 14:31 UTC] No.44723897[source]▶

>>44723707 #

I don't think they are missing the point, because they're pointing out that the tools are still the most useful for patterns that are extremely widely known and repeated. I use Gemini 2.5 Pro every day for coding, and even that one still falls over on tasks that aren't well known to it (which is why I break the problem down into small parts that I know it'll be able to handle properly).

It's kind of funny, because sometimes these tools are magical and incredible, and sometimes they are extremely stupid in obvious ways.

Yes, these are impressive, and especially so for local models that you can run yourself, but there is a gap between "absolutely magical" and "pretty cool, but needs heavy guiding" depending on how heavily the ground you're treading has been walked upon.

For a heavily explored space, it's like being impressed that your 2.5-year-old M2 with 64 GB RAM can extract some source code from a zip file. It's worth being impressed and excited about the space and the pace of improvement, but it's also worth stepping back and thinking rationally about the specific benchmark at hand.

replies(1): >>44724130 #

7. elif ◴[29 Jul 25 14:34 UTC] No.44723945[source]▶

>>44723521 (TP) #

Most likely this comment included countless similar comments in its training data, likely all synthetic without any actual tether to real analysis.

8. aschobel ◴[29 Jul 25 14:43 UTC] No.44724064{3}[source]▶

>>44723808 #

Bingo, it's magical but the learning curve is very very steep. The METR study on open-source productivity alluded to this a bit.

I am definitely at a point where I am more productive with it, but it took a bunch of effort.

replies(2): >>44724470 #>>44724770 #

9. FeepingCreature ◴[29 Jul 25 14:47 UTC] No.44724102{3}[source]▶

>>44723867 #

True where trivial; where nontrivial, false.

Trivially, humans don't emit something they don't know either. You don't spontaneously figure out JavaScript from first principles, you put together your existing knowledge into new shapes.

Nontrivially, LLMs can absolutely produce code for entirely new requirements. I've seen them do it many times. Will it be put together from smaller fragments? Yes, this is called "experience" or if the fragments are small enough, "understanding".

replies(2): >>44724137 #>>44724530 #

10. Conflonto ◴[29 Jul 25 14:48 UTC] No.44724116[source]▶

>>44723521 (TP) #

That sounds so dismissive.

I never used to be able to just download an 8-16 GB file that could then generate A LOT of different tools, games, etc. for me in multiple programming languages, while in parallel ELI5-ing research papers, generating SVGs, and a lot lot lot more.

But hey.

11. NitpickLawyer ◴[29 Jul 25 14:49 UTC] No.44724130{3}[source]▶

>>44723897 #

> because they're pointing out that the tools are still the most useful for patterns that are extremely widely known and repeated

I agree with you, but your take is much more nuanced than what the GP comment said! These models don't simply regurgitate the training set. That was my point with gpt3. The models have advanced from that, and can now "generalise" over the context in ways they could not do ~3 years ago. We are now at a point where you can write a detailed spec (10-20k tokens) for an unseen scripting language, and have SotA models a) write a parser and b) start writing scripts for you in that language, even though they never saw that particular scripting language anywhere in their training set. Try it. You'll be surprised.

12. bayindirh ◴[29 Jul 25 14:50 UTC] No.44724137{4}[source]▶

>>44724102 #

Humans can observe ants and invent ant colony optimization. AIs can’t.

Humans can explore what they don’t know. AIs can’t.

replies(5): >>44724200 #>>44724373 #>>44724567 #>>44724658 #>>44731957 #

13. jayd16 ◴[29 Jul 25 14:54 UTC] No.44724175[source]▶

>>44723707 #

I think you're missing the point.

Showing off moderately complicated results that are actually not indicative of performance because they are sniped by the training data turns this from a cool demo to a parlor trick.

Stating that, aha, joke's on you, that's the status quo, is an even bigger indictment.

14. satvikpendem ◴[29 Jul 25 14:54 UTC] No.44724181{3}[source]▶

>>44723867 #

This doesn't make sense thermodynamically because models are far smaller than the training data they purport to hold and recall, so there must be some level of "understanding" going on. Whether that's the same as human understanding is a different matter.

replies(1): >>44726179 #

15. falcor84 ◴[29 Jul 25 14:56 UTC] No.44724200{5}[source]▶

>>44724137 #

What makes you categorically say that "AIs can't"?

Based on my experience with present-day AIs, I personally wouldn't be surprised at all if you showed Gemini 2.5 Pro a video of an insect colony, asked it "Take a look at the way they organize and see if that gives you inspiration for an optimization algorithm", and it spat something interesting out.

replies(1): >>44725223 #

16. jan_Sate ◴[29 Jul 25 14:56 UTC] No.44724204[source]▶

>>44723707 #

Not exactly. The real utility value of LLM for programming is to come up with something new. For Space Invaders, instead of using LLM for that, I might as well just manually search for the code online and use that.

To show that an LLM actually can provide value for one-shot programming, you need to find a problem for which no fully working sample code is available online. I'm not trying to say that an LLM couldn't do that. But just because an LLM can come up with a perfectly working Space Invaders doesn't mean that it could do that.

replies(2): >>44724519 #>>44724841 #

17. FeepingCreature ◴[29 Jul 25 15:10 UTC] No.44724373{5}[source]▶

>>44724137 #

What makes you categorically say that "humans can"?

I couldn't do that with an ant colony. I would have to train on ant research first.

(Oh, and AIs can absolutely explore what they don't know. Watch a Claude Code instance look at a new repository. Exploration is a convergent skill in long-horizon RL.)

18. Aurornis ◴[29 Jul 25 15:12 UTC] No.44724397[source]▶

>>44723707 #

> These kinds of comments are really missing the point.

I disagree. In my experience, asking coding tools to produce something similar to all of the tutorials and example code out there works amazingly well.

Asking them to produce novel output that doesn’t match the training set produces very different results.

When I tried multiple coding agents for a somewhat unique task recently they all struggled, continuously trying to pull the solution back to the standard examples. It felt like an endless loop of the models grinding through a solution and then spitting out something that matched common examples, after which I had to remind them of the unique properties of the task and they started all over again, eventually arriving back in the same spot.

It shows the reality of working with LLMs and it’s an important consideration.

19. AlexeyBrin ◴[29 Jul 25 15:15 UTC] No.44724433[source]▶

>>44723707 #

You are reading too much into my comment. My point was that the test (a Space Invaders clone) used to assess the model has been irrelevant for some time now. I could have gotten a similar result with Mistral Small a few months ago.

20. phkahler ◴[29 Jul 25 15:16 UTC] No.44724439[source]▶

>>44723521 (TP) #

I find the visual similarity to Breakout kind of interesting.

21. devmor ◴[29 Jul 25 15:18 UTC] No.44724470{4}[source]▶

>>44724064 #

The subjects in the study you are referencing also believed that they were more productive with it. What metrics do you have to convince yourself you aren't under the same illusionary bias they were?

replies(1): >>44724497 #

22. simonw ◴[29 Jul 25 15:20 UTC] No.44724497{5}[source]▶

>>44724470 #

Yesterday I used ffmpeg to extract the frame at the 13 second mark of a video out as a JPEG.

If I didn't have an LLM to figure that out for me I wouldn't have done it at all.
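For reference, a minimal sketch of the kind of command involved. The file names `input.mp4` and `frame.jpg` are hypothetical (the thread does not give the actual command or files), and the Python helper only builds the argument list; running it assumes ffmpeg is installed and on the PATH.

```python
def frame_grab_command(video: str, seconds: float, out_jpeg: str) -> list[str]:
    """Build an ffmpeg argument list that saves one frame at `seconds` as a JPEG."""
    return [
        "ffmpeg",
        "-ss", str(seconds),  # seek position; placed before -i, so it applies to the input
        "-i", video,          # input file
        "-frames:v", "1",     # write exactly one video frame
        out_jpeg,             # the .jpg extension selects JPEG output
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = frame_grab_command("input.mp4", 13, "frame.jpg")
    print(" ".join(cmd))
    # To actually run it (requires ffmpeg):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if a file name contains spaces.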

replies(4): >>44724574 #>>44724628 #>>44724962 #>>44733418 #

23. devmor ◴[29 Jul 25 15:22 UTC] No.44724519{3}[source]▶

>>44724204 #

> The real utility value of LLM for programming is to come up with something new.

That's the goal for these projects anyway. I don't know that it's true or feasible. I find the RAG models much more interesting myself; I see the technology as having far more value in search than generation.

Rather than write some Markov-chain-reminiscent Frankenstein function when I ask it how to solve a problem, I would like to see it direct me to the original sources it would use to build those tokens, so that I can see their implementations in context and use my judgement.

replies(1): >>44724556 #

24. phkahler ◴[29 Jul 25 15:23 UTC] No.44724530{4}[source]▶

>>44724102 #

>> Nontrivially, LLMs can absolutely produce code for entirely new requirements. I've seen them do it many times.

I think most people writing software today are reinventing a wheel, even in corporate environments for internal tools. Everyone wants their own tweak or thinks their idea is unique and nobody wants to share code publicly, so everyone pays programmers to develop buggy bespoke custom versions of the same stuff that's been done 100 times before.

I guess what I'm saying is that your requirements are probably not new, and to the extent they are yes an LLM can fill in the blanks due to its fluency in languages.

replies(1): >>44743498 #

25. simonw ◴[29 Jul 25 15:24 UTC] No.44724556{4}[source]▶

>>44724519 #

"I would like to see it direct me to the original sources it would use to build those tokens"

Sadly that's not feasible with transformer-based LLMs: those original sources are long gone by the time you actually get to use the model, scrambled a billion times into a trained set of weights.

One thing that helped me understand this is understanding that every single token output by an LLM is the result of a calculation that considers all X billion parameters that are baked into that model (or a subset of that in the case of MoE models, but it's still billions of floating point calculations for every token.)

You can get an imitation of that if you tell the model "use your search tool and find example code for this problem and build new code based on that", but that's a pretty unconventional way to use a model. A key component of the value of these things is that they can spit out completely new code based on the statistical patterns they learned through training.

replies(1): >>44724604 #

26. tshaddox ◴[29 Jul 25 15:25 UTC] No.44724566[source]▶

>>44723664 #

To be fair, the human-generated user interfaces all look the same too.

replies(1): >>44724679 #

27. CamperBob2 ◴[29 Jul 25 15:25 UTC] No.44724567{5}[source]▶

>>44724137 #

That's what benchmarks like ARC-AGI are designed to test. The models are getting better at it, and you aren't.

Nothing ultimately matters in this business except the first couple of time derivatives.

28. devmor ◴[29 Jul 25 15:26 UTC] No.44724574{6}[source]▶

>>44724497 #

You wouldn't have just typed "extract frame at timestamp as jpeg ffmpeg" into Google and used the StackExchange result that comes up first that gives you a command to do exactly that?

replies(1): >>44724615 #

29. devmor ◴[29 Jul 25 15:28 UTC] No.44724604{5}[source]▶

>>44724556 #

I am aware, and that's exactly why I don't think they're anywhere near as useful for this type of work as the people pushing them want them to be.

I tried to push for this type of model when an org I worked with over a decade ago was first exploring using the first generation of Tensorflow to drive customer service chatbots and was sadly ignored.

replies(1): >>44724629 #

30. simonw ◴[29 Jul 25 15:29 UTC] No.44724615{7}[source]▶

>>44724574 #

Before LLMs made ffmpeg no-longer-frustrating-to-use I genuinely didn't know that ffmpeg COULD do things like that.

replies(1): >>44726857 #

31. dingnuts ◴[29 Jul 25 15:31 UTC] No.44724628{6}[source]▶

>>44724497 #

It is nice to use LLMs to generate ffmpeg commands, because those can be pretty tricky, but really, you wouldn't have just used the man page before?

That explains a lot about Django that the author is allergic to man pages lol

replies(2): >>44724660 #>>44726328 #

32. simonw ◴[29 Jul 25 15:31 UTC] No.44724629{6}[source]▶

>>44724604 #

I don't understand. For code, why would I want to remix existing code snippets?

I totally get the value of RAG style patterns for information retrieval against factual information - for those I don't want the LLM to answer my question directly, I want it to run a search and show me a citation and directly quote a credible source as part of answering.

For code I just want code that works - I can test it myself to make sure it does what it's supposed to.

replies(1): >>44724850 #

33. ben_w ◴[29 Jul 25 15:33 UTC] No.44724658{5}[source]▶

>>44724137 #

> Humans can observe ants and invent ant colony optimization. AIs can’t.

Surely this is exactly what current AI do? Observe stuff and apply that observation? Isn't this the exact criticism, that they aren't inventing ant colonies from first principles without ever seeing one?

> Humans can explore what they don’t know. AIs can’t.

We only learned to decode Egyptian hieroglyphs because of the Rosetta Stone. There's no translation for North Sentinelese, the Voynich manuscript, or Linear A.

We're not magic.

34. simonw ◴[29 Jul 25 15:33 UTC] No.44724660{7}[source]▶

>>44724628 #

I just took a look, and the man page DOES explain how to do that!

... on line 3,218: https://gist.github.com/simonw/6fc05ea7392c5fb8a5621d65e0ed0...

(I am very confident I am not the only person who has been deterred by ffmpeg's legendarily complex command-line interface. I feel no shame about this at all.)

replies(3): >>44725920 #>>44730126 #>>44731612 #

35. ◴[29 Jul 25 15:34 UTC] No.44724679{3}[source]▶

>>44724566 #

36. gblargg ◴[29 Jul 25 15:34 UTC] No.44724690[source]▶

>>44723521 (TP) #

The real test is if you can have it tweak things. Have the ship shoot down. Have the space invaders come from the left and right. Add two player simultaneous mode with two ships.

replies(1): >>44727900 #

37. haar ◴[29 Jul 25 15:41 UTC] No.44724770{4}[source]▶

>>44724064 #

Apologies if I was unclear.

The more I've used it, the more I've disliked how poor its results have been, and the more I've realised I would have been better served by doing it myself and following a methodical path for things that I didn't have experience with.

It's easier to step through a problem as I'm learning and making small changes than an LLM going "It's done, and production ready!" where it just straight up doesn't work for 101 different tiny reasons.

replies(1): >>44732184 #

38. tracker1 ◴[29 Jul 25 15:48 UTC] No.44724841{3}[source]▶

>>44724204 #

I have a friend who has been doing just that... usually with his company he manages a handful of projects where a bulk of the development is outsourced overseas. This past year, he's outpaced the 6 devs he's had working on misc projects just with his own efforts and AI. Most of this being a relatively unique combination of UX with features that are less common.

He's using AI with note-taking apps for meetings to enhance notes and flesh out technology ideas at a higher level, then refining those ideas into working experiments.

It's actually impressive to see. My personal experience has been far more disappointing to say the least. I can't speak to the code quality, consistency or even structure in terms of most people being able to maintain such applications though. I've asked to shadow him through a few of his vibe coding sessions to see his workflow. It feels rather alien to me, again my experience is much more disappointing in having to correct AI errors.

replies(1): >>44725937 #

39. Uehreka ◴[29 Jul 25 15:48 UTC] No.44724845{3}[source]▶

>>44723867 #

> Models don't emit something they don't know. They remix and rewrite what they know. There's no invention, just recall...

People really need to stop saying this. I get that it was the Smart Guy Thing To Say in 2023, but by this point it’s pretty clear that it’s not true in any way that matters for most practical purposes.

Coding LLMs have clearly been trained on conversations where a piece of code is shown, a transformation is requested (rewrite this from Python to Go), and then the transformed code is shown. It’s not that they’re just learning codebases, they’re learning what working with code looks like.

Thus you can ask an LLM to refactor a program in a language it has never seen, and it will “know” what refactoring means, because it has seen it done many times, and it will stand a good chance of doing the right thing.

That’s why they’re useful. They’re doing something way more sophisticated than just “recombining codebases from their training data”, and anyone chirping 2023 sound bites is going to miss that.

replies(2): >>44731840 #>>44739406 #

40. devmor ◴[29 Jul 25 15:49 UTC] No.44724850{7}[source]▶

>>44724629 #

> I don't understand. For code, why would I want to remix existing code snippets?

That is what you're doing already. You're just relying on a vector compression and search engine to hide it from you and hoping the output is what you expect, instead of having it direct you to where it remixed those snippets from so you can see how they work to start with and make sure it's properly implemented from the get-go.

We all want code that works, but understanding that code is a critical part of that for anything but a throw-away one time use script.

I don't really get this desire to replace critical thought with hoping and testing. It sounds like the pipe dream of a middle manager, not a tool for a programmer.

replies(1): >>44725565 #

41. cchance ◴[29 Jul 25 15:53 UTC] No.44724902[source]▶

>>44723664 #

Have you used the internet? That's how the internet looks: they're all fuckin React with the same layouts and styles, 90% shadcn lol

42. throwworhtthrow ◴[29 Jul 25 15:58 UTC] No.44724962{6}[source]▶

>>44724497 #

LLMs still give subpar results with ffmpeg. For example, when I asked Sonnet to trim a long video with ffmpeg, it put the input file parameter before the start time parameter, which triggers an unnecessary decode of the video file. [1]

Sure, use the LLM to get over the initial hump. But ffmpeg's no exception to the rule that LLMs produce subpar code. It's worth spending a couple minutes reading the docs to understand what it did so you can do it better, and unassisted, next time.

[1] https://ffmpeg.org/ffmpeg.html#:~:text=ss%20position
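The ordering point can be sketched as follows. This is an illustration under assumptions, not the command Sonnet produced: the helper name `trim_command` and the file names are hypothetical. Per the linked documentation, `-ss` given before `-i` seeks in the input, while `-ss` given after `-i` decodes and discards frames until the position is reached.

```python
def trim_command(video: str, start: float, duration: float, out: str,
                 input_seek: bool = True) -> list[str]:
    """Build an ffmpeg argument list that keeps `duration` seconds from `start`.

    input_seek=True  puts -ss before -i (seek in the input file).
    input_seek=False puts -ss after -i (decode and discard up to `start`).
    """
    seek = ["-ss", str(start)]
    cmd = ["ffmpeg"]
    if input_seek:
        cmd += seek + ["-i", video]
    else:
        cmd += ["-i", video] + seek
    cmd += ["-t", str(duration), out]  # -t limits the length of the output
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(trim_command("long.mp4", 3600, 30, "clip.mp4")))
    print(" ".join(trim_command("long.mp4", 3600, 30, "clip.mp4", input_seek=False)))
```

The two variants differ only in where `-ss` sits relative to `-i`, which is exactly the detail that is easy to miss when a command is copied without reading the docs.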

replies(1): >>44725343 #

43. sarchertech ◴[29 Jul 25 16:18 UTC] No.44725223{6}[source]▶

>>44724200 #

It will 100% have something in its training set discussing a human doing this and will almost definitely spit out something similar.

replies(1): >>44732015 #

44. CamperBob2 ◴[29 Jul 25 16:27 UTC] No.44725343{7}[source]▶

>>44724962 #

That says more about suboptimal design on ffmpeg's part than it does about the LLM. Most humans can't deal with ffmpeg command lines, so it's not surprising that the LLM misses a few tricks.

replies(1): >>44725912 #

45. stavros ◴[29 Jul 25 16:43 UTC] No.44725565{8}[source]▶

>>44724850 #

I don't understand your point. You seem to be saying that we should be getting code from the source, then adapting it to our project ourselves, instead of getting adapted code to begin with.

I'm going to review the code anyway, why would I not want to save myself some of the work? I can "see how they work" after the LLM gives them to me just fine.

replies(1): >>44726844 #

46. nottorp ◴[29 Jul 25 17:13 UTC] No.44725912{8}[source]▶

>>44725343 #

Had an LLM generate 3 lines of working C++ code that was "only" one order of magnitude slower than what I edited the code to in 10 minutes.

If you're happy with results like that, sure, LLMs miss "a few tricks"...

replies(1): >>44726406 #

47. quesera ◴[29 Jul 25 17:14 UTC] No.44725920{8}[source]▶

>>44724660 #

Ffmpeg is genuinely complicated! And the CLI is convoluted (in justifiable, and unfortunate ways).

But if you approach ffmpeg from the perspective of "I know this is possible", you are always correct, and can almost always reach the "how" in a handful of minutes.

Whether that's worth it or not, will vary. :)

48. nottorp ◴[29 Jul 25 17:15 UTC] No.44725937{4}[source]▶

>>44724841 #

Is this the same person who posted about launching 17 "products" in one year a few days ago on HN? :)

replies(1): >>44728523 #

49. Eggpants ◴[29 Jul 25 17:37 UTC] No.44726179{4}[source]▶

>>44724181 #

It’s a lossy text compression technique. It’s clever applied statistics. Basically an advanced association rules algorithm which has been around for decades but modified to consider order and relative positions.

There is no understanding, regardless of the wants of all the capital investors in this domain.

replies(3): >>44726653 #>>44726720 #>>44728418 #

50. ben_w ◴[29 Jul 25 17:49 UTC] No.44726328{7}[source]▶

>>44724628 #

I remember when I was a kid, people asking a teacher how to spell a word, and the answer was generally "look it up in a dictionary"… which you can only do if you already have a shortlist of possible spellings.

*nix man pages are the same: if you already know which tool can solve your problem, they're easy to use. But you have to already have a shortlist of tools that can solve your problem, before you even know which man pages to read.

replies(2): >>44729432 #>>44734259 #

51. ben_w ◴[29 Jul 25 17:56 UTC] No.44726406{9}[source]▶

>>44725912 #

You don't have to leave LLM code alone, it's fine to change it — unless, I guess, you're doing some kind of LLM vibe-code-golfing?

But this does remind me of a previous co-worker. Wrote something to convert from a custom data store to a database, his version took 20 minutes on some inputs. Swore it couldn't possibly be improved. Obviously ridiculous because it didn't take 20 minutes to load from the old data store, nor to load from the new database. Over the next few hours of looking at very mediocre code, I realised it was doing an unnecessary O(n^2) check, confirmed with the CTO it wasn't business-critical, got rid of it, and the same conversion on the same data ran in something like 200ms.

Over a decade before LLMs.
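To illustrate why an accidental O(n^2) check hurts so much, here is a hypothetical example (the actual check is not described above, and in that case it was removed outright rather than optimized): a pairwise duplicate scan next to an equivalent set-based one.

```python
def has_duplicates_quadratic(keys: list) -> bool:
    """Compare every pair of keys: roughly n*n/2 comparisons in the worst case."""
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            if keys[i] == keys[j]:
                return True
    return False


def has_duplicates_linear(keys: list) -> bool:
    """Remember keys already seen in a set: one pass, assuming keys are hashable."""
    seen = set()
    for key in keys:
        if key in seen:
            return True
        seen.add(key)
    return False


if __name__ == "__main__":
    sample = list(range(2000)) + [1999]  # one duplicate at the very end
    print(has_duplicates_quadratic(sample), has_duplicates_linear(sample))
```

Both functions return the same answer; only the amount of work differs, and the gap widens quickly as the input grows.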

replies(1): >>44726438 #

52. nottorp ◴[29 Jul 25 17:59 UTC] No.44726438{10}[source]▶

>>44726406 #

We all do that, sometimes where it’s time critical sometimes where it isn’t.

But I keep being told “AI” is the second coming of Ahura Mazda so it shouldn’t do stuff like that right?

replies(2): >>44726777 #>>44727506 #

53. simonw ◴[29 Jul 25 18:18 UTC] No.44726653{5}[source]▶

>>44726179 #

I don't care if it can "understand" anything, as long as I can use it to achieve useful things.

replies(1): >>44726747 #

54. ◴[29 Jul 25 18:23 UTC] No.44726720{5}[source]▶

>>44726179 #

55. Eggpants ◴[29 Jul 25 18:26 UTC] No.44726747{6}[source]▶

>>44726653 #

“useful things“ like poorly drawing birds on bikes? ;)

(I have much respect for what you have done and are currently doing, but you did walk right into that one)

replies(1): >>44729114 #

56. mr_toad ◴[29 Jul 25 18:29 UTC] No.44726775{3}[source]▶

>>44723867 #

> They remix and rewrite what they know. There's no invention, just recall...

If they only recalled they wouldn’t “hallucinate”. What’s a lie if not an invention? So clearly they can come up with data that they weren’t trained on, for better or worse.

replies(1): >>44727316 #

57. CamperBob2 ◴[29 Jul 25 18:29 UTC] No.44726777{11}[source]▶

>>44726438 #

"I'm taking this talking dog right back to the pound. It told me to short NVDA, and you should see the buffer overflow bugs in the C++ code it wrote. Totally overhyped. I don't get it."

replies(1): >>44726861 #

58. devmor ◴[29 Jul 25 18:37 UTC] No.44726844{9}[source]▶

>>44725565 #

The work that you are "saving" is the work of using your brain to determine the solution to the problem. Whatever the LLM gives you doesn't have a context it is used in other than your prompt - you don't even know what it does until after you evaluate it.

If you instead have a set of sources related to your problem, they immediately come with context, usage and in many cases, developer notes and even change history to show you mistakes and adaptations.

You're ultimately creating more work for yourself* by trying to avoid work, and possibly ending up with an inferior solution in the process. Where is your sense of efficiency? Where is your pride as an intellectual?

* Yes, you are most likely creating more work for yourself even if you think you are capable of telling otherwise. [1]

1. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

replies(2): >>44726914 #>>44727149 #

59. devmor ◴[29 Jul 25 18:38 UTC] No.44726857{8}[source]▶

>>44724615 #

I'm not really sure what you're saying an LLM did in this case. Inspired a lost sense of curiosity?

replies(3): >>44727355 #>>44727394 #>>44727490 #

60. nottorp ◴[29 Jul 25 18:39 UTC] No.44726861{12}[source]▶

>>44726777 #

"We hear you have been calling our deity a talking dog. Please enter the red door for reeducation."

61. stavros ◴[29 Jul 25 18:44 UTC] No.44726914{10}[source]▶

>>44726844 #

Thanks for the concern, but I'm perfectly able to judge for myself whether I'm creating more work or delivering an inferior product.

62. simonw ◴[29 Jul 25 19:09 UTC] No.44727149{10}[source]▶

>>44726844 #

It sounds like you care deeply about learning as much as you can. I care about that too.

I would encourage you to consider that even LLM-generated code can teach you a ton of useful new things.

Go read the source code for my dumb, zero-effort space invaders clone: https://github.com/simonw/tools/blob/main/space-invaders-GLM...

There's a bunch of useful lessons to be picked up even from that!

- Examples of CSS gradients, box shadows and flexbox layout

- CSS keyframe animation

- How to implement keyboard events in JavaScript

- A simple but effective pattern for game loops against a Canvas element, using requestAnimationFrame

- How to implement basic collision detection

If you've written games like this before these may not be new to you, but I found them pretty interesting.
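As one example of the last item, basic collision detection for a game like this is often an axis-aligned bounding box test. The sketch below is in Python and is not taken from the linked file; the `Rect` type and the coordinates are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Axis-aligned rectangle: (x, y) is the top-left corner."""
    x: float
    y: float
    width: float
    height: float


def overlaps(a: Rect, b: Rect) -> bool:
    """Return True if the two rectangles overlap (touching edges do not count)."""
    return (
        a.x < b.x + b.width
        and a.x + a.width > b.x
        and a.y < b.y + b.height
        and a.y + a.height > b.y
    )


if __name__ == "__main__":
    bullet = Rect(10, 10, 4, 10)   # hypothetical bullet
    invader = Rect(8, 15, 30, 20)  # hypothetical invader
    print(overlaps(bullet, invader))
```

In a game loop, a check like this would typically run once per frame for each bullet and invader pair.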

63. 0x457 ◴[29 Jul 25 19:26 UTC] No.44727316{4}[source]▶

>>44726775 #

Because internally, there isn't a difference between a correctly "recalled" token and an incorrectly recalled (hallucinated) one.

replies(1): >>44734656 #

64. Philpax ◴[29 Jul 25 19:30 UTC] No.44727355{9}[source]▶

>>44726857 #

Translated a vague natural language query ("cli, extract frame 13s into video") into something immediately actionable with specific examples and explanations, surfacing information that I would otherwise not know how to search for.

That's what I've done with my ffmpeg LLM queries, anyway - can't speak for simonw!

replies(1): >>44728128 #

65. 0x457 ◴[29 Jul 25 19:34 UTC] No.44727394{9}[source]▶

>>44726857 #

LLM somewhat understood ffmpeg documentation? Not sure what is not clear here.

66. simonw ◴[29 Jul 25 19:45 UTC] No.44727490{9}[source]▶

>>44726857 #

My general point is that people say things like "yeah, but this one study showed that programmers over-estimate the productivity gain they get from LLMs so how can you really be sure?"

Meanwhile I've spent the past two years constantly building and implementing things I never would have done because of the reduction in friction LLM assistance gives me.

I wrote about this first two years ago - AI-enhanced development makes me more ambitious with my projects - https://simonwillison.net/2023/Mar/27/ai-enhanced-developmen... - when I realized I was hacking on things with tech like AppleScript and jq that I'd previously avoided.

It's hard to measure the productivity boost you get from "wouldn't have built that thing" to "actually built that thing".

replies(1): >>44739153 #

67. ben_w ◴[29 Jul 25 19:47 UTC] No.44727506{11}[source]▶

>>44726438 #

> Ahura Mazda

Niche reference, I like it.

But… I only hear of scammers who say, and psychosis sufferers who think, LLMs are *already* that competent.

Future AI? Sure, lots of sane-seeming people also think it could go far beyond us. Special purpose ones have in very narrow domains. But current LLMs are only good enough to be useful and potentially economically disruptive, they're not even close to wildly superhuman like Stockfish is.

replies(1): >>44728116 #

68. wizzwizz4 ◴[29 Jul 25 20:28 UTC] No.44727900[source]▶

>>44724690 #

It can usually tweak things, if given specific instruction, but it doesn't know when to refactor (and can't reliably preserve functionality when it does), so the program gets further and further away from something sensible until it can't make edits any more.

replies(1): >>44727913 #

69. simonw ◴[29 Jul 25 20:30 UTC] No.44727913{3}[source]▶

>>44727900 #

For serious projects you can address that by writing (or having it write) unit tests along the way, that way it can run in a loop and avoid breaking existing functionality when it adds new changes.
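A minimal sketch of what that could look like, assuming a hypothetical `move_player` helper (not from the actual game) and Python's standard `unittest` module. The tests pin down current behaviour so a later change that breaks it fails immediately.

```python
import unittest


def move_player(x: float, dx: float, screen_width: float, player_width: float) -> float:
    """Move the player horizontally, clamped so it stays fully on screen."""
    return max(0, min(screen_width - player_width, x + dx))


class MovePlayerTests(unittest.TestCase):
    def test_moves_within_bounds(self):
        self.assertEqual(move_player(100, 5, 800, 50), 105)

    def test_clamps_at_left_edge(self):
        self.assertEqual(move_player(2, -5, 800, 50), 0)

    def test_clamps_at_right_edge(self):
        self.assertEqual(move_player(748, 5, 800, 50), 750)


def run_tests() -> unittest.TestResult:
    """Run the suite and return the result instead of exiting the process."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MovePlayerTests)
    return unittest.TextTestRunner().run(suite)


if __name__ == "__main__":
    run_tests()
```

An agent running in a loop can re-run a suite like this after every edit and treat any failure as a signal to revert or fix the change.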

replies(1): >>44728283 #

70. CamperBob2 ◴[29 Jul 25 20:49 UTC] No.44728116{12}[source]▶

>>44727506 #

Sure. If you ask ChatGPT to play chess, it will put up an amateur-level effort at best. Stockfish will indeed wipe the floor with it. But what happens when you ask Stockfish to write a Space Invaders game?

ChatGPT will get better at chess over time. Stockfish will not get better at anything except chess. That's kind of a big difference.

replies(1): >>44728303 #

71. wizzwizz4 ◴[29 Jul 25 20:50 UTC] No.44728128{10}[source]▶

>>44727355 #

DuckDuckGo search results for "cli, extract frame 13s into video" (no quotes):

• https://stackoverflow.com/questions/10957412/fastest-way-to-...

• https://superuser.com/questions/984850/linux-how-to-extract-...

• https://www.aleksandrhovhannisyan.com/notes/video-cli-cheat-...

• https://www.baeldung.com/linux/ffmpeg-extract-video-frames

• https://ottverse.com/extract-frames-using-ffmpeg-a-comprehen...

Search engines have been able to translate "vague natural language queries" into search results for a decade, now. This pre-existing infrastructure accounts for the vast majority of ChatGPT's apparent ability to find answers.

replies(1): >>44729497 #

72. greesil ◴[29 Jul 25 21:08 UTC] No.44728283{4}[source]▶

>>44727913 #

Okay ask it to write unit tests for space invaders next time :)

73. ben_w ◴[29 Jul 25 21:11 UTC] No.44728303{13}[source]▶

>>44728116 #

> ChatGPT will get better at chess over time

Oddly, LLMs got worse at specifically chess: https://dynomight.net/chess/

But even to the general point, there's absolutely no agreement how much better the current architectures can ultimately get, nor how quickly they can get there.

Do they have potential for unbounded improvements, albeit at exponential cost for each linear incremental improvement? Or will they asymptotically approach someone with 5 years experience, 10 years experience, a lifetime of experience, or a higher level than any human?

If I had to bet, I'd say current models have an asymptotic growth converging to a merely "ok" performance; and separately claim that even if they're actually unbounded with exponential cost for linear returns, we can't afford the training cost needed to make them act like someone with even just 6 years professional experience in any given subject.

Which is still a lot. Especially as it would be acting like it had about as much experience in every other subject at the same time. Just… not a literal Ahura Mazda.

replies(1): >>44728752 #

74. CamperBob2 ◴[29 Jul 25 21:24 UTC] No.44728418{5}[source]▶

>>44726179 #

> It’s a lossy text compression technique.

That is a much, much bigger deal than you make it sound like.

Compression may, in fact, be all we need. For that matter, it may be all there is.

75. tracker1 ◴[29 Jul 25 21:35 UTC] No.44728523{5}[source]▶

>>44725937 #

No, he's been working on building a larger eLearning solution with some interesting workflow analytics around courseware evaluation and grading. He's been involved in some of the newer LRS specifications and some implementation details to bridge training as well as real world exposure scenarios. Working a lot with first responders, incident response training etc.

I've worked with him off and on for years from simulating aircraft diagnostics hardware to incident command simulation and setting up core infrastructure for F100 learning management backends.

76. CamperBob2 ◴[29 Jul 25 21:59 UTC] No.44728752{14}[source]▶

>>44728303 #

> If I had to bet, I'd say current models have an asymptotic growth converging to a merely "ok" performance

(Shrug) People with actual money to spend are betting twelve figures that you're wrong.

Should be fun to watch it shake out from up here in the cheap seats.

replies(2): >>44728976 #>>44732191 #

77. ben_w ◴[29 Jul 25 22:26 UTC] No.44728976{15}[source]▶

>>44728752 #

Nah, trillion dollars is about right for "ok". Percentage point of the global economy in cost, automate 2 percent and get a huge margin. We literally set more than that on actual fire each year.

For "pretty good", it would be worth 14 figures, over two years. The global GDP is 14 figures. Even if this only automated 10% of the economy, it pays for itself after a decade.

For "Ahura Mazda", it would easily be worth 16 figures, what with that being the principal God and god of the sky in Zoroastrianism, and the only reason it stops at 16 is the implausibility of people staying organised for longer to get it done.

78. msephton ◴[29 Jul 25 22:43 UTC] No.44729114{7}[source]▶

>>44726747 #

The pelican on a bicycle is a very useful test.

replies(1): >>44733323 #

79. stolencode ◴[29 Jul 25 22:55 UTC] No.44729201[source]▶

>>44723707 #

It's amazing that none of you even try to falsify your claims anymore. You can literally just put some of the code in a search engine and find the prior art example:

https://www.web-leb.com/en/code/2108

Your "AI tools" are just "copyright whitewashing machines."

These kinds of comments are really ignoring reality.

80. adastra22 ◴[29 Jul 25 23:29 UTC] No.44729432{8}[source]▶

>>44726328 #

That’s what GNU info is for, of course.

81. stelonix ◴[29 Jul 25 23:40 UTC] No.44729497{11}[source]▶

>>44728128 #

Yet the interface is fundamentally different, the output feels much more like bro pages[0] and it's within a click of clipboarding, one CTRL V away from extracting the 13th second screenshot. I've been using Google the past 24 years and my google-fu has always left people amazed; yet I can no longer bother to go through Stack Exchange's results when an LLM not only spits it out so nicely, but also does the equivalent of an explainshell[1].

Not comparable and I fail to see why going through Google's ads/results would be better?

[0] https://github.com/pombadev/bropages

[1] https://github.com/idank/explainshell

replies(1): >>44731050 #

82. lexh ◴[30 Jul 25 01:22 UTC] No.44730126{8}[source]▶

>>44724660 #

To be a little more fair... that example is tidily slotted into the EXAMPLES section, under the heading "You can extract images from a video, or create a video from many images".

I don't think most people read the man pages top to bottom. And even if they did, then for as much grief as you're giving ffmpeg, llm has an even larger burden... no man page and the docs weigh in at over 8k lines.

I get the general point that ffmpeg is a powerful, complex tool... but this is a weird fight to pick.

replies(1): >>44730154 #

83. simonw ◴[30 Jul 25 01:29 UTC] No.44730154{9}[source]▶

>>44730126 #

I could not be more confident that "ffmpeg is difficult to figure out" is not a weird fight to pick. It's notorious!

84. wizzwizz4 ◴[30 Jul 25 05:11 UTC] No.44731050{12}[source]▶

>>44729497 #

DuckDuckGo insists on shoving "AI Assist" entries in its results, so I have a reasonable idea of how often LLMs are completely wrong even given search results. The answer's still "more than one time in five".

I did not suggest using Google Search (the company's on record as deliberately making Google Search worse), but there are other search engines. My preferred search engines don't do the fancy "interpret natural language queries" pre-processing, because I'm quite good at doing that in my head and often want to research niche stuff, but there are many still-decent search engines that do, and don't have ads in the results.

Heck, you can even pay for a good search engine! And you can have it redirect you to the relevant section of the top search result automatically: Google used to call this "I'm feeling lucky!" (although it was before URI text fragments, so it would just send you to the top of the page). All the properties you're after, much more cheaply, and you keep the information about provenance, and your answer is more-reliably accurate.

replies(1): >>44736573 #

85. tw1984 ◴[30 Jul 25 06:38 UTC] No.44731430[source]▶

>>44723664 #

most human generated methods look the same. in fact, in SWE, we reward people for generating code that looks & feels the same, they call it "work as a team".

86. otabdeveloper4 ◴[30 Jul 25 07:14 UTC] No.44731612{8}[source]▶

>>44724660 #

The correct solution here would have been to feed the man page to an LLM summarizer.

Alas instead of correct and easy solutions to problems we are focused on sci-fi robot assistant bullshit.

87. cztomsik ◴[30 Jul 25 07:58 UTC] No.44731840{4}[source]▶

>>44724845 #

I don't know, I have mixed-bag experiences and it's not really improving. It greatly varies depending on the programming language and the kind of problem which I'm trying to solve.

The tasks where it works great are things I'd expect to be part of the dataset (github, blog posts), or they are "classic" LM tasks (understand + copy-paste/patch). The actual intelligence, in my opinion, is still very limited. So while it's true it's not "just recall" it still might be "mostly recall".

BTW: Copy-paste is something which works great in any attention-based model. On the other hand, models like RWKV usually fail and are not suited for this IMHO (but I think they have much better potential for the AGI)

88. numpad0 ◴[30 Jul 25 08:19 UTC] No.44731957{5}[source]▶

>>44724137 #

humans also eat

89. fc417fc802 ◴[30 Jul 25 08:32 UTC] No.44732015{7}[source]▶

>>44725223 #

That's a good point but all it means is that we can't test the hypothesis one way or the other due to never being entirely certain that a given task isn't anywhere in the training data. Supposing that "AIs can't" is then just as invalid as supposing that "AIs can".

90. airspresso ◴[30 Jul 25 09:03 UTC] No.44732184{5}[source]▶

>>44724770 #

My preferred approach to avoid that outcome is to divide & conquer the problem. Ask the LLM to implement each small bit in the order you'd implement it yourself given what you know about the codebase.

91. nottorp ◴[30 Jul 25 09:04 UTC] No.44732191{15}[source]▶

>>44728752 #

> People with actual money to spend are betting

... but those "people with actual money to spend" have burned money on fads before. Including on "AI", several times before the current hysterics.

If you're a good actor/psychologist, it's probably a good business model to figure out how to get VC money and how to justify your startup failing so they give you money for the next startup.

92. dfedbeef ◴[30 Jul 25 12:28 UTC] No.44733323{8}[source]▶

>>44729114 #

Yeah what if you need a drawing of a pelican on a bicycle

93. dfedbeef ◴[30 Jul 25 12:36 UTC] No.44733418{6}[source]▶

>>44724497 #

Was the answer:

ffmpeg -ss 00:00:13 -i myvideo.avi -frames:v 1 myimage.jpeg

Because this is on stack overflow and it took maybe one second to find.

I've found reading the man page for a tool is usually a better way to learn what a tool can do for you now and also in the future.
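
For anyone scripting this, the command quoted above can be assembled programmatically. This is only a sketch: `frame_extract_command` is a hypothetical helper made up for illustration, the file names are the placeholders from the command above, and it only builds the argument list (using the `-ss`, `-i` and `-frames:v` options already shown) without running ffmpeg.

```python
import shlex


def frame_extract_command(video, seconds, image):
    """Build an ffmpeg argument list that saves the single frame at `seconds`."""
    # Format whole seconds as HH:MM:SS for the -ss seek option.
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    timestamp = f"{hours:02d}:{minutes:02d}:{secs:02d}"
    return ["ffmpeg", "-ss", timestamp, "-i", video, "-frames:v", "1", image]


cmd = frame_extract_command("myvideo.avi", 13, "myimage.jpeg")
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, assuming ffmpeg is installed and on the PATH.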

replies(1): >>44733619 #

94. kamranjon ◴[30 Jul 25 12:55 UTC] No.44733619{7}[source]▶

>>44733418 #

This is the rub for me… people are so quick to forget the original source for a lot of the data these models were trained on, and how easy and useful these platforms were. Now Google will summarize this question for you in an AI overview before you even land on Stack Overflow. It’s killing the network effect of the open web and destroying our crowd sourced platforms in favor of a lossy compression algorithm that will eventually be regurgitating its own entrails.

replies(1): >>44733819 #

95. dfedbeef ◴[30 Jul 25 13:17 UTC] No.44733819{8}[source]▶

>>44733619 #

Well, maybe. People will just stop using them and will make fun of people who do. You can only bullshit people for so long.

96. 082349872349872 ◴[30 Jul 25 13:55 UTC] No.44734259{8}[source]▶

>>44726328 #

man -k (or apropos)

replies(1): >>44739099 #

97. pbhjpbhj ◴[30 Jul 25 14:24 UTC] No.44734656{5}[source]▶

>>44727316 #

Depends on the training? If there was eg RLHF then those connections are stronger and more likely; that's a difference (but not a category difference).

replies(1): >>44759348 #

98. delian66 ◴[30 Jul 25 16:53 UTC] No.44736573{13}[source]▶

>>44731050 #

> Heck, you can even pay for a good search engine!

Can you recommend one?

99. ben_w ◴[30 Jul 25 20:26 UTC] No.44739099{9}[source]▶

>>44734259 #

`apropos` would itself be an example of a *nix tool that I didn't know existed and therefore wouldn't have known to find out more about.

100. aschobel ◴[30 Jul 25 20:31 UTC] No.44739153{10}[source]▶

>>44727490 #

"You can just do things".

Agreed on all fronts. jq and AppleScript are a total syntax mystery to me, but now I use them all the time since Claude Code has figured them out.

It's so powerful knowing the shape of a solution and not having to care about the details.

101. yencabulator ◴[30 Jul 25 20:57 UTC] No.44739406{4}[source]▶

>>44724845 #

> It’s not that they’re just learning codebases, they’re learning what working with code looks like.

Working in any not-in-training-set environment very quickly shows the shortcomings of this belief.

For example, Cloudflare Workers is V8 but it sure ain't Node, and the local sqlite in a Durable Object has a sync API with very different guarantees than a typical client-server SQL setup.

Even in a more standard setting, it's really hard to even get an LLM to use the current-stable APIs when its training data contains now-deprecated examples. Your local rules, llms.txt mentions, corrections etc slip out of the context pretty fast and it goes back to trained data.

The LLM can perhaps "read any code" but it really really prefers writing only code that was in its training set.

102. FeepingCreature ◴[31 Jul 25 08:12 UTC] No.44743498{5}[source]▶

>>44724530 #

Nothing is truly and completely new. I'm not formulating my requirements in an extinct language. My point is "filling in the blanks" and "do new things" are a spectrum.

LLMs have their limits, but they really can understand and productively contribute to programs that achieve a purpose that no program on the internet has done yet. What they are doing is not interpolation at the highest level. It may be interpolation/extrapolation at a lower level, but this goes for any skill learnt by anyone ever.

103. 0x457 ◴[01 Aug 25 16:53 UTC] No.44759348{6}[source]▶

>>44734656 #

Yes, but I thought we're talking about category difference.

Proper RLHF surely boosts "predicted next token until it couldn't" to feel more like "actually recalled".
