435 points by crawshaw | 9 comments
kgeist:
Today I tried "vibe-coding" for the first time using GPT-4o and 4.1. I did it manually - just feeding compilation errors, warnings, and suggestions in a loop via the canvas interface. The file was small, around 150 lines.

It didn't go well. I started with 4o:

- It used a deprecated package.

- After I pointed that out, it didn't update all usages - so I had to fix them manually.

- When I suggested a small logic change, it completely broke the syntax (we're talking "foo() } return )))" kind of broken) and never recovered. I gave it the raw compilation errors over and over again, but it didn't even register the syntax was off - just rewrote random parts of the code instead.

- Then I thought, "maybe 4.1 will be better at coding" (as advertised). But 4.1 refused to use the canvas at all. It just explained what I could change - as in, you go make the edits.

- After some pushing, I got it to use the canvas and return the full code. Except it didn't - it gave me a truncated version of the code with comments like "// omitted for brevity".

That's when I gave up.
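
The manual loop above, as a minimal Python sketch assuming the openai package (the prompt, file name, and retry count are illustrative; real replies often wrap code in markdown fences, which this ignores):

    # Minimal sketch of the manual "feed errors back" loop described above.
    # Assumes the openai package and OPENAI_API_KEY in the environment.
    import subprocess
    from openai import OpenAI

    client = OpenAI()
    messages = [{"role": "user", "content": "Write foo.py that does X."}]

    for _ in range(5):  # bounded retries instead of looping forever
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        code = reply.choices[0].message.content
        with open("foo.py", "w") as f:
            f.write(code)
        # Syntax-check the result; stderr carries the compiler's complaint.
        result = subprocess.run(["python", "-m", "py_compile", "foo.py"],
                                capture_output=True, text=True)
        if result.returncode == 0:
            break  # it compiles - stop looping
        messages.append({"role": "assistant", "content": code})
        messages.append({"role": "user",
                         "content": "Fix these compilation errors:\n" + result.stderr})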

Do agents somehow fix this? Because as it stands, the experience feels completely broken. I can't imagine giving this access to bash, sounds way too dangerous.

1. simonw:
"It used a deprecated package"

That's because models have training cut-off dates. It's important to take those into account when working with them: https://simonwillison.net/2025/Mar/11/using-llms-for-code/#a...

I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to look up the latest documentation.

You can tell it "look up the most recent version of library X and use that" and it will often work!
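
The same trick works outside the ChatGPT UI; a minimal sketch, assuming the openai package's Responses API and its web_search_preview tool (model and tool availability vary by account):

    # Sketch: let the model search the web for current docs before answering.
    from openai import OpenAI

    client = OpenAI()
    response = client.responses.create(
        model="o4-mini",
        tools=[{"type": "web_search_preview"}],
        input="Look up the most recent version of library X and write the code using it.",
    )
    print(response.output_text)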

I even used it for a frustrating upgrade recently - I pasted in some previous code and prompted this:

    This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it.

It did exactly what I asked: https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...

2. kgeist:
>That's because models have training cut-off dates

When I pointed out that it used a deprecated package, it agreed and even cited the correct version after which it was deprecated (way back in 2021). So it knows it's deprecated, but the next-token prediction (without reasoning or tools) still can't connect the dots when much of the training data (before 2021) uses that package as if it's still acceptable.

>I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to look up the latest documentation.

Thanks for the tip!

3. fragmede:
There's still skill involved in using an LLM for coding. In this case, o4-mini-high might do the trick, but the easier answer that works with other models is to include the high-level library documentation yourself as context, and it'll use that API.
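
A minimal sketch of that approach, assuming the openai package (the docs path and prompts are hypothetical placeholders):

    # Sketch: paste the library's current docs into the context so the
    # model targets the right API instead of a deprecated one.
    from openai import OpenAI

    docs = open("docs/library_api.md").read()  # the library's current docs
    client = OpenAI()
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Use only the API described in these docs:\n" + docs},
            {"role": "user", "content": "Port the following code to this library: ..."},
        ],
    )
    print(reply.choices[0].message.content)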
4. jmcpheron:
>I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to look up the latest documentation.

That is such a useful distinction. I like to think I'm keeping up with this stuff, but the '4o' versus 'o4' still throws me.

5. th0ma5 (replying to fragmede):
What besides anecdote makes you think a different model will be anything more than marginally better?
6. tptacek (replying to jmcpheron):
Model naming is absolutely maddening.
7. sagarpatil:
Context7 MCP solves this. Use it with Cursor/Windsurf.
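
For reference, Context7 is an MCP server that feeds up-to-date library docs to the model. A typical registration in Cursor's mcp.json might look like this (package name per the Context7 README; check it for the current invocation):

    {
      "mcpServers": {
        "context7": {
          "command": "npx",
          "args": ["-y", "@upstash/context7-mcp"]
        }
      }
    }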
8. mbesto:
> That's because models have training cut-off dates.

Which is precisely the issue with the idea of LLMs completely replacing human engineers: the model doesn't account for this kind of context unless a human tells it to.

9. simonw (replying to mbesto):
Right: the idea that LLMs are a replacement for human engineers is deeply flawed in my opinion.