
435 points crawshaw | 2 comments
kgeist ◴[] No.43998994[source]
Today I tried "vibe-coding" for the first time using GPT-4o and 4.1. I did it manually - just feeding compilation errors, warnings, and suggestions in a loop via the canvas interface. The file was small, around 150 lines.

It didn't go well. I started with 4o:

- It used a deprecated package.

- After I pointed that out, it didn't update all usages - so I had to fix them manually.

- When I suggested a small logic change, it completely broke the syntax (we're talking "foo() } return )))" kind of broken) and never recovered. I gave it the raw compilation errors over and over again, but it didn't even register the syntax was off - just rewrote random parts of the code instead.

- Then I thought, "maybe 4.1 will be better at coding" (as advertised). But 4.1 refused to use the canvas at all. It just explained what I could change - as in, you go make the edits yourself.

- After some pushing, I got it to use the canvas and return the full code. Except it didn't - it gave me a truncated version of the code with comments like "// omitted for brevity".

That's when I gave up.

Do agents somehow fix this? Because as it stands, the experience feels completely broken. I can't imagine giving this access to bash; it sounds way too dangerous.
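
For context, the loop I was running by hand is, as I understand it, the thing agents are supposed to automate. A rough sketch of it, assuming the openai Python SDK; the model name, build command, and file name here are just placeholders, not what I actually used:

```python
# Minimal edit-compile-fix loop (sketch only; swap in your own build command and model).
import subprocess
from openai import OpenAI

client = OpenAI()          # reads OPENAI_API_KEY from the environment
SOURCE = "main.go"         # placeholder: the file being repaired

def build_errors() -> str:
    """Run the build and return its error output (empty string if it compiled)."""
    result = subprocess.run(["go", "build", "./..."],
                            capture_output=True, text=True)
    return result.stderr.strip()

code = open(SOURCE).read()
for _ in range(5):                      # cap the iterations so it can't loop forever
    errors = build_errors()
    if not errors:                      # clean build: stop
        break
    resp = client.chat.completions.create(
        model="gpt-4o",                 # placeholder model
        messages=[
            {"role": "system",
             "content": "Return only the complete corrected file, no prose, no fences."},
            {"role": "user",
             "content": f"Fix these build errors:\n{errors}\n\nCurrent file:\n{code}"},
        ],
    )
    code = resp.choices[0].message.content
    open(SOURCE, "w").write(code)       # write back and try the build again
```

The failure mode I hit is exactly the one this sketch doesn't guard against: when the model returns truncated or syntactically broken code, the loop just keeps feeding the wreckage back in.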

replies(30): >>43999028 #>>43999055 #>>43999097 #>>43999162 #>>43999169 #>>43999248 #>>43999263 #>>43999272 #>>43999296 #>>43999300 #>>43999358 #>>43999373 #>>43999390 #>>43999401 #>>43999402 #>>43999497 #>>43999556 #>>43999610 #>>43999916 #>>44000527 #>>44000695 #>>44001136 #>>44001181 #>>44001568 #>>44001697 #>>44002185 #>>44002837 #>>44003198 #>>44003824 #>>44008480 #
nico ◴[] No.43999248[source]
4o and 4.1 are not very good at coding

My best results are usually with o4-mini-high; o3 is sometimes pretty good

I personally don’t like the canvas. I prefer the output in the chat

And a lot of times I say: provide the full code for this file, or provide a drop-in replacement (when I don’t want to deal with all the diffs). But usually at around 300-400 lines of code it starts getting bad, and then I need to refactor to break stuff up into multiple files (unless I can focus on just one method inside a file).

replies(2): >>43999569 #>>43999689 #
manmal ◴[] No.43999569[source]
o3 is shockingly good actually. I can’t use it often due to rate limiting, so I save it for the odd occasion. Today I asked it how I could integrate a tree of Swift binary packages within an SDK, and detect internal version clashes, and it gave a very well-researched and sensible overview. It also gave me a new idea that I’ll try.
replies(2): >>44000022 #>>44000083 #
1. kenjackson ◴[] No.44000022[source]
I use o3 for anything math or coding related. 4o is good for things like, "my knee hurts when I do this and that -- what might it be?"
replies(1): >>44002814 #
2. TeMPOraL ◴[] No.44002814[source]
In ChatGPT, at this point I use 4o pretty much only for image generation; it's the one feature that's unique to it and is mind-blowingly good. For everything else, I default to o3.

For coding, I stick to Claude 3.5 / 3.7 and recently Gemini 2.5 Pro. I sometimes use o3 in ChatGPT when I can't be arsed to fire up Aider, or really need to use its search features to figure out how to do something (e.g. pinouts for some old TFT screens for ESP32 and Raspberry Pi, most recently).