
kgeist:
Today I tried "vibe-coding" for the first time using GPT-4o and 4.1. I did it manually - just feeding compilation errors, warnings, and suggestions in a loop via the canvas interface. The file was small, around 150 lines.

It didn't go well. I started with 4o:

- It used a deprecated package.

- After I pointed that out, it didn't update all usages - so I had to fix them manually.

- When I suggested a small logic change, it completely broke the syntax (we're talking "foo() } return )))" kind of broken) and never recovered. I gave it the raw compilation errors over and over again, but it didn't even register the syntax was off - just rewrote random parts of the code instead.

- Then I thought, "maybe 4.1 will be better at coding" (as advertised). But 4.1 refused to use the canvas at all. It just explained what I could change - as in, you go make the edits.

- After some pushing, I got it to use the canvas and return the full code. Except it didn't - it gave me a truncated version of the code with comments like "// omitted for brevity".

That's when I gave up.

Do agents somehow fix this? Because as it stands, the experience feels completely broken. I can't imagine giving this access to bash, sounds way too dangerous.
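
The loop described above (compile, feed the errors back, retry) is exactly what agent harnesses automate. A minimal sketch, assuming a hypothetical ask_llm wrapper around whatever chat-completion API you use - nothing below is a real vendor API:

    import subprocess

    def ask_llm(prompt: str) -> str:
        """Hypothetical wrapper around a chat-completion API (OpenAI, Anthropic, ...)."""
        raise NotImplementedError

    def fix_until_compiles(path: str, max_rounds: int = 5) -> bool:
        for _ in range(max_rounds):
            result = subprocess.run(
                ["go", "build", path],   # substitute your compiler here
                capture_output=True, text=True,
            )
            if result.returncode == 0:
                return True              # clean build
            with open(path) as f:
                source = f.read()
            patched = ask_llm(
                "Fix ONLY the compilation errors below and return the complete file.\n"
                f"Errors:\n{result.stderr}\n\nFile:\n{source}"
            )
            with open(path, "w") as f:
                f.write(patched)         # overwrite and try again
        return False                     # give up after max_rounds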

1. codethief:
The other day I used the Cline plugin for VSCode with Claude to create an Android app prototype from "scratch", i.e. starting from the usual template given to you by Android Studio. It produced several thousand lines of code, there was not a single compilation error, and the app ended up doing exactly what I wanted – modulo a bug or two, which were caused not by the LLM's stupidity but by weird undocumented behavior of the rather arcane Android API in question. (Which is exactly why I wanted a quick prototype.)

After pointing out the bugs to the LLM, it successfully debugged them (with my help/feedback, i.e. I provided the output of the debug messages it had added to the code) and ultimately fixed them. The only downside was that I wasn't quite happy with the quality of the fixes (they were more like dirty hacks), but after another round or two of feedback we got there, too. I'm sure one could solve that more generally, by putting the agent writing the code in a loop with some other code reviewing agent.

2. cheema33:
> I'm sure one could solve that more generally, by putting the agent writing the code in a loop with some other code reviewing agent.

This x 100. I get so much better quality code if I have LLMs review each other's code and apply corrections. It is ridiculously effective.
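
A minimal sketch of that writer/reviewer loop, assuming a hypothetical call_model helper that routes to two different model backends (the model names and prompts are illustrative, not any real API):

    def call_model(model: str, prompt: str) -> str:
        """Hypothetical router to whichever vendor SDK backs `model`."""
        raise NotImplementedError

    def write_with_review(task: str, rounds: int = 2) -> str:
        # First draft from the "writer" model.
        code = call_model("writer-model", f"Write code for this task:\n{task}")
        for _ in range(rounds):
            # A different model critiques the draft.
            review = call_model(
                "reviewer-model",
                "Review this code for bugs and dirty hacks. "
                f"Reply LGTM if it is fine, else list concrete fixes:\n{code}",
            )
            if "LGTM" in review:
                break                    # reviewer is satisfied
            # Writer applies the review comments.
            code = call_model(
                "writer-model",
                "Apply these review comments and return the full file:\n"
                f"{review}\n\nCode:\n{code}",
            )
        return code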

3. lftl:
Can you elaborate a little more on your setup? Are you manually copying and pasting code from one LLM to another, or do you have some automated workflow for this?
4. suddenlybananas:
What was the app? It could plausibly be something that has an open source equivalent already in the training data.
5. htsh:
I have been doing this with Claude Code and OpenAI Codex and/or Cline. One of the three takes the first pass (usually Claude Code, sometimes Codex), then I have Cline with Gemini 2.5 do a "code review" and offer suggestions for fixes before they are applied.
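
A rough single-tool approximation of that two-pass flow, driven from Python via Claude Code's documented non-interactive -p (print) mode; the task wording and file name are illustrative only, and the commenter actually uses a different tool/model for the review step:

    import subprocess

    def claude_p(prompt: str, stdin: str = "") -> str:
        """Run `claude -p PROMPT` (Claude Code print mode) and return stdout."""
        result = subprocess.run(
            ["claude", "-p", prompt],
            input=stdin, capture_output=True, text=True,
        )
        return result.stdout

    # First pass: one agent makes the change (hypothetical task).
    claude_p("Implement the TODO in src/main.py and make the tests pass.")

    # Review pass: pipe the resulting diff to a second invocation acting
    # as reviewer, standing in for the separate review tool/model.
    diff = subprocess.run(["git", "diff"], capture_output=True, text=True).stdout
    print(claude_p("Code-review this diff and suggest concrete fixes.", stdin=diff))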