The unreasonable effectiveness of an LLM agent loop with tool use

(sketch.dev)

435 points crawshaw | 1 comments | 15 May 25 19:33 UTC

_bin_ ◴[15 May 25 20:01 UTC] No.43998743

I've found sonnet-3.7 to be incredibly inconsistent. It can do very well but has a strong tendency to get off-track and run off and do weird things.

3.5 is better for this, ime. I hooked claude desktop up to an MCP server to fake claude-code less the extortionate pricing and it works decently. I've been trying to apply it for rust work; it's not great yet (still doesn't really seem to "understand" rust's concepts) but can do some stuff if you make it `cargo check` after each change and stop it if it doesn't.
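A minimal sketch of that `cargo check` gate, under one reading of the comment: stop as soon as a change leaves the crate failing to check. `check_passes` and `apply_with_gate` are hypothetical names, and each change is modeled as a plain Python callable; how edits actually arrive from Claude Desktop or an MCP server isn't described in the thread, so that part is left out.

```python
import subprocess
from typing import Callable, Iterable, Sequence


def check_passes(cmd: Sequence[str], cwd: str = ".") -> bool:
    """Run the check command in cwd; True when it exits with status 0."""
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return result.returncode == 0


def apply_with_gate(
    changes: Iterable[Callable[[], None]],
    cmd: Sequence[str] = ("cargo", "check"),
    cwd: str = ".",
) -> int:
    """Apply changes one at a time, re-running the check after each.

    Stops at the first change that leaves the check failing and returns
    how many changes were applied with the check still passing.
    """
    applied = 0
    for change in changes:
        change()  # hypothetical: whatever writes the model's edit to disk
        if not check_passes(cmd, cwd):
            break  # stop here instead of letting further edits pile up
        applied += 1
    return applied
```

The failing change is left on disk here; reverting it (for example with version control) would be a separate step that this sketch doesn't attempt.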

I expect something like o3-high is the best out there (aider leaderboards support this) either alone or in combination with 4.1, but tbh that's out of my price range. And frankly, I can't mentally get past paying a very high price for an LLM response that may or may not be useful; it leaves me incredibly resentful as a customer that your model can fail the task, requiring multiple "re-rolls", and you're passing that marginal cost to me.

replies(3): >>43998797 #>>43999022 #>>43999599 #

agilebyte ◴[15 May 25 20:07 UTC] No.43998797

>>43998743 #

I am avoiding the cost of API access by using the chat/ui instead, in my case Google Gemini 2.5 Pro with the high token window. Repomix a whole repo. Paste it in with a standard prompt saying "return full source" (it tends to not follow this instruction after a few back and forths) and then apply the result back on top of the repo (vibe coded https://github.com/radekstepan/apply-llm-changes to help me with that). Else yeah, $5 spent on Cline with Claude 3.7 and instead of fixing my tests, I end up with if/else statements in the source code to make the tests pass.
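The "apply the result back on top of the repo" step could look roughly like the sketch below. This is not how apply-llm-changes works (its input format isn't described in the thread). It assumes a made-up convention where the prompt asks the model to precede each full file with a line like `=== FILE: path/to/file ===`; the marker and both function names are hypothetical.

```python
from pathlib import Path

# Hypothetical marker the prompt would ask the model to emit before each file.
HEADER_PREFIX = "=== FILE: "
HEADER_SUFFIX = " ==="


def parse_full_source(reply: str) -> dict[str, str]:
    """Split a chat reply into {relative_path: file_content}.

    Any text before the first header (greetings, commentary) is ignored.
    """
    files: dict[str, list[str]] = {}
    current = None
    for line in reply.splitlines():
        if line.startswith(HEADER_PREFIX) and line.endswith(HEADER_SUFFIX):
            name = line[len(HEADER_PREFIX):-len(HEADER_SUFFIX)].strip()
            current = name or None  # skip sections with an empty path
            if current is not None:
                files[current] = []
        elif current is not None:
            files[current].append(line)
    # Trim blank lines at the edges and end each file with one newline.
    return {p: "\n".join(ls).strip("\n") + "\n" for p, ls in files.items()}


def apply_to_repo(files: dict[str, str], repo_root: str) -> list[Path]:
    """Overwrite files under repo_root; refuse paths that escape the repo."""
    root = Path(repo_root).resolve()
    written = []
    for rel, content in files.items():
        target = (root / rel).resolve()
        if not target.is_relative_to(root):  # Python 3.9+
            raise ValueError(f"refusing to write outside the repo: {rel}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        written.append(target)
    return written
```

Because whole files get overwritten, running this on a clean working tree and reviewing the diff afterwards is one way to catch the case mentioned above, where the model stops returning full source after a few rounds.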

replies(4): >>43999036 #>>43999080 #>>43999160 #>>44013021 #

1. harvey9 ◴[15 May 25 20:33 UTC] No.43999036

>>43998797 #

Guess it was trained by scraping thedailywtf.com
