The unreasonable effectiveness of an LLM agent loop with tool use

(sketch.dev)

447 points crawshaw | 1 comments | 15 May 25 19:33 UTC | HN request time: 0s | source

Show context

_bin_ ◴[15 May 25 20:01 UTC] No.43998743[source]▶

I've found sonnet-3.7 to be incredibly inconsistent. It can do very well but has a strong tendency to get off-track and run off and do weird things.

3.5 is better for this, ime. I hooked claude desktop up to an MCP server to fake claude-code less the extortionate pricing and it works decently. I've been trying to apply it for rust work; it's not great yet (still doesn't really seem to "understand" rust's concepts) but can do some stuff if you make it `cargo check` after each change and stop it if it doesn't.

I expect something like o3-high is the best out there (aider leaderboards support this) either alone or in combination with 4.1, but tbh that's out of my price range. And frankly, I can't mentally get past paying a very high price for an LLM response that may or may not be useful; it leaves me incredibly resentful as a customer that your model can fail the task, requiring multiple "re-rolls", and you're passing that marginal cost to me.

replies(3): >>43998797 #>>43999022 #>>43999599 #

agilebyte ◴[15 May 25 20:07 UTC] No.43998797[source]▶

>>43998743 #

I am avoiding the cost of API access by using the chat/ui instead, in my case Google Gemini 2.5 Pro with the high token window. Repomix a whole repo. Paste it in with a standard prompt saying "return full source" (it tends to not follow this instruction after a few back and forths) and then apply the result back on top of the repo (vibe coded https://github.com/radekstepan/apply-llm-changes to help me with that). Else yeah, $5 spent on Cline with Claude 3.7 and instead of fixing my tests, I end up with if/else statements in the source code to make the tests pass.

replies(4): >>43999036 #>>43999080 #>>43999160 #>>44013021 #

actsasbuffoon ◴[15 May 25 20:37 UTC] No.43999080[source]▶

>>43998797 #

I decided to experiment with Claude Code this month. The other day it decided the best way to fix the spec was to add a conditional to the test that causes it to return true before getting to the thing that was actually supposed to be tested.

I’m finding it useful for really tedious stuff like doing complex, multi step terminal operations. For the coding… it’s not been great.

replies(2): >>43999202 #>>43999758 #

1. christophilus ◴[15 May 25 21:58 UTC] No.43999758[source]▶

>>43999080 #

Well, that’s proof that it used my GitHub projects in its training data.

↑