
435 points crawshaw | 2 comments
_bin_ ◴[] No.43998743[source]
I've found sonnet-3.7 to be incredibly inconsistent. It can do very well, but it has a strong tendency to get off track and do weird things.

3.5 is better for this, ime. I hooked Claude Desktop up to an MCP server to fake claude-code without the extortionate pricing, and it works decently. I've been trying to apply it to Rust work; it's not great yet (it still doesn't really seem to "understand" Rust's concepts), but it can do some stuff if you make it run `cargo check` after each change and stop it when the check fails.
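The check-after-each-change gate can be sketched as a small driver script. This is a minimal illustration, not anything from the actual MCP setup: `cargo_check` and `gate_edit` are hypothetical helper names, and the real loop would also feed the compiler errors back to the model.

```python
import subprocess
from typing import Callable


def cargo_check(project_dir: str) -> bool:
    """Run `cargo check` in the project and report whether it still compiles."""
    result = subprocess.run(
        ["cargo", "check", "--message-format=short"],
        cwd=project_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def gate_edit(original: str, proposed: str,
              passes_check: Callable[[str], bool]) -> str:
    """Keep the model's proposed source only if it passes the check, else revert."""
    return proposed if passes_check(proposed) else original
```

With the real `cargo_check` plugged in as `passes_check`, a bad edit is simply rolled back instead of compounding into the next step.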

I expect something like o3-high is the best out there (the aider leaderboards support this), either alone or in combination with 4.1, but tbh that's out of my price range. And frankly, I can't mentally get past paying a very high price for an LLM response that may or may not be useful; it leaves me incredibly resentful as a customer that your model can fail the task, requiring multiple "re-rolls", and that you pass that marginal cost on to me.

replies(3): >>43998797 #>>43999022 #>>43999599 #
johnsmith1840 ◴[] No.43999599[source]
I seem to be alone in this, but the only models I've found truly good at coding are the slow, heavy test-time-compute models.

o1-pro and o1-preview are the only models I've ever used that can reliably update and work with 1000 LOC without error.

I don't let o3 write any code unless it's very small. Any "cheap" model will hallucinate or fail massively when pushed.

One good tip I've adopted lately: remove all comments from your code before passing it to an LLM, and don't let LLM-generated comments persist under any circumstance.

replies(2): >>43999812 #>>44002083 #
1. doug_durham ◴[] No.44002083[source]
I never have LLMs work on 1000 LOC. I don't think that's the value-add. Instead I use them at the function and class level to accelerate my work. The thought of having any agent, human or computer, run amok in my code makes me uncomfortable. At the end of the day I'm still accountable for the work, and I have to read and comprehend everything. Doing it piecewise makes tracking the work easier.
replies(1): >>44012545 #
2. johnsmith1840 ◴[] No.44012545[source]
Big test-time-compute LLMs can easily handle 1k LOC, depending on logic density and prompt density.

Never an agent; every independent step an LLM takes is dangerous. My method is much more about taking the largest safe single step possible at a time. If it can't do it in one step, I narrow the scope until it can.