←back to thread

Tools: Code Is All You Need

(lucumr.pocoo.org)
313 points Bogdanp | 1 comments | | HN request time: 0.429s | source
Show context
simonw ◴[] No.44455353[source]
Something I've realized about LLM tool use is that it means that if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute force that problem.

The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.

That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.

My assembly Mandelbrot experiment was the thing that made this click for me: https://simonwillison.net/2025/Jul/2/mandelbrot-in-x86-assem...

replies(7): >>44455435 #>>44455688 #>>44456119 #>>44456183 #>>44456944 #>>44457269 #>>44458980 #
vunderba ◴[] No.44456944[source]
> The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide, and how to define the success criteria for the model.

Your test case seems like a quintessential example where you're missing that last step.

Since it is unlikely that you understand the math behind fractals or x86 assembly (apologies if I'm wrong on this), your only means for verifying the accuracy of your solution is a superficial visual inspection, e.g. "Does it look like the Mandelbrot series?"

Ideally, your evaluation criteria would be expressed as a continuous function, but at the very least, it should take the form of a sufficiently diverse quantifiable set of discrete inputs and their expected outputs.

replies(2): >>44458225 #>>44458867 #
1. simonw ◴[] No.44458225[source]
That's exactly why I like using Mandelbrot as a demo: it's perfect for "superficial visual inspection".

With a bunch more work I could likely have got a vision LLM to do that visual inspection for me in the assembly example, but having a human in the loop for that was much more productive.