Tools: Code Is All You Need

(lucumr.pocoo.org)


simonw ◴[03 Jul 25 14:22 UTC] No.44455353[source]▶

Something I've realized about LLM tool use is that it means that if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute force that problem.

The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.

That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.
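
A minimal sketch of that "tools in a loop" shape, assuming nothing about any particular model or sandbox API: `run_tool_loop`, `propose`, `run_in_sandbox`, and `is_success` are hypothetical names, and the toy callables below only stand in for a real model call, a real tool call, and a real success check.

```python
from typing import Callable, Optional


def run_tool_loop(
    propose: Callable[[str], str],
    run_in_sandbox: Callable[[str], str],
    is_success: Callable[[str], bool],
    max_attempts: int = 10,
) -> Optional[str]:
    """Repeat propose -> run -> check until success or attempts run out."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = propose(feedback)       # hypothetical stand-in for the model
        output = run_in_sandbox(candidate)  # hypothetical stand-in for the tool
        if is_success(output):              # the success criteria you defined
            return candidate
        feedback = output                   # failed output feeds the next attempt
    return None


# Toy stand-ins: the "model" guesses integers, the "sandbox" squares them,
# and the success criterion is an exact match on the output.
guesses = iter(range(100))
result = run_tool_loop(
    propose=lambda feedback: str(next(guesses)),
    run_in_sandbox=lambda candidate: str(int(candidate) ** 2),
    is_success=lambda output: output == "49",
)
```

The loop itself is trivial; the work described above lives in the three callables, and `is_success` in particular decides whether brute force converges on something correct.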

My assembly Mandelbrot experiment was the thing that made this click for me: https://simonwillison.net/2025/Jul/2/mandelbrot-in-x86-assem...

replies(7): >>44455435 #>>44455688 #>>44456119 #>>44456183 #>>44456944 #>>44457269 #>>44458980 #

1. vunderba ◴[03 Jul 25 16:47 UTC] No.44456944[source]▶

>>44455353 #

> The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide, and how to define the success criteria for the model.

Your test case seems like a quintessential example where you're missing that last step.

Since it is unlikely that you understand the math behind fractals or x86 assembly (apologies if I'm wrong on this), your only means for verifying the accuracy of your solution is a superficial visual inspection, e.g. "Does it look like the Mandelbrot set?"

Ideally, your evaluation criteria would be expressed as a continuous function, but at the very least, it should take the form of a sufficiently diverse quantifiable set of discrete inputs and their expected outputs.
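
As a rough illustration of what a "set of discrete inputs and their expected outputs" could look like for Mandelbrot, here is a sketch in Python. The names `escapes`, `KNOWN_POINTS`, and `failures` are made up for this example, and it assumes the program under test can be made to report, per point, whether the iteration z -> z*z + c escapes. The escape radius of 2 and the cap of 200 iterations are conventional choices, not something taken from the experiment.

```python
def escapes(c: complex, max_iter: int = 200) -> bool:
    """Reference check: does z -> z*z + c leave the radius-2 disc?"""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return True
    return False


# (point, expected value of escapes) pairs with well-known behaviour.
KNOWN_POINTS = [
    (0 + 0j, False),    # stays at 0 forever
    (-1 + 0j, False),   # cycles 0, -1, 0, -1, ...
    (0 + 1j, False),    # settles into the cycle -1+i, -i
    (1 + 0j, True),     # 1, 2, 5, ... grows without bound
    (0.5 + 0j, True),   # escapes after a handful of iterations
    (2 + 2j, True),     # already outside the disc after one step
]


def failures(candidate):
    """Return the (point, expected) pairs where the candidate disagrees."""
    return [(c, want) for c, want in KNOWN_POINTS if candidate(c) != want]
```

Calling `failures(escapes)` returns an empty list, while a candidate that never reports an escape fails on the last three points. A handful of points is far from "sufficiently diverse", but it shows the shape of the check.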

replies(2): >>44458225 #>>44458867 #

2. simonw ◴[03 Jul 25 19:06 UTC] No.44458225[source]▶

>>44456944 (TP) #

That's exactly why I like using Mandelbrot as a demo: it's perfect for "superficial visual inspection".

With a bunch more work I could likely have got a vision LLM to do that visual inspection for me in the assembly example, but having a human in the loop for that was much more productive.

3. shepherdjerred ◴[03 Jul 25 20:24 UTC] No.44458867[source]▶

>>44456944 (TP) #

Are fractals or x86 assembly representative of most dev work?

replies(1): >>44458989 #

4. nartho ◴[03 Jul 25 20:40 UTC] No.44458989[source]▶

>>44458867 #

I think it's irrelevant. The point they are trying to make is that anytime you ask an LLM for something that's outside of your area of expertise, you have very little to no way to ensure it is correct.

replies(2): >>44459585 #>>44466793 #

5. diggan ◴[03 Jul 25 22:11 UTC] No.44459585{3}[source]▶

>>44458989 #

> anytime you ask an LLM for something that's outside of your area of expertise, you have very little to no way to ensure it is correct.

I regularly use LLMs to code specific functions I don't necessarily understand the internals of. Most of the time I do that, it's something math-heavy for a game. Just like any function, I put it under automated and manual tests. I still review the code and try to gain some intuition about what is happening, and although it remains very far outside my area of expertise, I can be sure it works as I expect it to.
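
One way to test a math-heavy function without following its internals is to assert properties its output must have. The sketch below uses a hypothetical 2D `rotate` helper as a stand-in for an LLM-written game function; the function, the property list, and the tolerances are all assumptions chosen for illustration.

```python
import math
import random


def rotate(x: float, y: float, angle: float) -> tuple[float, float]:
    """Hypothetical function under test: rotate (x, y) by angle radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def check_rotate(fn, trials: int = 1000, seed: int = 0) -> bool:
    """Raise AssertionError if fn violates basic rotation properties."""
    rng = random.Random(seed)

    # One hand-checkable case: a quarter turn maps (1, 0) to (0, 1).
    x, y = fn(1.0, 0.0, math.pi / 2)
    assert math.isclose(x, 0.0, abs_tol=1e-9)
    assert math.isclose(y, 1.0, abs_tol=1e-9)

    for _ in range(trials):
        x, y = rng.uniform(-100, 100), rng.uniform(-100, 100)
        a = rng.uniform(-math.pi, math.pi)
        rx, ry = fn(x, y, a)
        # Rotation must preserve the vector's length.
        assert math.isclose(math.hypot(rx, ry), math.hypot(x, y),
                            rel_tol=1e-9, abs_tol=1e-9)
        # Rotating back by the opposite angle must undo the rotation.
        bx, by = fn(rx, ry, -a)
        assert math.isclose(bx, x, abs_tol=1e-9)
        assert math.isclose(by, y, abs_tol=1e-9)
    return True
```

Checks like these can be written from what the function is for (lengths stay the same, the inverse undoes it) rather than from how it computes the result.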

6. shepherdjerred ◴[04 Jul 25 18:25 UTC] No.44466793{3}[source]▶

>>44458989 #

I think you're vastly overestimating the amount of knowledge the average developer has in their "area of expertise".

I'm not saying that's a good thing, just that LLMs are no worse than the bottom 50% of devs.
