←back to thread

Tools: Code Is All You Need

(lucumr.pocoo.org)
313 points Bogdanp | 1 comments | | HN request time: 0s | source
Show context
simonw ◴[] No.44455353[source]
Something I've realized about LLM tool use is that it means that if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute force that problem.

The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.

That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.

My assembly Mandelbrot experiment was the thing that made this click for me: https://simonwillison.net/2025/Jul/2/mandelbrot-in-x86-assem...

replies(7): >>44455435 #>>44455688 #>>44456119 #>>44456183 #>>44456944 #>>44457269 #>>44458980 #
nico ◴[] No.44456119[source]
> LLM in a sandbox using tools in a loop, you can brute force that problem

Does this require using big models through their APIs and spending a lot of tokens?

Or can this be done either with local models (probably very slow), or with subscriptions like Claude Code with Pro (without hitting the rate/usage limits)?

I saw the Mandelbrot experiment, it was very cool, but still a rather small project, not really comparable to a complex/bigger/older code base for a platform used in production

replies(1): >>44456168 #
simonw ◴[] No.44456168[source]
The local models aren't quite good enough for this yet in my experience - the big hosted models (o3, Gemini 2.5, Claude 4) only just crossed the capability threshold for this to start working well.

I think it's possible we'll see a local model that can do this well within the next few months though - it needs good tool calling, not an encyclopedic knowledge of the world. Might be possible to fit that in a model that runs locally.

replies(4): >>44456472 #>>44457060 #>>44457080 #>>44458336 #
1. never_inline ◴[] No.44457060[source]
Wasn't there a tool calling benchmark by docker guys which concluded qwen models are nearly as good as GPT? What is your experience about it?

Personally I am convinced JSON is a bad format for LLMs and code orchestration in python-ish DSL is the future. But local models are pretty bad at code gen too.