Tools: Code Is All You Need

(lucumr.pocoo.org)

313 points Bogdanp | 1 comments | 03 Jul 25 10:51 UTC

simonw ◴[03 Jul 25 14:22 UTC] No.44455353[source]▶

Something I've realized about LLM tool use: if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute force that problem.

The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.
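
As a rough illustration of "tools in a loop" with an explicit success criterion, here is a minimal sketch. Everything in it is hypothetical: `run_loop`, `counting_policy` and the `square` tool are invented for this example, the "model" is a stub function rather than a real LLM call, and no actual sandboxing is done.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a model call: given the transcript so far, it
# returns (tool_name, argument). A real setup would query an LLM here.
Policy = Callable[[List[str]], Tuple[str, str]]


def run_loop(policy: Policy,
             tools: Dict[str, Callable[[str], str]],
             success: Callable[[str], bool],
             max_steps: int = 20) -> Tuple[bool, List[str]]:
    """Call tools in a loop until the success check passes or steps run out."""
    transcript: List[str] = []
    for _ in range(max_steps):
        name, arg = policy(transcript)
        if name not in tools:
            # Feed the mistake back into the transcript instead of crashing.
            transcript.append(f"error: unknown tool {name!r}")
            continue
        result = tools[name](arg)
        transcript.append(f"{name}({arg!r}) -> {result}")
        if success(result):
            return True, transcript
    return False, transcript


# Toy demo: the stub "model" tries 1, 2, 3, ... and the tool squares each one.
def counting_policy(transcript: List[str]) -> Tuple[str, str]:
    return "square", str(len(transcript) + 1)


tools = {"square": lambda s: str(int(s) ** 2)}

# Success criterion: some attempt must produce 49.
ok, log = run_loop(counting_policy, tools, success=lambda r: r == "49")
```

The three arguments line up with the three configuration choices named above: which tools to provide, how to define success, and (via `max_steps`) how much brute force to allow.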

That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.

My assembly Mandelbrot experiment was the thing that made this click for me: https://simonwillison.net/2025/Jul/2/mandelbrot-in-x86-assem...

replies(7): >>44455435 #>>44455688 #>>44456119 #>>44456183 #>>44456944 #>>44457269 #>>44458980 #

nico ◴[03 Jul 25 15:34 UTC] No.44456119[source]▶

>>44455353 #

> LLM in a sandbox using tools in a loop, you can brute force that problem

Does this require using big models through their APIs and spending a lot of tokens?

Or can this be done either with local models (probably very slow), or with subscriptions like Claude Code with Pro (without hitting the rate/usage limits)?

I saw the Mandelbrot experiment. It was very cool, but it's still a rather small project, not really comparable to a complex/bigger/older code base for a platform used in production.

replies(1): >>44456168 #

simonw ◴[03 Jul 25 15:39 UTC] No.44456168[source]▶

>>44456119 #

The local models aren't quite good enough for this yet in my experience - the big hosted models (o3, Gemini 2.5, Claude 4) only just crossed the capability threshold for this to start working well.

I think it's possible we'll see a local model that can do this well within the next few months though - it needs good tool calling, not an encyclopedic knowledge of the world. Might be possible to fit that in a model that runs locally.

replies(4): >>44456472 #>>44457060 #>>44457080 #>>44458336 #

1. never_inline ◴[03 Jul 25 16:59 UTC] No.44457060[source]▶

>>44456168 #

Wasn't there a tool-calling benchmark by the Docker folks that concluded Qwen models are nearly as good as GPT? What has your experience with it been?

Personally, I am convinced JSON is a bad format for LLMs and that code orchestration in a Python-ish DSL is the future. But local models are pretty bad at code gen too.
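
To make the contrast concrete, here is a small sketch of the two styles side by side. The tools (`list_files`, `count_lines`), the JSON shape and the canned data are all made up for illustration and do not reflect any particular vendor's tool-calling API.

```python
import json

# Hypothetical tools backed by canned data, invented for this example.
def list_files(directory):
    return {"src": ["a.py", "b.py", "notes.txt"]}.get(directory, [])


def count_lines(path):
    return {"a.py": 10, "b.py": 32}.get(path, 0)


TOOLS = {"list_files": list_files, "count_lines": count_lines}


# Style 1: JSON tool calls. The model emits one call per turn, and every
# intermediate result has to travel back through the model.
def dispatch_json(call_text):
    call = json.loads(call_text)
    return TOOLS[call["name"]](**call["arguments"])


files = dispatch_json('{"name": "list_files", "arguments": {"directory": "src"}}')

# Style 2: code orchestration. The model emits one short script that composes
# the tools itself, so the loop and the filter never leave the interpreter.
script = """
total = sum(count_lines(f) for f in list_files("src") if f.endswith(".py"))
"""
namespace = dict(TOOLS)
exec(script, namespace)  # plain exec is NOT a sandbox; isolate it in real use
total = namespace["total"]
```

In the JSON style, counting lines across the Python files would take one further call per file. In the code style it is a single expression, which is roughly the appeal of a Python-ish DSL.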
