
Tools: Code Is All You Need

(lucumr.pocoo.org)
313 points by Bogdanp | 1 comment
simonw ◴[] No.44455353[source]
Something I've realized about LLM tool use: if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute force that problem.

The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.

That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.

My assembly Mandelbrot experiment was the thing that made this click for me: https://simonwillison.net/2025/Jul/2/mandelbrot-in-x86-assem...
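(For anyone unfamiliar with the pattern, here is a minimal sketch of what "an LLM in a sandbox using tools in a loop" looks like. call_llm and run_in_sandbox are hypothetical stand-ins for whatever model API and sandbox you use; the point is the loop structure and the explicit success check, not any particular provider.)

    # Minimal sketch of "an LLM in a sandbox using tools in a loop".
    # call_llm() and run_in_sandbox() are hypothetical stand-ins for a real
    # model API and a real isolated execution environment.

    def call_llm(messages):
        """Ask the model for either a tool call or a final answer (hypothetical)."""
        raise NotImplementedError

    def run_in_sandbox(code):
        """Execute code in isolation and return its combined output (hypothetical)."""
        raise NotImplementedError

    def succeeded(output):
        """Problem-specific success criterion, e.g. 'the test suite passed'."""
        return "ALL TESTS PASSED" in output

    def solve(task, max_iterations=50):
        messages = [{"role": "user", "content": task}]
        for _ in range(max_iterations):
            reply = call_llm(messages)
            if reply["type"] == "tool_call":           # model wants to run code
                output = run_in_sandbox(reply["code"])
                messages.append({"role": "tool", "content": output})
                if succeeded(output):
                    return output                      # brute force paid off
            else:                                      # model produced a final answer
                messages.append({"role": "assistant", "content": reply["content"]})
        raise RuntimeError("iteration budget exhausted without meeting the success criteria")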

replies(7): >>44455435 #>>44455688 #>>44456119 #>>44456183 #>>44456944 #>>44457269 #>>44458980 #
nico ◴[] No.44456119[source]
> LLM in a sandbox using tools in a loop, you can brute force that problem

Does this require using big models through their APIs and spending a lot of tokens?

Or can this be done either with local models (probably very slow), or with subscriptions like Claude Code with Pro (without hitting the rate/usage limits)?

I saw the Mandelbrot experiment; it was very cool, but still a rather small project, not really comparable to a bigger, older, more complex code base for a platform used in production

replies(1): >>44456168 #
simonw ◴[] No.44456168[source]
The local models aren't quite good enough for this yet in my experience - the big hosted models (o3, Gemini 2.5, Claude 4) only just crossed the capability threshold for this to start working well.

I think it's possible we'll see a local model that can do this well within the next few months though - it needs good tool calling, not an encyclopedic knowledge of the world. Might be possible to fit that in a model that runs locally.
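(To make "good tool calling" concrete: the hard part for a small model isn't world knowledge, it's reliably emitting well-formed structured calls. A sketch below, in the JSON-Schema style most current tool-calling APIs use; exact field names vary by provider.)

    # A tool definition in the JSON-Schema style used by most tool-calling APIs
    # (field names differ slightly between providers).
    run_python_tool = {
        "name": "run_python",
        "description": "Execute Python code in the sandbox and return stdout/stderr.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python source to execute"},
            },
            "required": ["code"],
        },
    }

    # What the model must emit, as valid JSON, on every iteration of the loop:
    example_tool_call = {
        "name": "run_python",
        "arguments": {"code": "print(2 + 2)"},
    }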

replies(4): >>44456472 #>>44457060 #>>44457080 #>>44458336 #
nico ◴[] No.44457080[source]
> it needs good tool calling, not an encyclopedic knowledge of the world

I wonder if there are any groups/companies out there building something like this

Would love to have models that only know 1 or 2 languages (e.g. python + js), but are great at them and at tool calling. Definitely don't need my coding agent to know all of Wikipedia or how to translate between 10 different languages

replies(1): >>44457675 #
johnsmith1840 ◴[] No.44457675[source]
Given 2 datasets:

1. A special code dataset
2. A bunch of "unrelated" books

My understanding is that a model trained on just the first will never beat a model trained on both. Bloomberg's model is my favorite example of this.

If you can squirrel away special data, then that special data plus everything else will beat any other model. But that's basically what OpenAI, Google, and Anthropic are all currently doing.
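(A toy sketch of the mixing idea: train on the special code data plus the general corpus rather than the code data alone. The 30/70 sampling ratio below is made up; real mixture weights are tuned empirically.)

    import random

    # Toy sketch of data mixing for training: sample from both corpora
    # instead of training on the special dataset alone.
    def mixed_stream(code_examples, general_examples, code_fraction=0.3):
        """Yield training examples, drawing code_fraction of them from the code corpus."""
        while True:
            source = code_examples if random.random() < code_fraction else general_examples
            yield random.choice(source)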