
Tools: Code Is All You Need

(lucumr.pocoo.org)
313 points by Bogdanp | 7 comments
simonw ◴[] No.44455353[source]
Something I've realized about LLM tool use: if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute force that problem.

The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.

That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.

My assembly Mandelbrot experiment was the thing that made this click for me: https://simonwillison.net/2025/Jul/2/mandelbrot-in-x86-assem...
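
For anyone who hasn't seen the pattern spelled out, a minimal sketch of "tools in a loop" looks roughly like this. The llm() and success() callables and the message format are placeholders (every vendor SDK differs), but the shape of the loop is the same: call the model, run the tool it asked for, feed the result back, stop when the success criteria are met.

    # Minimal "tools in a loop" sketch; names and message format are illustrative.
    import subprocess

    def run_shell(cmd: str) -> str:
        # Tool: run a command inside the sandbox, return combined output.
        p = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
        return p.stdout + p.stderr

    TOOLS = {"run_shell": run_shell}

    def solve(task, llm, success, max_steps=50):
        # llm(messages) returns either a tool call or plain text;
        # success() is the check that defines "done" (tests pass, image matches, etc.).
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = llm(messages)
            if "tool" in reply:
                # For run_shell, args is simply the command string.
                result = TOOLS[reply["tool"]](reply["args"])
                messages.append({"role": "tool", "content": result})
            else:
                messages.append({"role": "assistant", "content": reply["text"]})
                if success(reply["text"]):
                    return reply["text"]
        raise RuntimeError("ran out of steps")

The brute-force part is that steps in the sandbox are cheap: if the success check is solid, you can let it grind.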

replies(7): >>44455435 #>>44455688 #>>44456119 #>>44456183 #>>44456944 #>>44457269 #>>44458980 #
1. nico ◴[] No.44456119[source]
> LLM in a sandbox using tools in a loop, you can brute force that problem

Does this require using big models through their APIs and spending a lot of tokens?

Or can this be done either with local models (probably very slow), or with subscriptions like Claude Code with Pro (without hitting the rate/usage limits)?

I saw the Mandelbrot experiment; it was very cool, but still a rather small project, not really comparable to a complex/bigger/older code base for a platform used in production.

replies(1): >>44456168 #
2. simonw ◴[] No.44456168[source]
The local models aren't quite good enough for this yet in my experience - the big hosted models (o3, Gemini 2.5, Claude 4) only just crossed the capability threshold for this to start working well.

I think it's possible we'll see a local model that can do this well within the next few months though - it needs good tool calling, not an encyclopedic knowledge of the world. Might be possible to fit that in a model that runs locally.

replies(4): >>44456472 #>>44457060 #>>44457080 #>>44458336 #
3. pxc ◴[] No.44456472[source]
There's a fine-tune of Qwen3 4B called "Jan Nano" that I started playing with yesterday, which is basically just fine-tuned to be more inclined to look things up via web searches than to answer them "off the dome". It's not good-good, but it does seem to have a much lower effective hallucination rate than other models of its size.

It seems like maybe similar approaches could be used for coding tasks, especially with tool calls for reading man pages, info pages, running `tldr`, specifically consulting Stack Overflow, etc. Some of the recent small MoE models from Chinese companies are significantly smarter than models like Qwen 4B, but run about as quickly, so maybe on systems with high RAM or high unified memory, even with middling GPUs, they could be genuinely useful for coding if they are made to avoid doing anything without tool use.
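
The tool set for that kind of model could be as small as a handful of lookup commands. A rough sketch (the wiring is hypothetical; `man` and `tldr` are the standard CLI tools):

    # Hypothetical lookup tools for a small local coding model:
    # look things up instead of answering from memory.
    import subprocess

    def man_page(name: str) -> str:
        # Fetch the man page for a command or C function.
        return subprocess.run(["man", name], capture_output=True, text=True).stdout

    def tldr(name: str) -> str:
        # Short usage examples via the tldr client (needs `tldr` installed).
        return subprocess.run(["tldr", name], capture_output=True, text=True).stdout

    def search_stackoverflow(query: str) -> str:
        # Placeholder: wire this to whatever web-search tool the harness exposes.
        raise NotImplementedError

    TOOLS = {"man_page": man_page, "tldr": tldr,
             "search_stackoverflow": search_stackoverflow}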

4. never_inline ◴[] No.44457060[source]
Wasn't there a tool-calling benchmark by the Docker guys which concluded Qwen models are nearly as good as GPT? What's your experience with it?

Personally I'm convinced JSON is a bad format for LLMs, and that code orchestration in a Python-ish DSL is the future. But local models are pretty bad at code gen too.
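
For context, the contrast is roughly this: instead of the harness parsing one JSON tool call per round trip, the model writes a small Python snippet and the harness executes it, so it can branch, loop, and chain calls in one go. A rough sketch of the harness side (function names are made up):

    # Hypothetical harness for a python-ish DSL: expose a few functions,
    # then exec() the snippet the model wrote instead of parsing JSON calls.
    import subprocess

    def run_shell(cmd: str) -> str:
        p = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return p.stdout + p.stderr

    def read_file(path: str) -> str:
        with open(path) as f:
            return f.read()

    def execute_model_snippet(snippet: str) -> dict:
        # The model emits ordinary Python that calls the exposed functions.
        scope = {"run_shell": run_shell, "read_file": read_file}
        exec(snippet, scope)  # sandboxing and error handling omitted for brevity
        return scope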

5. nico ◴[] No.44457080[source]
> it needs good tool calling, not an encyclopedic knowledge of the world

I wonder if there are any groups/companies out there building something like this

Would love to have models that only know 1 or 2 languages (e.g. Python + JS), but are great at them and at tool calling. Definitely don't need my coding agent to know all of Wikipedia or how to translate between 10 different languages.

replies(1): >>44457675 #
6. johnsmith1840 ◴[] No.44457675{3}[source]
Given 2 datasets:

1. A special code dataset
2. A bunch of "unrelated" books

My understanding is that the model trained on just the first will never beat the model trained on both. The Bloomberg model is my favorite example of this.

If you can squirrel away special data, then that special data plus everything else will beat any other model. But that's basically what OpenAI, Google, and Anthropic are all currently doing.

7. e12e ◴[] No.44458336[source]
I wonder if Common Lisp with a REPL and debugger could provide a better tool than your example with nasm wrapped via apt in Docker...

Essentially just giving LLMs more state-of-the-art systems made for incremental development?

Ed: looks like that sort of exists: https://github.com/bhauman/clojure-mcp

(Would also be interesting if one could have a few LLMs working together on a red/green TDD approach - an orchestrator that parses requirements and dispatches a red goblin to write a failing test; a green goblin that writes code until the test passes; and then some kind of hobgoblin to refactor the code, keeping the test(s) green - working with the orchestrator to "accept" a given feature as done and move on to the next...

With any luck the resulting code might be a bit more transparent (stricter form) than other LLM code.)
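
A very rough sketch of that orchestration, with each "goblin" as a prompt plus an LLM call; llm() and run_tests(test, code) are placeholders you would supply:

    def red(llm, requirement, run_tests):
        # Red goblin: produce one failing test for the requirement.
        test = llm(f"Write a single failing test for: {requirement}")
        assert not run_tests(test, ""), "red goblin must produce a failing test"
        return test

    def green(llm, test, run_tests, max_attempts=10):
        # Green goblin: iterate until the test passes.
        code = ""
        for _ in range(max_attempts):
            code = llm(f"Make this test pass:\n{test}\n\nCurrent code:\n{code}")
            if run_tests(test, code):
                return code
        raise RuntimeError("green goblin never got the test passing")

    def refactor(llm, test, code, run_tests):
        # Hobgoblin: clean up, but only keep the result if the test stays green.
        cleaned = llm(f"Refactor without changing behaviour:\n{code}")
        return cleaned if run_tests(test, cleaned) else code

    def orchestrate(llm, requirements, run_tests):
        # Orchestrator: accept each feature once its test is green, then move on.
        accepted = []
        for req in requirements:
            test = red(llm, req, run_tests)
            code = green(llm, test, run_tests)
            accepted.append((test, refactor(llm, test, code, run_tests)))
        return accepted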