Now build it for an old codebase; let's see how precisely it edits or removes features without breaking the whole codebase.
And let's see how many tokens it consumes per bug fix or feature addition.
The whole thing runs on these prompts: https://github.com/SWE-agent/mini-swe-agent/blob/7e125e5dd49...
> Your task: {{task}}. Please reply with a single shell command in triple backticks. To finish, the first line of the output of the shell command must be 'COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT'.
1. Precompute frequently used knowledge and surface it early: for example, repository structure, OS information, and system time.
2. Anticipate the next tool calls. If a match is not found while editing, instead of simply failing, return the closest matching snippet (see the sketch after this list). If the read-file tool gets a directory, return the directory contents.
3. Parallel tool calls. Claude needs either a batch tool or special scaffolding to encourage parallel tool calls; a single tool call per turn is very expensive.
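For idea 2, here is a minimal sketch of what that fallback could look like, assuming a Python harness; the tool name, signature, and return strings are made up for illustration:

```python
import difflib

def edit_file(path: str, old: str, new: str) -> str:
    """Hypothetical edit tool: replace `old` with `new`, but on a miss,
    return the closest matching snippet instead of a bare failure."""
    text = open(path).read()
    if old in text:
        open(path, "w").write(text.replace(old, new, 1))
        return "OK: edit applied."
    # Anticipate the model's next move: surface the nearest snippet so it
    # can correct its match string without an extra exploratory turn.
    lines = text.splitlines()
    window = max(1, len(old.splitlines()))
    candidates = ["\n".join(lines[i:i + window]) for i in range(len(lines))]
    close = difflib.get_close_matches(old, candidates, n=1, cutoff=0.0)
    return f"No exact match. Closest snippet:\n{close[0]}" if close else "File is empty."
```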
Are there any other such general ideas?
I am still looking for a good "memory" solution; so far I'm running without one. I haven't looked too deeply into it.
I'm not sure how the next tool call can be predicted.
I am still using serial tool calls, as I do not have any subagents; I just use fast inference models for direct tool calls. It works so fast that I doubt I'll benefit from parallelizing anything.
I just wrote this comment so people aren't under the false belief that this is pretty much all coding agents do; making all of this fault-tolerant with good UX is a lot of work.
I think the future will be dashboards/HUDs (there was an article on HN about this a bit ago and I agree). You'll get preview windows, dynamic action buttons, a kanban board, status updates, and still the ability to edit code yourself, of course.
The single-file lineup of agentic actions with user input, in a terminal chat UI, just isn't gonna cut it for more complicated problems. You need faster error reporting from multiple sources, and you need to be able to correct the LLM and break it out of error loops. You won't want to be at the terminal, even though it feels comfortable, because it's just the wrong HCI tool for more complicated tasks. Can you tell I really dislike using these overly simple agents?
You'll get a much better result with a dashboard/HUD. The future of agents is that multiple of them will be working at once on the codebase and they'll be good enough that you'll want more of a status-update-confirm loop than an agentic code editing tool update.
Also required is better code editing. You want to avoid the LLM making changes in your code unrelated to the requested problem. Gemini CLI often does a 'grep' for keywords from your prompt to find the right file, but if your prompt was casual and doesn't contain the right keywords, you end up with the agent making changes that aren't intended.
Obviously I am working in this space so that's where my opinions come from. I have a prototype HUD-style webapp builder agent that is online right now if you'd like to check it out:
It's not got everything I said above - it's a work in progress. Would love any feedback you have on my take on a more complicated, involved, and narrow-focus agentic workflow. It only builds Flask webapps right now, with strict limits on what it can do (no cron etc. yet), but it does have a database you can use in your projects. I put a lot of work into the error flow as well, as that seems like the biggest issue with a lot of agentic code tools.
One last technical note: I blogged about using AST transformations when getting LLMs to modify code. I don't think diffs or rewriting the whole file are the right solution either. I think having the LLM write code that modifies your code, and then running that code to effect the modifications, is the way forward. We'll see, I guess. Blog post: https://codeplusequalsai.com/static/blog/prompting_llms_to_m...
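To make the AST idea concrete, here is a hypothetical example (not taken from the blog post) of the kind of transformation code an LLM might emit instead of a diff, using Python's stdlib ast module; the file and function names are made up:

```python
import ast

# Hypothetical code an LLM might write: instead of emitting a diff, it
# emits a transformation that renames a function and every call site.
class RenameFunction(ast.NodeTransformer):
    def __init__(self, old: str, new: str):
        self.old, self.new = old, new

    def visit_FunctionDef(self, node):
        if node.name == self.old:
            node.name = self.new
        self.generic_visit(node)
        return node

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            node.func.id = self.new
        self.generic_visit(node)
        return node

source = open("app.py").read()  # "app.py" and the names below are illustrative
tree = RenameFunction("fetch_data", "load_data").visit(ast.parse(source))
open("app.py", "w").write(ast.unparse(tree))  # ast.unparse needs Python 3.9+
```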
Gemini CLI still uses the archaic whole-file format for edits; it's not a good representative of the current state of coding agents.
That's not the case with a codebase, where things are scattered around according to the specific model of organisation the developer had in mind.
You wish
This prompt snippet from your instance template is quite useful. I use something like this for getting out of debug loops:
> Analyse the codebase and brainstorm a list of potential root causes for the issue, and rank them from most likely to least likely. Then create scripts or add debug logging to confirm whether your hypothesis is correct. Rule out root causes from most likely to least likely by executing your scripts and observing the output.
Surely listing files, searching a repo, and editing a file can all be achieved with bash?
Or is this what's demonstrated by https://news.ycombinator.com/item?id=45001234?
There are a few models that solve 30-50% of (new) tasks pulled from real-world repos. So ... yeah.
If everything goes through bash, then you need some way to separate always-safe commands that don't need approval (such as listing files) from all other potentially unsafe commands that require user approval.
If you have listing files as a separate tool, then you can also enforce that the agent doesn't list any files outside of the project directory.
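Both points are easy to sketch in a Python harness; the allowlist, operator blocklist, and project root below are illustrative rather than exhaustive (classifying shell commands as safe is genuinely hard to get right):

```python
from pathlib import Path

PROJECT_ROOT = Path("/home/user/project").resolve()   # illustrative
SAFE_COMMANDS = {"ls", "cat", "grep", "find", "head"}  # read-only allowlist

def needs_approval(command: str) -> bool:
    """Auto-approve only bare invocations of read-only commands."""
    # Shell operators would let a "safe" command smuggle in an unsafe
    # one (e.g. `ls; rm -rf /`), so their mere presence forces approval.
    if any(op in command for op in ("|", ">", ";", "&", "`", "$(")):
        return True
    words = command.split()
    return not (words and words[0] in SAFE_COMMANDS)

def list_files(path: str) -> list[str]:
    """A dedicated tool can refuse to look outside the project."""
    target = (PROJECT_ROOT / path).resolve()
    target.relative_to(PROJECT_ROOT)  # raises ValueError if it escapes
    return sorted(p.name for p in target.iterdir())
```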
I've built a SWE agent too (for fun), check it out => https://github.com/myriade-ai/autocode
My best guess is that they started out with a limited subset of tools and realised they could just give it bash later.
One of the reasons you get better performance if you give them the other tools is that there has been some reinforcement learning on Sonnet with all these tools. The model is aware of how these tools work; it is more token-efficient and generally much more successful at performing those actions. The bash tool, for instance, at times gets confused by bashisms, not escaping arguments correctly, not handling whitespace correctly, etc.
This saves the LLM from having to do lots of low-level clicking and typing and keeps it on track. Help the poor model out, will ya!?
Why the unnecessary AI-generated pictures in between?
Why put everything that could have been a bullet point into its own individual picture (even if it's not AI-generated)? It's very visually distracting, it breaks the flow of reading, and it's less accessible, as all the pictures lack alt text.
---
I see that it's based on a conference talk, so it's possibly just the slides 1:1. If that's the case, please put it up in its native conference format rather than this.
> The Bash tool, for instance, at times gets confused by bashisms, not escaping arguments correctly, not handling whitespace correctly etc.
This was the only informative sentence in the reply. Can you please go on in this manner - it was an important question. This project and this post are for the curious and for the learners.
This is a very strong argument for more specific tools, thanks!
Interesting! This didn't seem to be the case in the OP's examples - for instance, using a list_files tool and then checking whether the JSON result included README, vs. bash [ -f README ].
https://github.com/SWE-agent/mini-swe-agent/blob/7e125e5dd49...
> right tools allow small models to perform better than undirected tool like bash to do everything.
Interestingly enough, the newer mini-swe-agent was a refutation of this hypothesis for very large LLMs; the original SWE-agent paper (https://arxiv.org/pdf/2405.15793) assumed that specialized tools work better.
(Also, I think Gemini is significantly better when it comes to context rot; in my experience, 100K-300K tokens were required for symptoms to appear. So burning tokens is less problematic with Gemini.)
Money. Replace "tokens" with "money". You just keep throwing money at the loop, and then you've got yourself an agent.
Yes, it is. Not only in the department of good UX design; the LLMs themselves also keep evolving. They are software with different versions, and these different versions are continually deployed, which changes the behavior of the underlying model. So the harness needs to be continually updated to remain competitive.
They are great for basic tasks like summarization and translation, but for the best results from coding agents - and for the 90% of so-called AI startups who are using these APIs - it all comes down to purchasing tokens.
It's no different to operating a slot machine aimed at vibe-coders, who are the AI companies' favourite type of customer: spending endless amounts of money on tokens for another spin at fixing an error they don't understand.
And remember to avoid feeding the trolls.
I live in the “valley”. I battle depression daily that I had before LLMs.
Using LLMs and false guardrails to watchdog inherently deceitful output is a bad system smell.
I know most are “on it”, and I’ve written a coding agent.
But why is this page designed like some brainwashing repetitive Orwellian mantra?
If it’s perceived that we need that, then we’re having to overcome something, and that something is common sense.
So maybe we’ll happily write our coding agents with the intent to stand on the shoulders of a giant.
But everyone knows we’re building the technological equivalent of a crystal meth empire.
There are theoretically impossible things to do, if you buy into only the basics. If you open your mind, anything is achievable; you just need to break out of the box you’re in.
If enough people keep feeding in that we need a time machine, the revolution will play out in all the timelines. Without it, Sarah Connor is lost.
That's why critique has value. To the original author/artist (if they see it), but also to everyone else who sees it. "Oh, I was going to intersperse text slides with a transcript, but I remember how offputting that was once on HN, so let's skip the slides."
But with edge-case exceptions aside, yes, tokens cost money.
https://docs.anthropic.com/en/docs/agents-and-tools/tool-use...
The disconnect here is that models aren't really "text"-based but token-based, like how compilers don't operate on the code itself but on a series of tokens that can include keywords, brackets, and other things. The output can include words but also metadata.
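You can see this directly with, for example, OpenAI's tiktoken library (the cl100k_base vocabulary here is just one example; other models use other vocabularies):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("bash: ls -la /src")
print(ids)              # a short list of integer token IDs, not characters
print(enc.decode(ids))  # round-trips back to the original string
```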
There is no training on a tool with that name. But it likely also doesn't need training, because the parameter is just a path and it's a pretty basic tool.
On the other hand, to know how to execute a bash command, you need to know bash. Bash is a known tool to the Claude models [1], and so is text editing [2]. You're supposed to reference those in the tool listing, but at least from my testing, the moment you call a tool "bash", Claude makes plenty of assumptions about what the point of this thing is.
[1]: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use...
[2]: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use...
As far as I know, that's what's happening. They are training it to return tool responses when it's unsure about the answer or instructed to do so. There are generic tool trainings for just following the response format, and then probably some tool-specific trainings. For instance, gpt-oss loves to use the search tool, even if it's not mentioned anywhere. Anthropic lists well-known tools in their documentation (e.g. text_editor, bash); those are likely to have been trained specifically to follow some deeper semantics, beyond just generic tool usage.
The whole thing is pretty brittle, and tool invocations just take place via in-band signalling, delineated by special tokens or token sequences.
Negative things like IP stealing "AI" can be stopped as well, and the population is increasingly watchful and will organize itself at some point.
* The only true interface with an LLM is tokens. (No separation between control and data channels.)
* The model api layer injects instructions on tool calling and a list of available tools into the base prompt, with documentation on what those tools do.
* Tool calling is delineated by special tokens. When a model wants to call a tool, it adds a special block to the response containing the magic token(s) along with the name of the tool and any params. The API layer then extracts this and forms a structured JSON response in some tool_calls parameter (or whatever) that is sent in the API response to the user. The result of the tool coming back from the user through the tool-calling API is then encoded with special tokens and injected. (A sketch of this flow follows the list.)
* Presumably, the API layer prevents the user from injecting such tokens themselves.
* SotA models are good at tool calls because they have been heavily fine-tuned on them, with all sorts of tasks that involve tool calls, like bash invocations. The fine-tuning is both to get them good at tool calls in general and probably also covers specific tool calls the model provider wants them to be good at, such as Claude Sonnet getting fine-tuned on the specific tools Claude Code uses.
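A toy sketch of that extraction step (the sentinel strings here are invented; real providers use reserved special tokens and their own wire formats):

```python
import json
import re

# Invented delimiters standing in for the model's special tokens.
TOOL_CALL = re.compile(r"<\|tool_call\|>(.*?)<\|/tool_call\|>", re.DOTALL)

def extract_tool_calls(model_output: str) -> list[dict]:
    """What the API layer does: scan the raw output for the magic
    delimiters and lift each payload into structured JSON."""
    return [json.loads(payload) for payload in TOOL_CALL.findall(model_output)]

raw = 'Let me check.<|tool_call|>{"name": "bash", "arguments": {"command": "ls"}}<|/tool_call|>'
print(extract_tool_calls(raw))
# [{'name': 'bash', 'arguments': {'command': 'ls'}}]
```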
Sometimes it amazes me that this all works so well, but it does. You are right to put your finger on the fine-tuning, as it’s critical for making tool calling work well. Tool calling works without fine-tuning, but it’s going to be more hit-or-miss.
If you need to edit the source, just use patch with the bash tool.
What's the efficiency issue?
The good: it's cool to learn more about agent loops, different types of LLMs, and ideas for prompting. I definitely wanna try it - it would be cool to prompt the agent to build some feature, leave it in a loop of building, testing, and reviewing, go have breakfast, and come back to only have to tweak reasonably legible, working code.
The bad: some of these concepts - maybe they aren't meant to mislead, but they really trigger my 'snake oil alert'. The AI compass? Agentic vs. non-agentic LLMs? People who are getting work done between meetings? Maybe this is more of a vibe thing, so it's not trivial/logical to explain, but in this space there are so many loosely defined concepts that really trigger skepticism in me (and others).
The ugly: 1 word slides ;p
I guess it's only a matter of fine-tuning.
LLMs have lots of experience with bash, so I get that they figure out how to work with it. They don't have experience with the custom tools you provide.
Also, LLM "tools" as we know them need better design (to show state, dynamic actions).
Given both, an AI with the right tools will outperform an AI with a generic, uncontrolled tool.
I just had to laugh and link this other article by the author https://ghuntley.com/internet/
I'm not sure if he has a solar array, but I assume so?