Now build it for an old codebase; let's see how precisely it edits or removes features without breaking the whole codebase.
And let's see how many tokens it consumes per bug fix or feature addition.
The whole thing runs on these prompts: https://github.com/SWE-agent/mini-swe-agent/blob/7e125e5dd49...
> Your task: {{task}}. Please reply with a single shell command in triple backticks. To finish, the first line of the output of the shell command must be 'COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT'.
1. Precompute frequently used knowledge and surface it early: for example, repository structure, OS information, and system time.
2. Anticipate the next tool calls. If a match is not found while editing, instead of simply failing, return the closest matching snippet (see the sketch after this list). If the read-file tool gets a directory, return the directory contents.
3. Parallel tool calls. Claude needs either a batch tool or special scaffolding to encourage parallel tool calls; a single tool call per turn is very expensive.
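For idea 2, here is a minimal sketch of what that fallback could look like, assuming a Python harness; the tool name, signature, and return strings are made up for illustration:

```python
import difflib

def edit_file(path: str, old: str, new: str) -> str:
    """Hypothetical edit tool: replace `old` with `new`, but on a miss,
    return the closest matching snippet instead of a bare failure."""
    text = open(path).read()
    if old in text:
        open(path, "w").write(text.replace(old, new, 1))
        return "OK: edit applied."
    # Anticipate the model's next move: surface the nearest snippet so it
    # can correct its match string without an extra exploratory turn.
    lines = text.splitlines()
    window = max(1, len(old.splitlines()))
    candidates = ["\n".join(lines[i:i + window]) for i in range(len(lines))]
    close = difflib.get_close_matches(old, candidates, n=1, cutoff=0.0)
    return f"No exact match. Closest snippet:\n{close[0]}" if close else "File is empty."
```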
Are there any other such general ideas?
I am still looking for a good "memory" solution; so far I'm running without one. I haven't looked too deeply into it.
I'm not sure how the next tool call can be predicted.
I am still using serial tool calls, as I do not have any subagents; I just use fast inference models for direct tool calls. It works so fast that I doubt I'll benefit from parallelizing anything.
I just wrote this comment so people aren't under the false belief that this is pretty much all coding agents do; making all of this fault-tolerant with good UX is a lot of work.
I think the future will be dashboards/HUDs (there was an article on HN about this a bit ago and I agree). You'll get preview windows, dynamic action buttons, a kanban board, status updates, and still the ability to edit code yourself, of course.
The single-file lineup of agentic actions with user input, in a terminal chat UI, just isn't gonna cut it for more complicated problems. You need faster error reporting from multiple sources, and you need to be able to correct the LLM and break it out of error loops. You won't want to be at the terminal, even though it feels comfortable, because it's just the wrong HCI tool for more complicated tasks. Can you tell I really dislike using these overly simple agents?
You'll get a much better result with a dashboard/HUD. The future of agents is that multiple of them will be working at once on the codebase and they'll be good enough that you'll want more of a status-update-confirm loop than an agentic code editing tool update.
Also required is better code editing. You want to avoid the LLM making changes in your code unrelated to the requested problem. Gemini CLI often does a 'grep' for keywords from your prompt to find the right file, but if your prompt was casual and doesn't contain the right keywords, you end up with the agent making changes that aren't intended.
Obviously I am working in this space so that's where my opinions come from. I have a prototype HUD-style webapp builder agent that is online right now if you'd like to check it out:
It's not got everything I said above - it's a work in progress. Would love any feedback you have on my take on a more complicated, involved, and narrow-focus agentic workflow. It only builds Flask webapps right now, with strict limits on what it can do (no cron etc. yet), but it does have a database you can use in your projects. I put a lot of work into the error flow as well, as that seems like the biggest issue with a lot of agentic code tools.
One last technical note: I blogged about using AST transformations when getting LLMs to modify code. I don't think diffs or rewriting the whole file are the right solution either. I think having the LLM write code that modifies your code, and then running that code to effect the modifications, is the way forward. We'll see, I guess. Blog post: https://codeplusequalsai.com/static/blog/prompting_llms_to_m...
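To make the AST idea concrete, here is a hypothetical example (not taken from the blog post) of the kind of transformation code an LLM might emit instead of a diff, using Python's stdlib ast module; the file and function names are made up:

```python
import ast

# Hypothetical code an LLM might write: instead of emitting a diff, it
# emits a transformation that renames a function and every call site.
class RenameFunction(ast.NodeTransformer):
    def __init__(self, old: str, new: str):
        self.old, self.new = old, new

    def visit_FunctionDef(self, node):
        if node.name == self.old:
            node.name = self.new
        self.generic_visit(node)
        return node

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            node.func.id = self.new
        self.generic_visit(node)
        return node

source = open("app.py").read()  # "app.py" and the names below are illustrative
tree = RenameFunction("fetch_data", "load_data").visit(ast.parse(source))
open("app.py", "w").write(ast.unparse(tree))  # ast.unparse needs Python 3.9+
```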
Gemini CLI still uses the archaic whole-file format for edits; it's not a good representative of the current state of coding agents.
That's not the case with a codebase, where things are scattered around according to the specific model of organisation the developer had in mind.
You wish
This prompt snippet from your instance template is quite useful. I use something like this for getting out of debug loops:
> Analyse the codebase and brainstorm a list of potential root causes for the issue, and rank them from most likely to least likely. Then create scripts or add debug logging to confirm whether your hypothesis is correct. Rule out root causes from most likely to least likely by executing your scripts and observing the output.
Surely listing files, searching a repo, and editing a file can all be achieved with bash?
Or is this what's demonstrated by https://news.ycombinator.com/item?id=45001234?
There are a few models that solve 30-50% of (new) tasks pulled from real-world repos. So ... yeah.
If everything goes through bash, then you need some way to separate always-safe commands that don't need approval (such as listing files) from all other potentially unsafe commands that require user approval.
If you have listing files as a separate tool, then you can also enforce that the agent doesn't list any files outside of the project directory.
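Both points are easy to sketch in a Python harness; the allowlist, operator blocklist, and project root below are illustrative rather than exhaustive (classifying shell commands as safe is genuinely hard to get right):

```python
from pathlib import Path

PROJECT_ROOT = Path("/home/user/project").resolve()   # illustrative
SAFE_COMMANDS = {"ls", "cat", "grep", "find", "head"}  # read-only allowlist

def needs_approval(command: str) -> bool:
    """Auto-approve only bare invocations of read-only commands."""
    # Shell operators would let a "safe" command smuggle in an unsafe
    # one (e.g. `ls; rm -rf /`), so their mere presence forces approval.
    if any(op in command for op in ("|", ">", ";", "&", "`", "$(")):
        return True
    words = command.split()
    return not (words and words[0] in SAFE_COMMANDS)

def list_files(path: str) -> list[str]:
    """A dedicated tool can refuse to look outside the project."""
    target = (PROJECT_ROOT / path).resolve()
    target.relative_to(PROJECT_ROOT)  # raises ValueError if it escapes
    return sorted(p.name for p in target.iterdir())
```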
I've built a SWE agent too (for fun), check it out => https://github.com/myriade-ai/autocode
My best guess is that they started out with a limited subset of tools and realised they could just give it bash later.
One of the reasons you get better performance if you give them the other tools is that there has been some reinforcement learning on Sonnet with all these tools. The model is aware of how these tools work; it is more token-efficient and generally much more successful at performing those actions. The bash tool, for instance, at times gets confused by bashisms, not escaping arguments correctly, not handling whitespace correctly, etc.
This saves the LLM from having to do lots of low-level clicking and typing and keeps it on track. Help the poor model out, will ya!?
Why the unnecessary AI-generated pictures in between?
Why put everything that could have been a bullet point into its own individual picture (even if it's not AI-generated)? It's very visually distracting, it breaks the flow of reading, and it's less accessible, as all the pictures lack alt text.
---
I see that it's based on a conference talk, so it's possibly just the slides 1:1. If that's the case, please put it up in its native conference format rather than this.
> The Bash tool, for instance, at times gets confused by bashisms, not escaping arguments correctly, not handling whitespace correctly etc.
This was the only informative sentence in the reply. Can you please go on in this manner - it was an important question. This project and this post are for the curious and for the learners.
This is a very strong argument for more specific tools, thanks!
Interesting! This didn't seem to be the case in the OP's examples - for instance, using a list_files tool and then checking whether the JSON result included README, vs. bash [ -f README ].
https://github.com/SWE-agent/mini-swe-agent/blob/7e125e5dd49...
> right tools allow small models to perform better than undirected tool like bash to do everything.
Interestingly enough, the newer mini-swe-agent was a refutation of this hypothesis for very large LLMs; the original SWE-agent paper (https://arxiv.org/pdf/2405.15793) assumed that specialized tools work better.
(Also, I think Gemini is significantly better when it comes to context rot; in my experience, 100K-300K tokens were required for symptoms to appear. So burning tokens is less problematic with Gemini.)
Money. Replace "tokens" with "money". You just keep throwing money at the loop, and then you've got yourself an agent.
Yes, it is. Not only in the department of good UX design; the LLMs themselves also keep evolving. They are software with different versions, and these different versions are continually deployed, which changes the behavior of the underlying model. So the harness needs to be continually updated to remain competitive.
They are great for basic tasks like summarization and translation, but for the best results from coding agents - and for the 90% of so-called AI startups who are using these APIs - it all comes down to purchasing tokens.
It's no different to operating a slot machine aimed at vibe-coders, who are the AI companies' favourite type of customer: spending endless amounts of money on tokens for another spin at fixing an error they don't understand.
And remember to avoid feeding the trolls.
I live in the “valley”. I battle depression daily that I had before LLMs.
Using LLMs and false guardrails to watchdog inherently deceitful output is a bad system smell.
I know most are “on it”, and I’ve written a coding agent.
But why is this page designed like some brainwashing repetitive Orwellian mantra?
If it’s perceived that we need that, then we’re having to overcome something, and that something is common sense.
So maybe we’ll happily write our coding agents with the intent to stand on the shoulders of a giant.
But everyone knows we’re building the technological equivalent of a crystal meth empire.
There are theoretically impossible things to do, if you buy into only the basics. If you open your mind, anything is achievable; you just need to break out of the box you’re in.
If enough people keep feeding in that we need a time machine, the revolution will play out in all the timelines. Without it, Sarah Connor is lost.
That's why critique has value. To the original author/artist (if they see it), but also to everyone else who sees it. "Oh, I was going to intersperse text slides with a transcript, but I remember how offputting that was once on HN, so let's skip the slides."
But with edge-case exceptions aside, yes, tokens cost money.
https://docs.anthropic.com/en/docs/agents-and-tools/tool-use...
The disconnect here is that models aren't really "text"-based but token-based, like how compilers don't operate on the code itself but on a series of tokens that can include keywords, brackets, and other things. The output can include words but also metadata.
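You can see this directly with, for example, OpenAI's tiktoken library (the cl100k_base vocabulary here is just one example; other models use other vocabularies):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("bash: ls -la /src")
print(ids)              # a short list of integer token IDs, not characters
print(enc.decode(ids))  # round-trips back to the original string
```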
There is no training on a tool with that name. But it likely also doesn't need training, because the parameter is just a path and it's a pretty basic tool.
On the other hand, to know how to execute a bash command, you need to know bash. Bash is a known tool to the Claude models [1], and so is text editing [2]. You're supposed to reference those in the tool listing, but at least from my testing, the moment you call a tool "bash", Claude makes plenty of assumptions about what the point of this thing is.
[1]: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use...
[2]: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use...
As far as I know, that's what's happening. They are training it to return tool responses when it's unsure about the answer or instructed to do so. There are generic tool trainings for just following the response format, and then probably some tool-specific trainings. For instance, gpt-oss loves to use the search tool, even if it's not mentioned anywhere. Anthropic lists well-known tools in their documentation (e.g. text_editor, bash); those are likely to have been trained specifically to follow some deeper semantics, beyond just generic tool usage.
The whole thing is pretty brittle, and tool invocations just take place via in-band signalling, delineated by special tokens or token sequences.
Negative things like IP stealing "AI" can be stopped as well, and the population is increasingly watchful and will organize itself at some point.
* The only true interface with an LLM is tokens. (No separation between control and data channels.)
* The model api layer injects instructions on tool calling and a list of available tools into the base prompt, with documentation on what those tools do.
* Tool calling is delineated by special tokens. When a model wants to call a tool, it adds a special block to the response containing the magic token(s) along with the name of the tool and any params. The API layer then extracts this and forms a structured JSON response in some tool_calls parameter (or whatever) that is sent in the API response to the user. The result of the tool coming back from the user through the tool-calling API is then encoded with special tokens and injected. (A sketch of this flow follows the list.)
* Presumably, the API layer prevents the user from injecting such tokens themselves.
* SotA models are good at tool calls because they have been heavily fine-tuned on them, with all sorts of tasks that involve tool calls, like bash invocations. The fine-tuning is both to get them good at tool calls in general and probably also covers specific tool calls the model provider wants them to be good at, such as Claude Sonnet getting fine-tuned on the specific tools Claude Code uses.
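A toy sketch of that extraction step (the sentinel strings here are invented; real providers use reserved special tokens and their own wire formats):

```python
import json
import re

# Invented delimiters standing in for the model's special tokens.
TOOL_CALL = re.compile(r"<\|tool_call\|>(.*?)<\|/tool_call\|>", re.DOTALL)

def extract_tool_calls(model_output: str) -> list[dict]:
    """What the API layer does: scan the raw output for the magic
    delimiters and lift each payload into structured JSON."""
    return [json.loads(payload) for payload in TOOL_CALL.findall(model_output)]

raw = 'Let me check.<|tool_call|>{"name": "bash", "arguments": {"command": "ls"}}<|/tool_call|>'
print(extract_tool_calls(raw))
# [{'name': 'bash', 'arguments': {'command': 'ls'}}]
```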
Sometimes it amazes me that this all works so well, but it does. You are right to put your finger on the fine-tuning, as it’s critical for making tool calling work well. Tool calling works without fine-tuning, but it’s going to be more hit-or-miss.
If you need to edit the source, just use patch with the bash tool.
What's the efficiency issue?
The good: it's cool to learn more about agent loops, different types of LLMs, and ideas for prompting. I definitely wanna try it - it would be cool to prompt the agent to build some feature, leave it in a loop of building, testing, and reviewing, go have breakfast, and come back to only have to tweak reasonably legible, working code.
The bad: some of these concepts - maybe they aren't meant to mislead, but they really trigger my 'snake oil alert'. The AI compass? Agentic vs. non-agentic LLMs? People who are getting work done between meetings? Maybe this is more of a vibe thing, so it's not trivial/logical to explain, but in this space there are so many loosely defined concepts that really trigger skepticism in me (and others).
The ugly: 1 word slides ;p
I guess it's only a matter of fine-tuning.
LLMs have lots of experience with bash, so I get that they figure out how to work with it. They don't have experience with the custom tools you provide.
Also, LLM "tools" as we know them need better design (to show state, dynamic actions).
Given both, an AI with the right tools will outperform an AI with a generic, uncontrolled tool.
I just had to laugh and link this other article by the author https://ghuntley.com/internet/
I'm not sure if he has a solar array, but I assume so?