280 points | zachwills | 33 comments
1. CuriouslyC ◴[] No.45229400[source]
As someone who's built a project in this space, this is incredibly unreliable. Subagents don't get a full system prompt (including stuff like CLAUDE.md directions), so they're flying blind in your project and will tend to get derailed by their lack of project knowledge, veering into mock solutions and "let me just make a simpler solution that demonstrates X."

I advise people to only use subagents for stuff that is very compartmentalized because they're hard to monitor and prone to failure with complex codebases where agents live and die by project knowledge curated in files like CLAUDE.md. If your main Claude instance doesn't give a good handoff to a subagent, or a subagent doesn't give a good handback to the main Claude, shit will go sideways fast.

Also, don't lean on agents for refactoring. Their ability to refactor a codebase goes in the toilet pretty quickly.

replies(5): >>45229506 #>>45229671 #>>45230608 #>>45230768 #>>45230775 #
2. theshrike79 ◴[] No.45229506[source]
I don't use subagents to do things; they're best for analysing things.

Like "evaluate the test coverage" or "check if the project follows the style guide".

This way the "main" context only gets the report and doesn't waste space on massive test outputs or reading multiple files.

replies(1): >>45229574 #
3. olivermuty ◴[] No.45229574[source]
This is only a problem if an agent is made in a lazy way (all of them).

Chat completion sends the full prompt history on every call.

I am working on my own coding agent and seeing massive improvements by rewriting history using either a smaller model or a freestanding call to the main one.

It really mitigates context poisoning.
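
A minimal sketch of that history-rewriting step, assuming the OpenAI Python SDK; the summarizer model and prompt wording are placeholders, not olivermuty's actual setup:

```python
from openai import OpenAI

client = OpenAI()

def rewrite_history(messages, keep_recent=6, summarizer_model="gpt-4o-mini"):
    """Compact older turns into notes and keep the most recent turns verbatim."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    resp = client.chat.completions.create(
        model=summarizer_model,  # assumption: any small, cheap model works here
        messages=[{
            "role": "user",
            "content": "Rewrite this agent transcript as concise notes. Keep "
                       "decisions, file paths and constraints; drop dead ends "
                       "and failed attempts:\n\n" + transcript,
        }],
    )
    summary = resp.choices[0].message.content
    # The main model then sees a compact summary plus the recent turns verbatim.
    return [{"role": "system", "content": f"Compacted history:\n{summary}"}] + recent
```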

replies(3): >>45229616 #>>45229701 #>>45230376 #
4. mattmanser ◴[] No.45229616{3}[source]
Everyone complains that when you compact the context, Claude tends to get stupid.

Which, as far as I understand it, is summarizing the context with a smaller model.

Am I misunderstanding you? The practical experience of most people seems to contradict your results.

replies(1): >>45230007 #
5. zarzavat ◴[] No.45229671[source]
> Their ability to refactor a codebase goes in the toilet pretty quickly.

Very much this. I tried to get Claude to move some code from one file to another. Some of the code went missing. Some of it was modified along the way.

Humans have strategies for refactoring, e.g. "I'm going to start from the top of the file and Cut code that needs to be moved and Paste it in the new location". LLMs don't have a clipboard (yet!) so they can't do this.

Claude can only reliably do this refactoring if it can keep the start and end files in context. This was a large file, so it got lost. Even then it needs direct supervision.

replies(4): >>45230552 #>>45231336 #>>45231629 #>>45232104 #
6. CuriouslyC ◴[] No.45229701{3}[source]
There's a large body of research on context pruning/rewriting (I know because I'm knee deep in benchmarks in release prep for my context compiler), definitely don't ad hoc this.
replies(1): >>45230798 #
7. NitpickLawyer ◴[] No.45230007{4}[source]
One key insight I have from having worked on this from the early stages of LLMs (before chatgpt came out) is that the current crop of LLM clients or "agentic clients" don't log/write/keep track of success over time. It's more of a "shoot and forget" environment right now, and that's why a lot of people are getting vastly different results. Hell, even week to week on the same tasks you get different results (see the recent claude getting dumber drama).

Once we start to see that kind of self feedback going in next iterations (w/ possible training runs between sessions, "dreaming" stage from og RL, distilling a session, grabbing key insights, storing them, surfacing them at next inference, etc) then we'll see true progress in this space.

The problem is that a lot of people work on these things in silos. The industry is much more geared towards quick returns now, having to show something now, rather than building strong foundations based on real data. Kind of an analogy to early Linux dev. We need our own Linus, it would seem :)

replies(3): >>45230079 #>>45230179 #>>45232239 #
8. troupo ◴[] No.45230079{5}[source]
> don't log/write/keep track of success over time.

How do you define success of a model's run?

replies(1): >>45230217 #
9. ako ◴[] No.45230179{5}[source]
I’ve experimented with feature chats, so I start a new chat for every change, just like a feature branch. At the end of a chat I’ll have it summarize the feature chat and save it as a markdown document in the project, so the knowledge is still available for the next chats. Seems to work well.

You can also ask the llm at the end of a feature chat to prepare a prompt to start the next feature chat so it can determine what knowledge is important to communicate to the next feature chat.

Summarizing a chat also helps with getting rid of wrong info, as you’ll often trial-and-error towards the right solution. You don’t want these incorrect approaches to leak into the context of the next feature chat; maybe just add the “don’t dos” into a guidelines and rules document so it will avoid them in the future.
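
A small sketch of that persistence step (the paths and file layout here are assumptions for illustration, not ako's actual setup):

```python
from datetime import date
from pathlib import Path

def save_feature_summary(slug: str, summary_md: str, dont_dos: list[str]) -> None:
    """Persist an end-of-chat summary and 'don't do' lessons into the repo."""
    docs = Path("docs/feature-chats")
    docs.mkdir(parents=True, exist_ok=True)
    # One markdown file per feature chat, like a merged feature branch.
    (docs / f"{date.today()}-{slug}.md").write_text(summary_md)
    # Append hard-won "don't dos" to a guidelines file so wrong approaches from
    # trial-and-error don't leak into the next feature chat's context.
    with Path("docs/GUIDELINES.md").open("a") as f:
        for item in dont_dos:
            f.write(f"- Don't: {item}\n")
```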

replies(2): >>45230805 #>>45232007 #
10. NitpickLawyer ◴[] No.45230217{6}[source]
Lots of ways. You could do binary thumbs up/down. You could do a feedback session. You could look at signals like "acceptance rate" (for a pr?) or "how many feedback messages did the user send in this session", and so on.

My point was more on tracking these signals over time. And using them to improve the client, not just the model (most model providers probably track this already).
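
A rough sketch of what tracking those signals over time could look like (the JSONL format and field names are illustrative, not from the comment):

```python
import json
import time
from pathlib import Path

LOG = Path("agent_sessions.jsonl")

def log_session(task_id: str, accepted: bool, feedback_messages: int,
                thumbs: int | None = None) -> None:
    """Append one record per agent session so success can be compared over time."""
    record = {
        "ts": time.time(),
        "task_id": task_id,
        "accepted": accepted,                    # e.g. was the PR merged?
        "feedback_messages": feedback_messages,  # how much correcting did the user do?
        "thumbs": thumbs,                        # optional explicit +1 / -1
    }
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
```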

replies(1): >>45231673 #
11. ixsploit ◴[] No.45230376{3}[source]
I do something similar, and I get the best results from not having a history at all, but setting the context anew with every invocation.
12. diggan ◴[] No.45230552[source]
> Humans have strategies for refactoring, e.g. "I'm going to start from the top of the file and Cut code that needs to be moved and Paste it in the new location". LLM don't have a clipboard (yet!) so they can't do this.

For my own agent I have a `move_file` and `copy_file` tool with two args each, which at least GPT-OSS seems able to use whenever it suits, like for moving stuff around. I've seen it use them as part of refactoring as well: moving a file to one location, copying that to another, then trimming both of them, but with different trims. Seems to have worked OK.

If the agent has access to `exec_shell` or similar, I'm sure you could add `Use mv and cp if you need to move or copy files` to the system prompt to get it to use that instead, probably would work in Claude Code as well.
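
For illustration, such two-argument tools can be as simple as the sketch below (only the tool names come from the comment above; the implementation is assumed):

```python
import shutil
from pathlib import Path

def move_file(src: str, dst: str) -> str:
    """Move src to dst, creating parent directories as needed."""
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    shutil.move(src, dst)
    return f"moved {src} -> {dst}"

def copy_file(src: str, dst: str) -> str:
    """Copy src to dst (with metadata), creating parent directories as needed."""
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return f"copied {src} -> {dst}"
```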

13. prash2488 ◴[] No.45230608[source]
Totally agreed. I tried agents for a lot of stuff (I started by creating a team of agents: architect, frontend coder, backend coder and QA). Spent around 50 USD on a failed project; the context got contaminated and the project eventually had to be rewritten.

Then I moved some parts into rules and some parts into slash commands, and I got much better results.

Subagents are like freelance contractors (I know, I have been one very recently): good when they need little handoff (not possible in real time) and little overseeing, and when their results are advice rather than action. They don't know what you are doing, and they don't care what you do with the info they produce. They just do the work for you while you do something else, or you wait for them to produce independent results. They come and go with little knowledge of existing functionality, but they're good on their own.

Here are 3 agents I still keep and one I am working on.

1: Scaffolding: I create (and sometimes destroy) a lot of new projects, so I use a scaffolding agent when I am trying something new. It starts with a fresh one-line instruction about what to scaffold (e.g. a new Docker container with Hono and a Postgres connection, or a new Cloudflare Worker that will connect to R2, D1 and AI Gateway, or an AWS serverless API Gateway with SQS that does this, that and that) and where to deploy. At the end of the day it sets up the project structure, creates a GitHub repo and commits it for me. I take it forward from there.

2: Triage: When I face an issue that is not obvious from reading code alone, I give the agent the place and some logs, and it will use whatever is available (including the DB data) to make a best guess at why the issue happens. I've often found it works best when it is not biased by recent work.

3: Pre-release QA check: This QA agent tests the entire system (essentially calling all integration and end-to-end test suites) to make sure the product doesn't break anything existing. Now I am adding functionality to let it see the original business requirement and check whether the code satisfies it or not. I want this agent to be my advisor, helping me decide whether something goes to the release pipeline or not.

4: Web search (experimental): Sometimes a search is too costly in tokens, and we only need the end result, not what the agent searched and those 10 pages it found...

14. stingraycharles ◴[] No.45230768[source]
It was my understanding that the subagents have the same system prompt. How do you know that they don’t follow CLAUDE.md directions?

I’ve been using subagents since they were introduced and it has been a great way to manage context size / pollution.

replies(1): >>45232144 #
15. quijoteuniv ◴[] No.45230775[source]
My experience so far, after trying to keep CC on track with different strategies, is that it will more or less end up in the same ditch sooner or later. Even though I had defined agents, workflows, etc., now I just let it interact with GitHub issues and the quality is pretty much the same.
16. spariev ◴[] No.45230798{4}[source]
Care to give some pointers on what to look at? Looks like I will be doing something similar soon, so that would be much appreciated.
replies(1): >>45232183 #
17. rufasterisco ◴[] No.45230805{6}[source]
I too have discovered that feature chats are surely a winner (as well as a prerequisite for parallelization).

In a similar vein, I match GitHub project issues to md files committed to the repo.

Essentially, the GitHub issue content is just a link to the md file in the repo. Also, epics are folders with links (+ a readme that gets updated after each task).

I am very happy about it too.

It's also very fast and handy to reference from Claude using @, i.e.: did you consider what has been done @

Other major improvements that worked for me were:
- DOC_INDEX.md, built around the concept of "read this if you are working on X (infra, db, frontend, domain, ...)"
- COMMON_TASKS.md (if you need to do X read Y; if you need to add a new frontend component read HOW_TO_ADD_A_COMPONENT.md)

Common tasks tend to increase quality when they are expressed in a checklist format.

18. lupire ◴[] No.45231336[source]
Remember 20 years ago when Eclipse could move a function by manipulating the AST and following references to adjust imports and callers, and it didn't lose any code?
replies(3): >>45232292 #>>45233943 #>>45234165 #
19. wahnfrieden ◴[] No.45231629[source]
Codex’s model is much better at actually reading large volumes of code, which improves its results compared with CC.
20. troupo ◴[] No.45231673{7}[source]
Ah. Yes!

My somewhat terse/bitter question was because yesterday Claude kept claiming to have created a "production-ready" solution that was completely, entirely wrong.

I would've loved to have the feedback loop you describe.

21. dpkirchner ◴[] No.45232007{6}[source]
I ask the bot to come up with a list of "don't dos"/lessons learned based on what went right or required lots of edits. Then I have it merge them into an ongoing list. It works OK.
22. brookst ◴[] No.45232104[source]
Claude’s utility really drops when any task requires a working set larger than the context window.

On the one hand, it’s kind of irritating when it goes great-great-great-fail.

On the other hand, it really enforces the best practices of small classes, small files, separation of concerns. If each unit is small enough it does great.

Unfortunately, it’s also fairly verbose and not great at recognizing that it is writing the same code over and over again, so I often find some basic file has exploded to 3000 lines, and a simple “identify repeated logic and move it into functions” prompt shrinks it to 500 lines.

23. CuriouslyC ◴[] No.45232144[source]
A few youtubers have done deep dives on this, monitoring claude traffic through a proxy. Subagents don't get the system prompt or anything else, they get their subagent prompt and whatever handoff the main agent gives them.

I was on the subagent hype train myself for a while, but as my codebases have scaled (I have a couple of codebases up to almost 400k now), subagents have become a lot more error-prone, and now I cringe when I see them used for anything challenging and immediately escape out. They seem to work great with more greenfield projects though.

replies(1): >>45233653 #
24. CuriouslyC ◴[] No.45232183{5}[source]
Just ask ChatGPT about the state of the art in context pruning and other methods of optimizing the context provided to an LLM; it's a good research helper. The right mental model is that it's basically RAG in reverse: instead of trying to select and rank from a data set, you're trying to select and rank from the context given a budget.
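
A toy sketch of that "select and rank from context given a budget" framing; the scoring is left abstract (real systems use embeddings, rerankers or learned pruners):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float  # relevance to the current task (assumed precomputed)
    tokens: int   # token count for this chunk

def pack_context(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Greedily select chunks by score density until the token budget is spent."""
    selected, used = [], 0
    for c in sorted(chunks, key=lambda c: c.score / max(c.tokens, 1), reverse=True):
        if used + c.tokens <= budget:
            selected.append(c)
            used += c.tokens
    return selected
```
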
25. CuriouslyC ◴[] No.45232239{5}[source]
The difference between agents and LLMs is that agents are easy to tune online, because unlike LLMs they're 95% systems software: the prompts, the tools, the retrieval system, the information curation/annotation, context injection, etc. I have a project that's still in its early stages that can monitor queries in ClickHouse for agent failures, group/aggregate them into post-mortem classes, then do system parameter optimization on the retrieval/document annotation system and invoke DSPy on low-efficacy prompts.
26. Yeroc ◴[] No.45232292{3}[source]
I think it's likely that these agent-based development tools will inevitably add more imperative tools to their arsenal to lower cost and improve speed and accuracy.
27. wild_egg ◴[] No.45233653{3}[source]
I have a bunch of homegrown CLI tools in my $PATH that are only described in the CLAUDE.md file. My subagents use these tools perfectly as if they have full instructions on their use but no such instructions are in the subagent prompts.

This should not be possible if they don't have CLAUDE.md in their context.

My main agent prompt always has a complete ban on the main agent doing any work itself. All work is done by subagents, which it coordinates.

I've been doing this for 2-3 months now on projects upwards of 200k lines and the results have been incredible.

I'm very confused how so many of us can have such completely different experiences with these tools.

replies(1): >>45237358 #
28. mleo ◴[] No.45233943{3}[source]
It’s still early days for these agents. There isn’t any reason agents won’t build or understand ASTs in the future to refactor more quickly.
replies(1): >>45234170 #
29. CuriouslyC ◴[] No.45234165{3}[source]
I have a suite of agent tools that is just waiting on my search service for a release. It includes `srefactor` and `spatch` commands that have fuzzy semantic alignment with strong error guards; they use LSP and tree-sitter to enable refactoring/patching without line numbers or anything, and to ensure the patch is correct.
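
Not the actual `srefactor`/`spatch` tools, but a minimal sketch of the tree-sitter idea: locate a function by name rather than by line number, so a "move this function" edit can't silently drop code (assumes the `tree-sitter` and `tree-sitter-python` packages):

```python
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

parser = Parser(Language(tspython.language()))

def extract_function(source: bytes, name: str) -> tuple[bytes, bytes]:
    """Return (function_source, remaining_source) for a top-level def `name`."""
    tree = parser.parse(source)
    for node in tree.root_node.children:
        if node.type == "function_definition":
            ident = node.child_by_field_name("name")
            if ident and source[ident.start_byte:ident.end_byte] == name.encode():
                func = source[node.start_byte:node.end_byte]
                rest = source[:node.start_byte] + source[node.end_byte:]
                return func, rest
    raise ValueError(f"function {name!r} not found")
```
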
replies(1): >>45234543 #
30. CuriouslyC ◴[] No.45234170{4}[source]
Why do the agents need to build or understand it? Just give them tools to work with it like we would.
replies(1): >>45234288 #
31. LtdJorge ◴[] No.45234288{5}[source]
Everyone is talking about MCP and they haven’t figured this out. Actually, JetBrains has an IDE MCP server plugin, although I haven’t tried it.
32. catlifeonmars ◴[] No.45234543{4}[source]
Nice. This sounds like the right approach. As an aside, it’s crazy that a mature LSP server is not a first class requirement for language choice in 2025. I used to write mini LSP servers before working on a project starting when LSP came out a few years ago. Now that there is wider adoption, I don’t find myself reaching for this quite as often, but it’s still a really nice way to ease development on mature codebases that have grown their own design patterns.
33. stingraycharles ◴[] No.45237358{4}[source]
Yes, same for me: I specify using “direnv exec .” as a prefix to every command, and the subagents follow this without issue.

On the Claude Code Reddit communities there’s basically a constant outrage about CC’s performance over the past few months, it seems different people have vastly different experiences with these tools.

There appears to be a lot of anecdotal evidence everywhere, and not enough hard facts. Anthropic’s lack of transparency about how everything works and interacts is at least a factor in this.