
280 points zachwills | 2 comments | source
CuriouslyC ◴[] No.45229400[source]
As someone who's built a project in this space, this is incredibly unreliable. Subagents don't get a full system prompt (including stuff like CLAUDE.md directions) so they are flying very blind in your projects, and as such will tend to get derailed by their lack of knowledge of a project and veer into mock solutions and "let me just make a simpler solution that demonstrates X."

I advise people to only use subagents for stuff that is very compartmentalized: they're hard to monitor and prone to failure in complex codebases, where agents live and die by project knowledge curated in files like CLAUDE.md. If your main Claude instance doesn't give a good handoff to a subagent, or a subagent doesn't give a good handback to the main Claude, shit will go sideways fast.
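
For what it's worth, the "good handoff" part is mostly prompt plumbing. A rough sketch in Python of what I mean (illustrative only, not Claude Code's actual internals; the function and its fields are made up):

    from pathlib import Path

    def build_subagent_prompt(task: str, handoff_notes: str) -> str:
        # Re-read the curated project knowledge so the subagent isn't flying blind.
        notes_file = Path("CLAUDE.md")
        project_notes = notes_file.read_text() if notes_file.exists() else "(no CLAUDE.md found)"
        return (
            "You are a subagent working on a narrow, compartmentalized task.\n\n"
            f"Project conventions (from CLAUDE.md):\n{project_notes}\n\n"
            f"Handoff from the main agent:\n{handoff_notes}\n\n"
            f"Task: {task}\n\n"
            "Do not substitute mock or 'simpler demo' solutions; report exactly what you "
            "changed and what you could not verify."
        )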

Also, don't lean on agents for refactoring. Their ability to refactor a codebase goes in the toilet pretty quickly.

replies(5): >>45229506 #>>45229671 #>>45230608 #>>45230768 #>>45230775 #
theshrike79 ◴[] No.45229506[source]
I don't use subagents to do things; they're best for analysing things.

Like "evaluate the test coverage" or "check if the project follows the style guide".

This way the "main" context only gets the report and doesn't waste space on massive test outputs or reading multiple files.
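
Conceptually it's something like this (plain Python sketch; the llm.complete interface is a placeholder, not the real Claude Code subagent API):

    def analyze_with_subagent(llm, task: str, paths: list[str]) -> str:
        # The subagent gets its own fresh message list: the bulky file
        # contents and test output live (and die) in here.
        messages = [
            {"role": "system", "content": "You are an analysis subagent. Reply with a short report only."},
            {"role": "user", "content": task + "\n\n" + "\n\n".join(open(p).read() for p in paths)},
        ]
        return llm.complete(messages)

    # The main context appends only the returned report, not the raw files
    # or test logs the subagent had to read to produce it.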

replies(1): >>45229574 #
olivermuty ◴[] No.45229574[source]
This is only a problem if an agent is made in a lazy way (all of them).

Chat completion sends the full prompt history on every call.

I am working on my own coding agent and seeing massive improvements by rewriting history using either a smaller model or a freestanding call to the main one.

It really mitigates context poisoning.
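
Roughly what I mean by rewriting history, as a sketch (the complete() interface is a placeholder; the real thing has more moving parts):

    def rewrite_history(small_llm, messages: list[dict], keep_last: int = 6) -> list[dict]:
        # Keep the system prompt and the most recent turns verbatim;
        # squash everything in between into a compact summary.
        if len(messages) <= keep_last + 1:
            return messages
        system, old, recent = messages[0], messages[1:-keep_last], messages[-keep_last:]
        summary = small_llm.complete([
            {"role": "system", "content": "Summarize this coding session so far: decisions made, "
                                          "files touched, open problems. Drop tool noise."},
            {"role": "user", "content": "\n\n".join(str(m["content"]) for m in old)},
        ])
        return [system, {"role": "user", "content": f"(Rewritten history)\n{summary}"}] + recent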

replies(3): >>45229616 #>>45229701 #>>45230376 #
mattmanser ◴[] No.45229616[source]
Everyone complains that when you compact the context, Claude tends to get stupid.

Which, as far as I understand it, is summarizing the context with a smaller model.

Am I misunderstanding you? The practical experience of most people seems to contradict your results.

replies(1): >>45230007 #
NitpickLawyer ◴[] No.45230007[source]
One key insight I have from having worked on this since the early stages of LLMs (before ChatGPT came out) is that the current crop of LLM clients or "agentic clients" don't log/write/keep track of success over time. It's more of a "fire and forget" environment right now, and that's why a lot of people are getting vastly different results. Hell, even week to week on the same tasks you get different results (see the recent "Claude getting dumber" drama).

Once we start to see that kind of self-feedback going into the next iterations (with possible training runs between sessions, a "dreaming" stage from OG RL, distilling a session, grabbing key insights, storing them, surfacing them at the next inference, etc.), then we'll see true progress in this space.
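
To make that concrete, the distill/surface loop can start out as dumb as this (hypothetical sketch; the file format, prompts, and llm.complete interface are all made up):

    import json
    from pathlib import Path

    INSIGHTS = Path("agent_insights.jsonl")

    def distill_session(llm, transcript: str) -> None:
        # After a session, ask the model for a handful of durable lessons and store them.
        lessons = llm.complete([
            {"role": "system", "content": "Extract 3-5 durable lessons from this coding session "
                                          "(what worked, what failed, project quirks). One per line."},
            {"role": "user", "content": transcript},
        ])
        with INSIGHTS.open("a") as f:
            for line in lessons.splitlines():
                if line.strip():
                    f.write(json.dumps({"lesson": line.strip()}) + "\n")

    def surface_insights(limit: int = 20) -> str:
        # Prepend the most recent lessons to the next session's prompt.
        if not INSIGHTS.exists():
            return ""
        lessons = [json.loads(l)["lesson"] for l in INSIGHTS.read_text().splitlines() if l.strip()]
        return "Lessons from previous sessions:\n" + "\n".join(f"- {l}" for l in lessons[-limit:])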

The problem is that a lot of people work on these things in silos. The industry is much more geared towards quick returns, having to show something now, rather than building strong foundations based on real data. Kind of an analogy to early Linux dev. We need our own Linus, it would seem :)

replies(3): >>45230079 #>>45230179 #>>45232239 #
troupo ◴[] No.45230079{3}[source]
> don't log/write/keep track of success over time.

How do you define success of a model's run?

replies(1): >>45230217 #
NitpickLawyer ◴[] No.45230217{4}[source]
Lots of ways. You could do binary thumbs up/down. You could do a feedback session. You could look at signals like "acceptance rate" (for a PR?) or "how many feedback messages did the user send in this session", and so on.

My point was more about tracking these signals over time, and using them to improve the client, not just the model (most model providers probably track this already).
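
Concretely, even a dumb append-only log on the client side would be a start (sketch; every name and field here is made up):

    import json, time
    from pathlib import Path

    LOG = Path("session_signals.jsonl")

    def log_session(task_type: str, accepted: bool, followups: int, thumbs: int | None = None) -> None:
        # One row per session: did the result get accepted, how much corrective
        # back-and-forth did it take, optional explicit thumbs up/down.
        with LOG.open("a") as f:
            f.write(json.dumps({
                "ts": time.time(),
                "task_type": task_type,
                "accepted": accepted,
                "followups": followups,
                "thumbs": thumbs,
            }) + "\n")

    def acceptance_rate(task_type: str) -> float:
        # The interesting part is watching this number move across weeks and client versions.
        if not LOG.exists():
            return 0.0
        rows = [json.loads(l) for l in LOG.read_text().splitlines() if l.strip()]
        rows = [r for r in rows if r["task_type"] == task_type]
        return sum(r["accepted"] for r in rows) / len(rows) if rows else 0.0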

replies(1): >>45231673 #
troupo ◴[] No.45231673[source]
Ah. Yes!

My somewhat terse/bitter question was because yesterday Claude kept claiming to have created a "production-ready" solution that was completely, entirely wrong.

I would've loved to have the feedback loop you describe.