Having more context but still being unable to focus effectively on the latest task is the real problem.
Coding agents choke on our big C++ code-base pretty spectacularly if asked to reference large files.
People have tried to expand context windows by reducing the O(n^2) attention mechanism to something sparser, and it tends to perform very poorly. It will take a fundamental architectural change.
I have multiple things I'd love LLMs to attempt to do, but the context window is stopping me.
You load the thing up with relevant context and pray that it guides the generation path to the part of the model that represents the information you want, and pray that the path of tokens through the model outputs what you want.
That's why they have a tendency to go ahead and do things you tell them not to do.
also IDK about you but I hate how much praying has become part of the state of the art here. I didn't get into this career to be a fucking tech priest for the machine god. I will never like these models until they are predictable, which means I will never like them.
I know there are classes of problems that LLMs can't natively handle (like doing math, even simple addition... or spatial reasoning; I would assume time is in there too). There are ways they can hack around this, like writing code that performs the math.
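As a concrete version of that hack (my own illustration, not anyone's production setup): the model emits an arithmetic expression and the harness evaluates it with a tiny safe evaluator instead of trusting the model's mental math. `model_output` is just a stand-in for whatever the model returned.

    # Minimal sketch: evaluate the expression the model wrote instead of
    # trusting the LLM's own arithmetic. `model_output` is a stand-in.
    import ast
    import operator

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def eval_arithmetic(expr: str) -> float:
        """Safely evaluate a plain arithmetic expression like '1234 * 5678 + 9'."""
        def walk(node):
            if isinstance(node, ast.Expression):
                return walk(node.body)
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            raise ValueError("only basic arithmetic allowed")
        return walk(ast.parse(expr, mode="eval"))

    model_output = "1234 * 5678 + 9"      # pretend the LLM produced this instead of an answer
    print(eval_arithmetic(model_output))  # 7006661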
But how would you do that for chronological reasoning? Because that would help with compacting context to know what to remember and what not.
That is, humans usually don't store exactly what was written in a sentence five paragraphs ago, but rather the concept or idea conveyed. If we need details we go back and reread or similar.
And when we write or talk, we first form an overall thought about what to say, then we break it into pieces and order the pieces somewhat logically, before finally forming words that make up sentences for each piece.
From what I can see there's work on this, like this[1] and this more recent paper[2]. Again, I'm not an expert so I can't comment on the quality of the references; they're just some I found.
In fact, I've found LLMs are reasonable at the simple task of refactoring a large file into smaller components, with documentation on what each portion does, even if they can't get the full context immediately. Doing this then helps the LLM later. I'm also of the opinion we should be making codebases LLM-compatible. So if it comes up, I direct the LLM that way for 10 minutes and then get back to the actual task once the codebase is in a more reasonable state.
I could see in C++ it getting smarter about first checking the .h files or just grepping for function documentation, before actually trying to pull out parts of the file.
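Roughly the kind of thing I mean, as a sketch (the paths and the symbol name are made up, and a real agent would probably just shell out to grep or ripgrep): pull declaration snippets out of the headers and hand the model those first, only opening the full .cpp on request.

    # Sketch: grep the headers for a symbol and return only the matching
    # lines plus a little surrounding context. Paths/symbol are illustrative.
    import pathlib

    def header_snippets(root: str, symbol: str, context: int = 3) -> list[str]:
        snippets = []
        for header in pathlib.Path(root).rglob("*.h"):
            lines = header.read_text(errors="ignore").splitlines()
            for i, line in enumerate(lines):
                if symbol in line:
                    lo, hi = max(0, i - context), i + context + 1
                    snippets.append(f"// {header}:{i + 1}\n" + "\n".join(lines[lo:hi]))
        return snippets

    # Feed these snippets to the model first; only open the full .cpp on request.
    for s in header_snippets("src", "ParseConfig"):
        print(s, "\n---")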
That is what transformer attention does in the first place, so you would just be stacking two transformers.
> Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
https://www.laws-of-software.com/laws/kernighan/
Sure, you eat the elephant one bite at a time, and recursion is a thing but I wonder where the tipping point here is.
Of course, subagents are a good solution here, as another poster already pointed out. But it would be nice to have something more lightweight and automated, maybe just turning on a mode where the LLM is asked to throw things out according to its own judgement, if you know you're going to be doing work with a lot of context pollution.
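Something like this sketch is what I mean by "lightweight": every so often, ask the model itself which earlier turns it still needs and drop the rest. `ask_model` is a stand-in for whatever chat API you're using, and the prompt wording is just illustrative.

    # Rough sketch of a "self-pruning" mode. `ask_model` is a stand-in for
    # the actual chat API; the prompt wording is illustrative only.
    def prune_history(history: list[dict], ask_model) -> list[dict]:
        listing = "\n".join(f"[{i}] {m['role']}: {m['content'][:80]}"
                            for i, m in enumerate(history))
        answer = ask_model(
            "Here is the conversation so far, one turn per line:\n"
            f"{listing}\n"
            "Reply with the indices of turns still needed for the current task, "
            "comma-separated, nothing else.")
        keep = {int(tok) for tok in answer.split(",") if tok.strip().isdigit()}
        keep.add(len(history) - 1)            # never drop the most recent turn
        return [m for i, m in enumerate(history) if i in keep]

    # Example with a fake model that decides only turn 0 still matters:
    history = [{"role": "user", "content": "set up the repo"},
               {"role": "assistant", "content": "done, here is a 5k-token log..."},
               {"role": "user", "content": "now fix the failing test"}]
    print(prune_history(history, lambda prompt: "0"))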
You may appreciate this illustration I made (largely with AI, of course): https://imgur.com/a/0QV5mkS
The context (heheheh) is a long-ass article on coding with AI I wrote eons ago that nobody ever read, if anybody is curious: https://news.ycombinator.com/item?id=40443374
Looking back at it, I was off on a few predictions but a number of them are coming true.
I really want to paraphrase Kernighan's law as applied to LLMs: "If you use your whole context window to code a solution to a problem, how are you going to debug it?"
The new session throws away whatever behind-the-scenes context was causing problems, but the prepared prompt gets the new session up and running more quickly especially if picking up in the middle of a piece of work that's already in progress.
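For what it's worth, the "prepared prompt" in my case is basically a filled-in template along these lines; the field names are just what tends to matter mid-task, nothing standard.

    # Sketch of a handoff prompt the old session fills in before being killed,
    # so a fresh session can pick up mid-task. Field names are illustrative.
    HANDOFF_TEMPLATE = """You are resuming work already in progress.
    Goal: {goal}
    Done so far: {done}
    Current blocker: {blocker}
    Next step: {next_step}
    Relevant files: {files}
    Do not revisit decisions listed under 'Done so far'."""

    def handoff_prompt(goal, done, blocker, next_step, files):
        return HANDOFF_TEMPLATE.format(goal=goal, done="; ".join(done),
                                       blocker=blocker, next_step=next_step,
                                       files=", ".join(files))

    print(handoff_prompt(
        goal="make the importer stream instead of buffering",
        done=["profiled hot path", "switched reader to chunked API"],
        blocker="tests assume whole-file reads",
        next_step="update test fixtures to chunked input",
        files=["importer/reader.cpp", "tests/test_reader.cpp"]))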
Language itself is a highly compressed form of context. Like when you read "hoist with one's own petard" you don't just think about a literal petard but about the context behind the phrase.
"that's because a next token predictor can't "forget" context. That's just not how it works."
An LSTM is also a next-token predictor and literally has a forget gate, and there are many other context-compressing models too that remember only what they think is important and forget the less important parts, for example state-space models or RWKV, which work well as LLMs too. Even a basic GPT model forgets old context, since it gets truncated if it can't fit, but that's not the learned, smart forgetting the other models do.
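For anyone who hasn't seen it, the forget gate really is just one line of math, c_t = f_t * c_{t-1} + i_t * c~_t, where f_t = sigmoid(W_f [h_{t-1}, x_t] + b_f) decides how much of the old cell state survives. A toy numeric sketch with random weights, just to show the shape of the computation:

    # Toy LSTM forget-gate update; sizes and weights are arbitrary.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    hidden, inp = 4, 3
    W_f, b_f = rng.normal(size=(hidden, hidden + inp)), np.zeros(hidden)
    W_i, b_i = rng.normal(size=(hidden, hidden + inp)), np.zeros(hidden)
    W_c, b_c = rng.normal(size=(hidden, hidden + inp)), np.zeros(hidden)

    h_prev, c_prev = np.zeros(hidden), np.ones(hidden)
    x_t = rng.normal(size=inp)
    z = np.concatenate([h_prev, x_t])

    f_t = sigmoid(W_f @ z + b_f)          # ~0 means "forget", ~1 means "keep"
    i_t = sigmoid(W_i @ z + b_i)
    c_tilde = np.tanh(W_c @ z + b_c)
    c_t = f_t * c_prev + i_t * c_tilde    # old context explicitly down-weighted
    print(f_t, c_t)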
Look carefully at a context window after solving a large problem, and I think in most cases you'll see even the 90th percentile token --- to say nothing of the median --- isn't valuable.
However large we're allowing frontier model context windows to get, we've got an integer multiple more semantic space to allocate if we're even just a little bit smart about managing that resource. And again, this is assuming you don't recurse or divide the problem into multiple context windows.
This is how I designed my LLM chat app (https://github.com/gitsense/chat). I think agents have their place, but I really think that if you want to solve complex problems without needlessly burning tokens, you will need a human in the loop to curate the context. I will get to it, but I believe that, in the same way we developed different flows for working with Git, we will have different 'Chat Flows' for working with LLMs.
I have an interactive demo at https://chat.gitsense.com which shows how you can narrow the focus of the context for the LLM. Click "Start GitSense Chat Demos" then "Context Engineering & Management" to go through the 30 second demo.
Can you share your prompt?
Well, not so much the project organization stuff - it wants to stuff everything into one header and has to be browbeaten into keeping implementations out of headers.
But language semantics? It's pretty great at those. And when it screws up it's also really good at interpreting compiler error messages.
The replies of "well, just change the situation so context doesn't matter" are irrelevant and off-topic. The rationalizations even more so.
On actual bottles, without any metaphors, the bottleneck is narrower because human mouths are narrower.
In theory you could attach metadata (with timestamps) to these turns, or include the timestamp in the text.
It does not affect much, other than giving the model the possibility to make some inferences (e.g., that a previous message was on a different date, so its "today" is not the same "today" as in the latest message).
To chronologically fade away the importance of a conversation turn, you would need to either add more metadata (weak), progressively compact old turns (unreliable) or post-train a model to favor more recent areas of the context.
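A rough sketch of the first two (weak and unreliable) options combined: stamp each turn with a time, then compact anything older than a cutoff into a single summary turn. `summarize` is a stand-in for another model call or a cheap heuristic.

    # Timestamp each turn, then compact turns older than max_age into a
    # one-line summary turn. `summarize` is a stand-in.
    from datetime import datetime, timedelta, timezone

    def compact_old_turns(turns, summarize, max_age=timedelta(days=1)):
        now = datetime.now(timezone.utc)
        old = [t for t in turns if now - t["ts"] > max_age]
        recent = [t for t in turns if now - t["ts"] <= max_age]
        if not old:
            return turns
        summary = {"role": "system",
                   "ts": old[-1]["ts"],
                   "content": "Summary of earlier conversation: " +
                              summarize([t["content"] for t in old])}
        return [summary] + recent

    turns = [
        {"role": "user", "ts": datetime.now(timezone.utc) - timedelta(days=3),
         "content": "let's plan the migration"},
        {"role": "user", "ts": datetime.now(timezone.utc),
         "content": "ok, do step 2 today"},
    ]
    print(compact_old_turns(turns, lambda msgs: " / ".join(m[:40] for m in msgs)))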
I think with appropriate instructions in the system prompt it could probably work on this code-base more like I do (heavy use of Ctrl-, in Visual Studio to jump around and read only relevant portions of the code-base).
I think the important part is to give it (in my case, these days "it" is gpt-5-codex) a target persona, just like giving it a specific problem instead of asking it to be clever or creative. I've never asked it for a summary of a long conversation without the context of why I want the summary and who the intended audience is, but I have to imagine that helps it frame its output.
The main thing is people have already integrated AI into their workflows, so the "right" way for the LLM to work is the way people expect it to. For now I expect to start multiple fresh contexts while solving a single problem until I can set up a context that gets the result I want. Changing this behavior might mess me up.
I hadn't considered actually rolling my own for day-to-day use, but now maybe I will. Although it's worth noting that Claude Code Hooks do give you the ability to insert your own code into the LLM loop - though not to the point of Eternal Sunshining your context, it's true.
GPT-5 is brilliant when it oneshots the right direction from the beginning, but pretty unmanageable when it goes off the rails.
Tools like Aider create a code map that basically indexes code into a small context. Which I think is similar to what we humans do when we try to understand a large codebase.
I'm not sure if Aider can then load only portions of a huge file on demand, but it seems like that should work pretty well.
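A poor man's code map in the spirit of what Aider does (this is not its actual implementation, just an illustration): index each file down to its top-level signatures so the whole repo fits in a small context, and only pull full files on demand. The regex here only handles Python; for C++ you'd want a real parser like tree-sitter or libclang.

    # Build a tiny repo map: file path -> list of def/class signatures.
    import pathlib
    import re

    SIG = re.compile(r"^\s*(?:def|class)\s+\w+[^\n]*", re.MULTILINE)

    def code_map(root: str) -> dict[str, list[str]]:
        index: dict[str, list[str]] = {}
        for path in pathlib.Path(root).rglob("*.py"):
            sigs = [m.group(0).strip()
                    for m in SIG.finditer(path.read_text(errors="ignore"))]
            if sigs:
                index[str(path)] = sigs[:50]   # cap per file so the map stays small
        return index

    for f, sigs in code_map("src").items():
        print(f, "->", "; ".join(sigs[:3]))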
That may be the foundation for an innovation step among model providers. But you can achieve a poor man's simulation if you can determine, in retrospect, when a context was at its peak for taking turns, and when it got too rigid or too many tokens were spent, and then simply replay the context up until that point.
I don’t know if evaluating when a context is worth duplicating is a thing; it’s not deterministic, and it depends on enforcing a certain workflow.
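To make it concrete, a sketch of the replay idea: keep the raw turn list, snapshot a checkpoint while the session still looks healthy, and fork a fresh conversation from that prefix when it goes rigid. The "still at peak" heuristic here is just token count, which is exactly the non-deterministic judgment call I mean.

    # Keep the full turn list, checkpoint while "healthy", replay the prefix.
    class ReplayableChat:
        def __init__(self):
            self.turns: list[dict] = []
            self.checkpoint: int = 0

        def add(self, role: str, content: str) -> None:
            self.turns.append({"role": role, "content": content})
            # crude "still at peak" heuristic; a real one is the open question
            if sum(len(t["content"]) for t in self.turns) < 8000:
                self.checkpoint = len(self.turns)

        def fork_from_checkpoint(self) -> list[dict]:
            return [dict(t) for t in self.turns[: self.checkpoint]]

    chat = ReplayableChat()
    chat.add("user", "refactor the parser")
    chat.add("assistant", "x" * 20000)        # huge dump pushes us past "peak"
    chat.add("user", "why did you rewrite everything?")
    print(len(chat.fork_from_checkpoint()))   # 1: replay from before the dump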
The same protection works in reverse: if a subagent goes off the rails and either self-aborts or is aborted, that large context is truncated to the abort response, which is "salted" with the fact that this was stopped. Even if the subagent goes sideways and still returns success (say, separate dev, review, and test subagents), the main agent has another opportunity to compare the response and the product against the main context, or to instruct a subagent to do it in an isolated context.
Not perfect at all, but better than a single context.
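Roughly, the plumbing looks like this as a sketch (`run_subagent` is a stand-in for the actual agent runner, and the 500-character cap on results is arbitrary): the main agent only ever sees a short result record, and an abort gets replaced with a salted note instead of the subagent's whole context.

    # Main agent sees only a short record; aborted runs are "salted".
    def delegate(task: str, run_subagent) -> dict:
        try:
            full_transcript, result = run_subagent(task)
        except RuntimeError as err:                  # aborted / went off the rails
            return {"task": task, "status": "aborted",
                    "note": f"SUBAGENT ABORTED, do not trust partial work: {err}"}
        # success path: discard the transcript, keep only the distilled result
        return {"task": task, "status": "ok", "result": result[:500]}

    def flaky(task):
        raise RuntimeError("loop detected")

    ok = delegate("write unit tests", lambda t: ("...50k tokens...", "added 12 tests"))
    bad = delegate("review diff", flaky)
    print(ok["status"], bad["status"])   # ok aborted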
One other thing: there is some consensus that "don't", "not", and "never" are not always functional in context. And that is a big problem. Anecdotally and experimentally, many (including myself) have seen the agent diligently perform the exact thing following a "never" once it gets far enough back in the context, even when it's a less common action.
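One workaround I've been experimenting with (my own habit, not a general fix) is re-injecting the hard constraints near the end of the context on every turn, so they are always recent instead of sitting in a system prompt from 100k tokens ago. A sketch:

    # Pin standing "never" rules right before the newest user message.
    HARD_RULES = [
        "Never run destructive git commands (reset --hard, push --force).",
        "Never edit generated files under build/.",
    ]

    def with_rules_pinned(history: list[dict], user_msg: str) -> list[dict]:
        reminder = {"role": "system",
                    "content": "Standing constraints (still in force):\n- " +
                               "\n- ".join(HARD_RULES)}
        return history + [reminder, {"role": "user", "content": user_msg}]

    msgs = with_rules_pinned([{"role": "user", "content": "earlier stuff..."}],
                             "clean up the repo")
    print(msgs[-2]["content"])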
That said, some of the models out there (Gemini 2.5 Pro, for example) support 1M context; it's just going to be expensive and will still probably confuse the model somewhat when it comes to the output.
This isn't a misconception. Context is a limitation. You can effectively have an AI agent build an entire application with a single prompt if it has enough (and the proper) context. The models with 1m context windows do better. Models with small context windows can't even do the task in many cases. I've tested this many, many, many times. It's tedious, but you can find the right model and the right prompts for success.