For my money it's by far the best Claude Code complement.
Let me stop you right there. Are you seriously talking about predictability when the subject is a non-deterministic black box over which you have no control?
Predictability and determinism are related but different concepts.
A system can be predictable in a probabilistic sense, rather than an exact, deterministic one. This means that while you may not be able to predict the precise outcome of a single event, you can accurately forecast the overall behavior of the system and the likelihood of different outcomes.
https://philosophy.stackexchange.com/questions/96145/determi...
Similarly, a system can be deterministic yet unpredictable due to practical limitations like sensitivity to initial conditions (chaos theory), lack of information, or the inability to compute predictions in time.
This style of context engineering has definitely been the way to go for me, although I’ve just implemented it myself, using Claude to help generate commands and agents and tweaking them to my liking. Lately I’ve been using JSON as well as Markdown to share context between steps.
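For illustration, one of those JSON handoff files might look something like this (the field names are just an example of my own, nothing the tooling requires):

    {
      "task": "add-rate-limiting",
      "relevant_files": ["lib/my_app/api/throttle.ex"],
      "decisions": ["use a token bucket per API key"],
      "open_questions": ["where should limits be configured?"],
      "acceptance_criteria": ["return 429 with a Retry-After header"]
    }

Nothing in it is special; the point is just that the next step can read a small, stable shape instead of the whole chat history.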
Quantum mechanics is non-deterministic, yet you can predict the motion of objects with exquisite precision.
All these "non-deterministic boxes" will give the same answer to the question "What is the capital of France?"
Maybe someone can elaborate better, but there seems to be no such luck mapping probability onto problems the way "AI" is being used today. It's not just a matter of feeding it more data, but of finding what data you haven't fed it, or in some cases knowing you can't feed it certain data because we have no known way to represent what is obvious to humans.
Having used nearly all of the methods in the original article, I can predict that the output of the model is nearly indistinguishable from a coin toss for many, many, many rather obvious reasons.
The details of how penicillin kills bacteria were discovered in the 2000s, only about half a century after its commercial production began. And I'm quite sure we'll still see some more missing puzzle pieces in the future.
I see one mention of brownfield development. Has anyone with experience using these frameworks fired up Claude Code on enterprise software and had confident results? I have unchecked access to Claude Code at work, and based on personal agentic coding I’m sure they do aid it. I have decent but not consistent results with my own “system” in our code base, at least until front-end UI components are involved, even with Playwright. But I’m curious: how much litter is left behind? How is your coworker tolerance? How large are your pull requests? What is your inference cost? How do these manage parallel work?
The README documentation for many of them is a mix of fevered infomercial, system-specific jargon, emoji splatter, and someone’s dad’s very specific toolbox organization scheme that only he understands. Some feel like they’re setting the stage to sell something…trademarked!? Won’t Anthropic and others just incorporate the best of the bunch into their CLI tools in time?
Outside of work I’ve regularly used a reasoning model to produce a ten-page spec, wired my project with the strictest lint, type checks, formatter, and hooks, and instructed it to check items off as it goes and do red/green TDD. I can tell gpt-5 in Cursor to “go”, occasionally nudge it to stay on task with “ok next”, and in time I’ll end up with what I wanted, plus gold plating. The last one was a CLI tool for my agents to invoke and track their own work. Anyone with the same tools can just roll their own.
When I'm in the terminal I can call on Agents who can create standardised documents so there is a memory of the product management side of things that extends beyond the context window of Claude.
It guides you through the specification process so that you have extremely tight tasks for Claude to churn through, each with the relevant context, documentation and acceptance criteria.
Perhaps there are others similar, but I have found it completely transformative.
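To give a flavour, a task spec might end up looking roughly like this (the headings and content here are purely illustrative, not the tool's exact template):

    Task: Add CSV export to the reports page
    Context: reports are rendered by ReportsController; reuse the existing download helper
    Documentation: link to the reporting spec and the download helper docs
    Acceptance criteria:
      - an "Export CSV" button appears on /reports
      - the downloaded file contains the currently filtered rows
      - existing report tests still pass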
Frankly, I don’t understand how software engineers (not coders, mind you) can have issues with non-deterministic tools while browsing the web on a network that can stop working at any time for any reason.
For certain, the results are better when I use it to build new features into our platform - as opposed to making complicated refactors or other deep changes to existing parts of the system. But even in the latter case, if we have good technical documentation capturing the design and how parts of the system work (which we don't in many places), Claude Code can make good progress.
At first I was seeing a fair amount of what I would consider "bad code": implementations that either didn't follow accepted coding style and patterns or simply weren't structured for reusability and maintainability. But after strengthening the CLAUDE.md file and adding an "elixir-code-reviewer" subagent that the "developer" persona had to use, the quality of the code improved significantly.
Our platform is open source, you can see our current Claude commands and subagents here: https://github.com/Simon-Initiative/oli-torus/tree/master/.c...
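If you haven't set one up before, a subagent is just a Markdown file with YAML frontmatter under .claude/agents/. A simplified sketch (the real elixir-code-reviewer in the repo is far more detailed):

    ---
    name: elixir-code-reviewer
    description: Reviews Elixir changes for style, OTP conventions, and maintainability.
    tools: Read, Grep, Glob
    ---
    You are a senior Elixir code reviewer. Check new code against the rules
    in CLAUDE.md, flag non-idiomatic constructs, and report findings back to
    the caller instead of rewriting the code yourself.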
In my own experience, this type of stuff is just wishful thinking right now: for anything non-trivial, you still need to monitor Claude Code closely and interrupt when you see it going down the wrong train of thought.
Additionally, for security reasons, you don’t want to give it too many permissions, and you do want to actually see which commands it’s executing.
The “frameworks” OP talks about are still far away. Right now the best way to think about it is as an intern who is usually wrong but can crank out code at lightning speed.
"Elixir lists do not support index based access via the access syntax"
"Never use else if or elseif in Elixir, always use cond or case for multiple conditionals."
An AI tool finding issues in a set of YAML and Markdown files generated by an AI tool, and humans puzzled by all of it.
> We should really have some code reviewer...
Gemini to the rescue!
Here is the relevant change; it didn't have any sort of hidden complexity: https://github.com/Prunt3D/prunt/commit/b4d7f5e35be6017846b8...
First you'd have to prove that LLMs can be equated to a "top tier human developer".
> I would, in the sense that it will be well designed and implemented code that meets the requirements.
Indeed. Something LLMs can or cannot do with all the predictability of a coin toss.
Would you still call that predictable? Of course you would, as long as they meet your requirements. Put another way, anything is unpredictable depending on your level of scrutiny. AI is likely less predictable than a human, but that doesn’t mean it isn’t helpful. You are free to dismiss it, of course.
I'll put it concisely:
Trying to build predictable results upon unpredictable, not fully understood mechanisms is an extremely common practice in every single field.
But anyway, you think an LLM is just a coin toss, so I won't engage with this sub-thread anymore.
Nothing in the current AI world is as predictable as, say, the medicine you can buy or get prescribed. None of the shamanic "just one more prompt bro" rituals have the predictive power of physical laws. Etc.
You could reflect on that.
> But anyway, you think an LLM is just a coin toss
A person telling me to "try to read comments" couldn't read and understand my comment.
I tend to lean towards them being snake oil. A lot of process and ritual around using them, but for what?
I don't think the models themselves are a good fit for the way these frameworks are being used. It probably goes against their training.
Now we try to poison the context with lots of (for my actual task at hand) useless information so that the model can conform to my superficial song-and-dance process? This seems backwards.
I would argue that we need less context poisoning with useless information. Give the model the most precise information for the actual work to be done and iterate upon that. The song and dance process should happen outside of the context constrained agent.
I do agree that context poisoning is a real thing to watch out for. Coincidentally, I’d noticed MCP endpoint definitions had started taking a substantial block of context for me (~20k tokens), and that’s now something I consider when adopting any MCP.
The new /context command in Claude Code is great for visualizing what uses how much of the context.
On the other hand, I'm curious about dagger's container-use MCP. https://container-use.com/agent-integrations
---
link: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
One difference is that we have less control over the context, to add or remove things as each task requires.
Why recycle the full history into every future turn until you run out of context window?
Perhaps if the agent manages its own context, knowing what makes context effective, the harm of overrunning it, and how to make that tradeoff smartly, it can navigate tasks better?
Key word: "as long as they meet your requirements".
I've yet to meet an LLM that can predictably do that. Even on the same code with the same tools/prompt/rituals a few hours apart.
> AI is likely less predictable than human, doesn’t mean it isn’t helpful.
I'm struggling to see where I said they weren't helpful or that I dismissed them.
Also, that study was from early 2025, before Claude 4, which to me was a big breakthrough in productivity; I did not really find these tools too useful before using Sonnet 4.
Maybe the future is fine-tuned models on specific coding styles?
Do you know there are approved drugs that were put on the market to treat one ailment and have since proven to have an effect on another, or have been shown to have unwanted side effects, and have therefore been shifted to other uses? The whole drug _market_ is full of them, and all that is needed is enough trials to prove the desired effect...
The LLM output is yours to decide if it is relevant to your work or not, but it seems that your experience is consistently subpar compared with what others have reported.
Yes, I know. Doesn't really disprove my point.
> all that is needed is enough trials to prove the desired effect
"All that is needed", lol. You mean multi-stage trials with baselines, control groups, testing against placebos, etc.?
Compared to "yolo just believe me" of LLMs.
> The LLM output is yours to decide if it is relevant to your work or not, but it seems that your experience is consistently subpar compared with what others have reported.
Indeed, because all we can do with those reports is have blind, unquestioning faith. "Just one more prompt, and I swear it will be 100% more efficient", with literally nothing to judge efficiency by, no baselines, nothing.
Huh? Can you elaborate? I thought the claim was that predictable output is the gold standard and variance in LLM output means they can never rival humans.
Please restate if I missed why deterministic output is so important.