It tends to work better when you give the LLMs some specific narrow subtask to do rather than expecting them to be in the driver's seat.
The move by Cloudflare will totally ruin the AI scraper and the AI agent hype.
Of course "agents" is now a buzzword that means nothing, so there is that.
What Claude Code has taught me is that steering an agent via a test suite is an extremely powerful reinforcement mechanism (the feedback loop leads to success, most of the time) -- and I'm hopeful that new thinking will extend this into the other "soft skills" that an agent needs to become an increasingly effective collaborator.
The way I use AI today is by keeping a pretty tight leash on it, a la Claude Code and Cursor. Not because the models aren't good enough, but because I like to weigh in frequently to provide taste and direction. Giving the AI more agency isn't necessarily desirable, because I want to provide that taste.
Maybe that'll change as I do more and new ergonomics reveal themselves, but right now I don't really want AI that's too agentic. Otherwise, I kind of lose connection to it.
My experience is that, for many workflows, well-done “prompt engineering” is more than enough to make AI models behave more like we’d like without constantly needing us to weigh in.
They’ll just get the agent to operate a browser with vision and it’s over. CAPTCHAs were already obsolete like 2-3 years ago.
If we use a real world analogy, think of someone like an architect designing your house. I'm still going to be heavily involved in the design of my house, regardless of how skilled and tasteful the architect is. It's fundamentally an expression of myself - delegating that basically destroys the point of the exercise. I feel the same for a lot of the stuff I'm building with AI now.
We see these patterns so much that we packaged them up for Airflow (one of the most popular workflow tools)!
It had a lot of moving parts, of which agents were the top 30% that other systems would interact with. Storing, retrieving, and ranking the information was the more important 70% that isn't as glamorous and that no one makes courses about.
I still have no idea why everyone is talking about whatever the hottest decoder-only model is; encoder-only models are a lot more useful for most tasks not directly interfacing with a human.
From your comments, I’d venture a guess that you see your AI-assisted work as a creative endeavor — an expression of your creativity.
I certainly wouldn’t get my hopes up for AI to make innovative jokes, poems and the like. Yet for things that can converge on specific guidelines for matters of taste and preferences, like coding, I’ve been increasingly impressed by how well AI models adapt to our human wishes, even when expressed in ever longer prompts.
For example, a single prompt could tell an LLM to make sure a code change doesn't introduce mutability when the same functionality can be achieved with immutable expressions. Another one could tell it to avoid useless log statements (with my specific description of what that means).
When I want to evaluate a code change, I run all these prompts separately against it, collecting their structured output (via MCP). Of course, I incorporate this in my code-agent to provide automated review iterations.
If something escapes where I feel the need to "manually" provide context, I add a new prompt (or figure out how to extend whichever one failed).
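Roughly, the evaluation loop looks like this (a simplified sketch: the client, model name, and prompt wording are placeholders, and plain JSON output stands in for the MCP plumbing):

```python
# Simplified sketch: run each single-concern review prompt separately against a
# diff and collect structured verdicts. Assumes an OpenAI-compatible client;
# the model name and prompt texts are placeholders, not the real setup.
import json
from openai import OpenAI

client = OpenAI()

REVIEW_PROMPTS = {
    "immutability": (
        "Flag changes that introduce mutability where an immutable expression "
        "would achieve the same functionality. "
        'Reply as JSON: {"violations": [{"line": 0, "reason": ""}]}'
    ),
    "logging": (
        "Flag log statements that add no diagnostic value. "
        'Reply as JSON: {"violations": [{"line": 0, "reason": ""}]}'
    ),
}

def review(diff: str) -> dict:
    """Run every focused review prompt against the diff and gather the verdicts."""
    results = {}
    for name, rule in REVIEW_PROMPTS.items():
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": rule},
                {"role": "user", "content": diff},
            ],
        )
        results[name] = json.loads(resp.choices[0].message.content)
    return results
```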
I suspect a reason so many people are excited about agents is they are used to "chat assistants" as the primary purpose of LLMs, which is also the ideal use case for agents. The solution space in chat assistants is not defined in advance, and more complex interactions do get value from agents. For example, "find my next free Friday night and send a text to Bob asking if he's free to hang out" could theoretically be programmatically solved, but then you'd need to solve for every possible interaction with the assistant; there are a nearly unlimited number of ways of interfacing with an assistant, so agents are a great solution.
In the end the agentic coding bit was garbage, but I appreciated Claude’s help on writing the boilerplate to interface with Stockfish.
I do agree - the models have good taste and often do things that delight me, but there's always room for me to inject my taste. For example, I don't want the AI to choose what state management solution I use for my Flutter app because I have strong opinions about that.
- creating the right context for parallel and recursive tasks;
- removing some steps (e.g., editing its previous response) to show only the corrected output;
- showing it its own output as my comment, when I want a response;
Etc.
Obvious: while the agent can multiply the amount of work I can do, there's a multiplicative reduction in quality, which means I need to account for that (I have to add "time doing curation")
*https://www.slideserve.com/verdi/seng-697-agent-based-softwa...
By the time you've got a nice, well-established context with the right info... just give it to the user.
I like the idea of hallucination-free systems where the LLM merely classifies things at most.
Question -> classifier -> confirm the action to take with the user -> act using no AI
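A minimal sketch of that shape, assuming a fixed set of known actions (the action names and the classifier stub below are just illustrative):

```python
from typing import Callable

# The only things the system can actually do: plain deterministic code, no AI.
ACTIONS: dict[str, Callable[[], None]] = {
    "reset_password": lambda: print("calling internal reset-password API..."),
    "cancel_order":   lambda: print("calling internal cancel-order API..."),
}

def classify(question: str) -> str:
    """Map free text to one label from ACTIONS (or 'none').
    Stub: in practice this is a single constrained LLM completion whose output
    is restricted to the known labels."""
    return "none"

def handle(question: str) -> None:
    label = classify(question)
    if label not in ACTIONS:
        print("No supported action recognised; hand off to a human.")
        return
    # A human confirms before anything happens; past this point no AI is involved.
    if input(f"Run '{label}' for {question!r}? [y/N] ").strip().lower() == "y":
        ACTIONS[label]()
```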
I think there's some truth to using the right orchestration for the job, but I think that there's a lot more jobs that could benefit from agentic orchestration than the article would have you believe.
Hard disagree with most of the narrative. Don't start with models, start with Claude Code. For any use case. Go from there depending on costs.
> When NOT to use agents
> Enterprise Automation
Archive this blog.
The real lesson is: don't let any company other than the providers dictate what an agent is vs isn't.
Computer use agents are here; they are coming for the desktop of non-technical users; they will provide legitimate RPA capability and beyond; and anyone productizing agents will build on top of provider SDKs.
I used to build the way most of his examples work: just functions calling LLMs. I found it almost necessary due to poor tool selection, etc. But I think the leading edge LLMs like Gemini 2.5 Pro and Claude 4 are smart enough and good enough at instruction following and tool selection that it's not necessarily better to create workflows.
I do have a checklist tool and delegate command and may break tasks down into separate agents though. But the advantage of creating instructions and assigning tool commands, especially if you have an environment with a UI where it is easy to assign tool commands to agents and otherwise define them, is that it is more flexible and a level of abstraction above something like a workflow. Even for visual workflows it's still programming which is more brittle and more difficult to dial in.
This was not the case 6-12 months ago and doesn't apply if you insist on using inferior language models (which most of them are). It's really only a handful that are really good at instruction following and tool use. But I think it's worth it to use those and go with agents for most use cases.
The next thing that will happen over the following year or two is going to be a massive trend of browser and computer use agents being deployed. That is again another level of abstraction. They might even incorporate really good memory systems and surely will have demonstration or observation modes that can extract procedures from humans using UIs. They will also learn (record) procedural details for optimization during exploration from verbal or written instructions.
More seriously, yes, it makes sense that LLMs are not going to be able to take humans entirely out of the loop. Think about what it would mean if that were the case: if people, on the basis of a few simple prompts, could let the agents loose and create sophisticated systems without any further input, then there would be nothing to differentiate those systems, and thus they would lose their meaning and value.
If prompting is indeed the new level of abstraction we are working at, then what value is added by asking Claude: make me a note-taking app? A million other people could also issue this same low-effort prompt; thus what is the value added here by the prompter?
The old adage still applies: there is no free lunch.
If you skip the modeling part and rely on something that you don't control being good enough, that's faith, not engineering.
The goal _should_ be to avoid doing traditional software engineering or creating a system that requires typical engineering to maintain.
Agents with leading edge LLMs allow smart users to have flexible systems that they can evolve by modifying instructions and tools. This requires less technical skill than visual programming.
If you are only taking advantage of the LLM to handle a few wrinkles or a little bit of natural language mapping then you aren't really taking advantage of what they can do.
Of course you can build systems with rigid workflows and a sprinkling of LLM integration, but for most use cases it's probably not the right default mindset for mid-2025.
Like I said, I was originally following that approach a little ways back. But things change. Your viewpoint is about a year out of date.
10-15 years ago the challenge in ML/PR was "feature engineering", the careful crafting of rules that would define features in the data which would draw the attention of the ML algorithm.
Then deep learning came along and it solved the issue of feature engineering; just throw massive amounts of data at the problem and the ML algorithms can discern the features automatically, without having to craft them by hand.
Now we've gone as far as we can with massive data, and the problem seems to be that it's difficult to bring out the relevant details when there's so much data. Hence "context engineering": a manual, heuristic-heavy process guided by trial and error and intuition. More an art than a science. Pretty much the same thing that "feature engineering" was.
Although sometimes the difficult part is knowing what to make, and LLMs are great for people who actually know what they want, but don’t know how to do it
You're YOLOing it, and okay that may be fine but may also be a colossal mistake, especially if you remove or never had a human in the loop.
The callout on enterprise automation is interesting b/c it's one of the $T sized opportunities that matters most here, and while I think the article is right in the small, I now think quite differently in the large for what ultimately matters here. Basically, we're crossing the point where one agent written in natural language can easily be worth ~100 python scripts and be much shorter at the same time.
For context, I work with operational enterprise/gov/tech co teams like tier 1+2 security incident response, where most 'alerts' don't get seriously investigated as under-resourced & under-automated teams have to just define them away. Basically ever since GPT-4, it's been pretty insane figuring this stuff out with our partners here. As soon as you get good at prompt templates / plans with Claude Code and the like to make them spin for 10min+ productively, this gets very obvious.
Before agents:
Python workflows and their equivalent. They do not handle variety & evolution because they're hard-coded. Likewise, they only go so far on a task because they're brain dead. Teams can only crank out + maintain so many.
After agents:
You can easily sketch out 1 investigation template in natural language that literally goes 10X wider + 10X deeper than the equiv of Python code, including Python AI workflows. You are now handling much more of the problem.
What's good prompting for one model can be bad for another.
For me, Claude Code completely ignores the instruction to read and follow AGENTS.md, and I have to remind it every time.
The joys of non-deterministic blackboxes.
No.
--- start quote ---
prompt engineering is nothing but an attempt to reverse-engineer a non-deterministic black box for which any of the parameters below are unknown:
- training set
- weights
- constraints on the model
- layers between you and the model that transform both your input and the model's output that can change at any time
- availability of compute for your specific query
- and definitely some more details I haven't thought of
https://dmitriid.com/prompting-llms-is-not-engineering
--- end quote ---
Spamming is not only obnoxious, but a very weak example. Spamming is so error tolerant that if 30% of the output is totally wrong, the sender won't notice. Response rates are usually very low. This is a singularly un-demanding problem.
You don't even need "AI" for this. Just score LinkedIn profiles based on keywords, and if the score is high enough, send a spam. Draft a few form letters, and send the one most appropriate for the keywords. Probably would have about the same reply rate.
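A toy sketch of that non-AI version (the keywords, weights, threshold, and templates are made up for illustration):

```python
# Score a profile on keyword hits, and only send a message when the score clears
# a threshold; the "most appropriate" form letter is picked by keyword as well.
KEYWORDS = {"kubernetes": 3, "platform engineering": 2, "sre": 2, "devops": 1}
TEMPLATES = {
    "infra":   "Hi {name}, saw your infra background...",
    "generic": "Hi {name}, quick question...",
}
THRESHOLD = 4

def score(profile_text: str) -> int:
    text = profile_text.lower()
    return sum(weight for kw, weight in KEYWORDS.items() if kw in text)

def pick_message(profile_text: str, name: str) -> str | None:
    if score(profile_text) < THRESHOLD:
        return None  # below threshold: send nothing
    key = "infra" if "kubernetes" in profile_text.lower() else "generic"
    return TEMPLATES[key].format(name=name)
```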
I have been working on LLMs since 2017, both training some of the biggest and then creating products around them, and I still consider that I have no experience with agents.
GPT-3, while being impressive at the time, was too bad to even let it do that; it would break after 1 or 2 steps, so letting it do anything by itself would have been a waste of time where the human in the loop would always have to re-do everything. Its planning ability was too poor and hallucinations way too frequent to be useful in those scenarios.
https://gist.github.com/artpar/60a3c1edfe752450e21547898e801...
(especially AGENT.knowledge is quite helpful)
It would be helpful to know which models were used in each scenario; otherwise this can largely be ignored.
I'd also be interested in your process for creating these files, such as examples of prompts, tools, and references for your research.
See also https://ai.intellectronica.net/the-case-for-ai-workflows
> Can you provide any form of demonstration of an LLM reading these files and acting accordingly
Claude does update them at the end of the session (I say "wrap up" in the prompt). The ones you are seeing in that gist are the original forms; they evolve with each commit.
Do you know of any kind of write up (by you or someone else) on this topic? Admittedly I never really spent too much time on this since I was working on pre-training, but I did try to do a few smart things with it and it pretty much failed at every thing, in big part because it wasn't even instruction tuned, so was very much still an autocomplete model.
So I'd be curious to learn more about how people got it to succeed at agentic behaviors.
Do you think, just maybe, it might be interesting to play around with these tools without worrying about how productive you're being?
I'd have to do this anyways, if I was writing the code myself, so this is not "time above what I'd normally spend"
The visuals it makes for me I can inspect and easily tell if it is on the right path, or wrong. The test suite is a sharper notion of "this is right, this is wrong" -- more sharp than just visual feedback and my directions.
The basic idea is to setup a feedback loop for the agent, and then keep the agent in the loop, and observe what it is doing. The visuals are absolutely critical -- as a compressed representation of the behavior of the codebase, which I can quickly and easily parse and recognize if there are issues.
I think you’ll find that after 10 years one’ll look back on oneself at 5 years’ experience and realise that one wasn’t an expert back then. The same is probably true of 20 years looking back on 10.
Given a median career of about 40 years, I think it’s fair to estimate that true expertise takes at least 10–15 years.
> most agent systems break down from too much complexity, not too little
...when the platform wasn't made to handle complexity. The main problem is that the "frameworks" are not good enough for agentic workloads, which naturally will scale into complex stateful chaos. This requires another approach, but all that is done is delegating this to LLMs. As the author says "A coordinator agent that managed task delegation", which is the wrong way, an easy exit, like "maybe it will vibe-state itself?".
Agentic systems existed before LLMs (check ABM), and nowadays most ppl confuse what LLMs give us (all-knowing subconscious DBs) with agency, which is the purpose of completing a process. E.g., a bus driver is an agent, but you don't ask a bus driver to play the piano. The agent has predefined behavior, within a certain process.
Another common mistake is considering a prompt (with or without history) an agent. It's just a DB model which you query. A deep research agent has 3 prompts: check if an answer is possible, scrape, and answer. These are NOT 3 agents - these are DB queries. Delegating logical decisions to LLMs without verification is like having a drunk bus driver. A new layer is needed, which all the Python frameworks offer on top of their prompts. That's a mistake, because it splits the control flow, and managing complex state with FSMs or imperative code will soon hit a wall.
Declarative programming to the rescue - this is the only (and also natural) way of handling live and complex systems. It has to be done from the bottom up, and it will change the paradigm of the whole agent. I've worked on this exact approach for a while now, and besides handling complexity, the 2nd challenge is navigating through it easily, to find answers to your questions (what and when, exactly, went wrong). I let LLMs "build" the dynamic parts of the agent (like planning), but keep them under IoC - only the agent layer makes decisions. Another important thing - small prompts, with a single task; 100 focused prompts are better than 1 pasta-prompt. Again, without a proper control flow, synchronizing 100 co-dependent prompts can be tricky (when approached imperatively, with e.g. a simple loop).
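A minimal sketch of what I mean by declaring small, single-task prompts and keeping the control flow in the agent layer (the step names and the call_llm hook are illustrative, not my actual implementation):

```python
# Each prompt is declared with one narrow task and its dependencies; a tiny
# scheduler decides what runs next, so the LLM never owns the control flow (IoC).
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    prompt: str                      # one narrow task per prompt
    deps: list[str] = field(default_factory=list)

STEPS = [
    Step("feasible", "Can this question be answered from the given sources? yes/no"),
    Step("scrape",   "Extract the passages relevant to the question.", deps=["feasible"]),
    Step("answer",   "Answer using only the extracted passages.",      deps=["scrape"]),
]

def run(steps: list[Step], call_llm) -> dict[str, str]:
    """call_llm(prompt, context) is whatever LLM backend you plug in."""
    done: dict[str, str] = {}
    pending = {s.name: s for s in steps}
    while pending:
        # Pick any step whose dependencies are satisfied; the agent layer,
        # not the LLM, decides execution order.
        ready = next(s for s in pending.values() if all(d in done for d in s.deps))
        done[ready.name] = call_llm(ready.prompt, context=done)
        del pending[ready.name]
    return done
```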
There's more to it, and I recommend checking out my agents (research and cook), either as a download, source code, or a video walk-through [0].
PS. Embrace chaos, and the chaos will embrace you.
TL;DR: toy frameworks in Python, ppl avoiding coding, drunk LLMs
My idea of a good time is understanding the system in depth and building it while trusting it does what I expect. This is going away, though.
By saying "you should just use agents", anyone who has read the article will assume that you're talking about the case where there's no human in the loop.
If anything, such tedious obsessions can just cloud a person's mind against creating something interesting that does turn out to also have long-term import. I mean, I assume I'm talking to either a troll or an idiot given the weird rant you replied with, but it's good to remember that value doesn't always come in a specifically molded form.