I also like the way they distinguish between "agents" and "workflows", and describe a bunch of useful workflow patterns.
I published some notes on that article when it first came out: https://simonwillison.net/2024/Dec/20/building-effective-age...
A more recent article from Anthropic is https://www.anthropic.com/engineering/built-multi-agent-rese... - "How we built our multi-agent research system". I found this one fascinating, I wrote up a bunch of notes on it here: https://simonwillison.net/2025/Jun/14/multi-agent-research-s...
And then, when you actually do need agents, don't overcomplicate it!
This post also introduced the concept of an Augmented LLM — an LLM hooked up to tools, memory, and data — which is a useful abstraction for evolving LLM use beyond fancy autocomplete.
“An augmented LLM running in a loop” is the best definition of an agent I’ve heard so far.
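In code, that definition is roughly the following (a minimal sketch using the Anthropic Python SDK; the lookup_order tool and its stub implementation are invented for illustration, and the model name may need updating):

```python
# A rough "augmented LLM in a loop": call the model, execute any tool it
# requests, feed the results back, and repeat until it stops asking for tools.
import anthropic

client = anthropic.Anthropic()
tools = [{
    "name": "lookup_order",  # hypothetical tool
    "description": "Fetch an order record by ID.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

def lookup_order(order_id: str) -> str:
    return f"Order {order_id}: shipped"  # stand-in for a real lookup

messages = [{"role": "user", "content": "Where is order 42?"}]
while True:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model ID
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # the model answered; the loop is done
    messages.append({"role": "assistant", "content": response.content})
    results = [
        {"type": "tool_result", "tool_use_id": block.id,
         "content": lookup_order(**block.input)}
        for block in response.content if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": results})
```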
And then eventually, with enough sample inputs, create simple functions that can recognize which tools should be used to process a type of input? And only fall back to an LLM agent if the input is novel?
I use Cloudflare's Durable Objects (disclaimer: I'm biased, I work on MCP + Agent things @ Cloudflare). However, I figure building agents probably maps similarly well onto any actor style framework.
Anthropic are leaning more into multi-agent setups where the parent agent might delegate to one or more sub-agents which might run in parallel. They use that trick for Claude Code - I have some notes on reverse-engineering that here https://simonwillison.net/2025/Jun/2/claude-trace/ - and expand on that in their write-up of how Claude Research works: https://simonwillison.net/2025/Jun/14/multi-agent-research-s...
It's still _very_ early in figuring out good patterns for LLM tool-use - the models only got really great at using tools in about the past 6 months, so there's plenty to be discovered about how best to orchestrate them.
See for example the container use MCP which combines both: https://github.com/dagger/container-use
That’s for parallelizing coding work… I’m not sure about other kinds of work. I still see people using workflow builder tools like n8n, Zapier, and maybe CrewAI.
> We suggest that developers start by using LLM APIs directly
Best advice of the whole article by far.
It's insane that people use whole frameworks to send what is essentially an array of strings to a web service.
We've removed LangChain and LangGraph from our project at work because they are literally worthless: they just add complexity and make you write MORE code than if you didn't use them, since you have to deal with all their boilerplate.
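For reference, the "array of strings to a web service" really is about this much code. A minimal sketch against OpenAI's chat completions endpoint (model name and prompt are placeholders):

```python
# Direct call to the chat completions endpoint: a JSON payload of messages,
# no framework required.
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "Summarize why simple beats clever here."},
        ],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```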
It split them up the way they would be split up in real life, but in real life there is an assumption that the people working on those tasks are going to communicate with each other. The way it generated tasks resulted in a HUGE loss of context (my plan was hella detailed).
I was willing to spend a few more hours trying to make it work rather than doing the work myself. I've opened another chat and split it up into multiple sequential tasks, with a detailed prompt for each task (why, what, how, validation, update documentation reminder etc).
Anyway, an orchestrator might work on some super simple tasks, much smaller than those articles lead you to believe.
Hugging Face's smolagents library makes the LLM generate Python code where tools are just normal Python functions. If you want parallel tool calls, just prompt the LLM to do so; it should take care of synchronizing everything. Of course there is the whole issue around executing LLM-generated code, but we have a few solutions for that.
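A minimal sketch of that pattern (class and decorator names per the smolagents docs, though the API moves quickly; the revenue lookup is a made-up stand-in):

```python
# The LLM writes Python that calls these plain functions as tools.
from smolagents import CodeAgent, HfApiModel, tool

@tool
def annual_revenue(company: str, year: int) -> float:
    """Return annual revenue in USD for a company and fiscal year.

    Args:
        company: Company name or ticker.
        year: Fiscal year.
    """
    # Hypothetical lookup table; swap in a real data source.
    return {("Apple", 2023): 383_285_000_000.0}.get((company, year), 0.0)

agent = CodeAgent(tools=[annual_revenue], model=HfApiModel())
agent.run("Fetch Apple's 2022 and 2023 revenue (call the tool for both years), "
          "then report which year was higher.")
```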
https://news.ycombinator.com/item?id=42470541
Building Effective "Agents", 763 points, 124 comments
A decentralized thing would be more for individuals who want more control and transparency. A decentralized public ledger would make it possible to verify that your agent, the agents it interacts with, and the contents of their interactions have not been altered or compromised in any way, whereas a corporate-owned framework could not provide the same level of assurance.
But technically, there's no advantage I can think of for using a public distributed ledger to manage interactions. Agent tasks are pretty ephemeral, so unlike digital currency, there's not really a need to maintain a complete historical log of every action forever. And as far as providing tools for dealing with race conditions, blockchain would be about the least efficient way of creating a mutex imaginable. So technically, just like with non-AI apps, centralized architecture is always going to be a lot more efficient.
(2) Multi-agent orchestration is difficult to control.
(3) The more capable the model, the lower the need for multi-agents.
(4) The less capable the model, the higher the business case for narrow AI.
If agents become more autonomous and start coordinating across platforms owned by different companies, it might make sense to have some kind of shared, trustless layer (maybe not blockchain but something distributed, auditable and neutral).
I agree that agent tasks are ephemeral, but what about long lived multi-agent workflows or contracts between agents that execute over time? In those cases transparency and integrity might matter more.
I don't think it's one or the other. Centralised systems will dominate in the short term, no doubt about that, but if we're serious about agent ecosystems at scale, we might need more open coordination models too.
I don't think this is correct. The benefit of agents is that they can use tools on the fly, ideally the right tool at the right time.
E.g., "Which number is bigger, 9.11 or 9.9?" -> agent uses a calculator tool; or "What is Apple's annual revenue for 2020-2023?" -> a financial statements MCP.
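Concretely, "the right tool at the right time" just means the model picks from a declared tool list per request, something like this (tool names and parameters are illustrative, in OpenAI function-calling format):

```python
# Illustrative tool declarations; the model decides per-request whether to
# call the calculator, the financial-statements tool, or neither.
tools = [
    {
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate a basic arithmetic expression.",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_financial_statements",
            "description": "Fetch reported revenue for a company and year range.",
            "parameters": {
                "type": "object",
                "properties": {
                    "company": {"type": "string"},
                    "start_year": {"type": "integer"},
                    "end_year": {"type": "integer"},
                },
                "required": ["company", "start_year", "end_year"],
            },
        },
    },
]
```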
Today, it's about API calls and compute. Tomorrow, for any truly autonomous, long-lived agent, it will be about a continuous "existence tax" levied by the platform owner. The orchestrator isn't just a technical component; it's a landlord.
The alternative isn't a more complex framework. It's a permissionless execution layer—a digital wilderness where an agent's survival depends on its own resources, not a platform's benevolence. The debate isn't about efficiency; it's about sovereignty.
For example, a marketing group is interested in agents but needs a guide on how to spec them at a basic level.
There is a figure toward the end and an appendix that starts to drive at this.
Even though it’s new, “how to build them” is an implementation concern.
I'm personally a fan of litellm, but I'm sure alternatives exist.
I still think it has a definite use case in regularising all of your various flows into a common format.
Sure, I could write some code to get SD to do all the steps to generate an image, or write some shader code. But it's so much more organised to use comfy-UI, or a shader graph, especially if I have n>1 flows/tasks, and definitely while experimenting with what I'm building.
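For what it's worth, the "common format" appeal of something like litellm looks roughly like this (a minimal sketch assuming litellm's completion wrapper; the model strings are illustrative):

```python
# litellm exposes an OpenAI-style completion() across providers, so swapping
# models is a string change rather than a new client integration.
from litellm import completion

for model in ("gpt-4o-mini", "anthropic/claude-3-5-sonnet-20240620"):
    resp = completion(
        model=model,
        messages=[{"role": "user", "content": "One sentence: what is an agent?"}],
    )
    print(model, "->", resp.choices[0].message.content)
```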
Or is it like a burrito (meme explanation of Monads when they were the latest hype)?
One cool example of this in action is seen when you use Claude Code and ask it to search something. In a verbose setting, it calls an MCP tool to help with search. The tool returns a summary of the results with the relevant links (not the raw search result text). A similar method, albeit more robust, is used when Claude is doing deep research as well.
[1]: https://github.com/anthropics/anthropic-cookbook/blob/main/p...
Workflows have a lot more structure and rules about information and control flow. Agents, on the other hand, are often given a set of tools and a prompt. They are much more free-form.
For example, a workflow might define a fuzzy rule like "if the customer issue is a refund, go to the refund flow," while an agent gets customer service tools and figures out how to handle each case on its own.
To me, this is a meaningful distinction to make. Workflows can be more predictable and reliable. Agents have more freedom and can tackle a greater breadth of tasks.
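A toy sketch of that contrast (the routing rule, tool names, and stubbed LLM calls are all made up): the workflow keeps control flow in our code, while the agent only gets tools and a goal.

```python
# Toy contrast. The "LLM calls" are stubbed so only the structure is visible.

def classify_intent(ticket: str) -> str:
    # Stand-in for a constrained LLM classification call.
    return "refund" if "refund" in ticket.lower() else "other"

def workflow(ticket: str) -> str:
    # Workflow: our code owns the control flow; the model only classifies.
    if classify_intent(ticket) == "refund":
        return "ran fixed refund steps"
    return "ran general support steps"

def agent(ticket: str) -> str:
    # Agent: the model owns the control flow; we only provide tools.
    tools = {"look_up_order": lambda: "order found",
             "issue_refund": lambda: "refund issued"}
    plan = ["look_up_order", "issue_refund"]  # in reality, chosen turn-by-turn by the LLM
    return "; ".join(tools[step]() for step in plan)

print(workflow("I want a refund"))
print(agent("I want a refund"))
```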
A few clearly defined LLM calls with some light glue logic usually lead to something more stable, easier to debug, and much cheaper to run. The flashy, full-featured agents often end up causing more problems than they solve.
https://www.merriam-webster.com/dictionary/workflow thinks the word dates back to 1921.
There's no reason Anthropic can't take that word and present their own alternative definition for it in the context of LLM tool usage, which is what they've done here.
The frameworks just usually add more complexity, obscurity, and API misalignment.
Now the equation can change IF you are getting a lot of observability, experimentation, etc. I think we are just reaching that point of utility where it is a real question whether you should use the framework by default.
For example, I built the first version of a product with my own Java code hooking right into an API. I was able to deliver the product quickly, with a clean architecture, observability, etc. Then, once the internal ecosystem was aligned on a framework (one mentioned in the article), a team took up migrating it to Python on that framework. It still isn't complete; it just introduces a lot of abstraction layers that you have to adapt to your internal systems, your internal observability setup, and anything else the rest of your applications do.
People underestimate that cost. So by default to get your V0 product off the ground (if you are not a complete startup), just use the API. That is my advice.
Concurrent tool calling is when the LLM writes multiple tool calls instead of one, and you can program your app to execute those sequentially or concurrently. This is a trivial concept.
The "agent framework" layer here is so thin it might as well not exist, and you can use Anthropic's or OpenAI's SDK directly. I don't see a need for fancy graphs with circles here.
To be fair, I think there might be a space for using Agent Frameworks, but the Agent space is too early for a good enough framework to emerge. The semi-contrarian thought, which I hold to a certain extent, is that the Agent space is moving so fast that a good enough framework might NEVER emerge.
My experience with LangGraph is that you spend so much time just fixing stupid runtime type errors, because the state of every graph is a stupid JSON blob with very minimal typing, and it's so hard to figure out how data moves through the system. Combined with Python's already weak type support, and the fact that you're usually dealing with long-running processes where things break mid- or end-of-process, development becomes quite awful. AI coding assistants only help so much. Tests are hard to write because these frameworks inevitably lean into the dynamic nature of Python.
I just can't understand why people are choosing to build these huge complex systems in an untyped language when the only AI or ML is API calls... or very occasionally doing some lightweight embeddings.
I've read a lot of comments that most pragmatic shops have dumped langchain/graph, haystack, crew etc for their own internal code that does everything more simply, but I can't currently conceptualize how tooling etc is actually done in the real world.
Do you have any links or docs that you've used as a basis for the work you could share? Thanks.
There's plenty of things that you need to make an AI agent that I wouldn't want to re-implement or copy and paste each time. The most annoying being automatic conversation history summarization (e.g. I accidentally wasted $60 with the latest OpenAI realtime model, because the costs go up very quickly as the conversation history grows). And I'm sure we'll discover more things like that in the future.
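A rough sketch of the kind of summarization scaffolding meant here (the threshold, model name, and "keep the last few turns" policy are arbitrary, and tokens are crudely approximated by characters):

```python
# When the conversation gets long, replace older turns with a cheap summary
# so per-request cost stops growing with the full history.
from openai import OpenAI

client = OpenAI()
MAX_CHARS = 20_000  # crude stand-in for a real token budget

def compact(messages: list[dict]) -> list[dict]:
    if sum(len(m["content"]) for m in messages) < MAX_CHARS:
        return messages
    old, recent = messages[:-6], messages[-6:]  # keep the last few turns verbatim
    summary = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": "Summarize this conversation so far, keeping facts "
                              "and decisions:\n" + "\n".join(m["content"] for m in old)}],
    ).choices[0].message.content
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent
```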
LLMs are amazingly powerful in some ways, but without this kind of "scaffolding", simply not reliable enough to make consistent choices.
---
1. Here are: a) a "language schema" describing what kinds of tags I want and why, with examples, b) The text I want you to tag c) A list of previously-defined tags which could potentially be relevant (simple string match)
List for yourself which pre-existing tags you plan to use when doing tagging.
[LLM generates a list of tags]
2. Here is a,b,c from above, and d) your own tag list
Please write a draft tag.
[LLM writes a draft]
3. Here is a-d from above, plus e) your first draft, and f) Some programmatically-generated "linter" warnings which may or may not be violations of the schema.
Please check over your draft to make sure it follows the schema.
[LLM writes a new draft]
Agent checks for "hard" rules, like making sure there's a 1-1 correlation between the text and the tags. If no rules are violated, move to step 5.
4. Here is a-e from above, plus g) your most recent draft, and h) known rule violations. Please fix the errors.
[LLM writes a new draft]
Repeat 4 until no hard rules are broken.
5. [and so on]
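A skeletal version of that loop (the helper names, prompts, and placeholder validators are all invented for illustration):

```python
# Skeleton of the multi-pass tagging pipeline described above; lint() and
# check_hard_rules() are placeholders for your own programmatic validators.
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

def lint(draft: str) -> list[str]:
    return []  # placeholder: schema warnings, may include false positives

def check_hard_rules(draft: str, text: str) -> list[str]:
    return []  # placeholder: e.g. verify 1-1 correspondence between text and tags

def tag_text(schema: str, text: str, candidates: list[str]) -> str:
    planned = llm(f"{schema}\n{text}\nCandidate tags: {candidates}\n"
                  "List which pre-existing tags you plan to use.")
    draft = llm(f"{schema}\n{text}\nCandidate tags: {candidates}\n"
                f"Your tag list: {planned}\nWrite a draft tag.")
    draft = llm(f"{schema}\n{text}\nDraft: {draft}\nLinter warnings: {lint(draft)}\n"
                "Check your draft against the schema and revise.")
    while violations := check_hard_rules(draft, text):
        draft = llm(f"{schema}\n{text}\nDraft: {draft}\n"
                    f"Rule violations: {violations}\nFix only these errors.")
    return draft
```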
> Over the past year, we've worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.
> ...There are many frameworks that make agentic systems easier to implement. ...These frameworks make it easy to get started by simplifying standard low-level tasks like calling LLMs, defining and parsing tools, and chaining calls together. However, they often create extra layers of abstraction that can obscure the underlying prompts and responses, making them harder to debug. They can also make it tempting to add complexity when a simpler setup would suffice. We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code.
https://www.gbif.org/news/6aw2VFiEHYlqb48w86uKSf/chatipt-sys...
It's still in beta.
Press release:
Rukaya Johaadien's chatbot provides conversation-style support to students and researchers who hold biodiversity data but are first-time or infrequent data publishers. Its prompts guide users as it cleans and standardizes spreadsheets, creates basic metadata, and publishes well-structured datasets on GBIF.org as a Darwin Core Archive.
To date, publishing high quality data from PhD and Master's degrees and other small-scale biodiversity research studies has been difficult to do at scale. Standardizing data typically requires specialist knowledge of programming languages, data management techniques, and familiarity with specialist software.
Meanwhile, the process of gaining access to existing instances of the Integrated Publishing Toolkit (IPT)—the GBIF network's workhorse application for data sharing run by node staff with limited time and resources—can test a novice's patience. Training can do little to surmount such logistical barriers and others, like language, when occasional users forget the precise steps and details from year to year.
"Data standardization is hard, and biologists don't become biologists because they like coding or Excel, so a lot of potentially valuable data falls by the wayside," said Johaadien. "Recognizing that large language models have gotten really good at generating code and working with data, I built an automated tool to guide non-technical users through routine questions and process their messy data as much as possible, then publish it quickly and automatically to GBIF."
It's a massive pain in the arse for testing though. Checking which out of X number of things performs the best for your use case is quite annoying if you have to have X implementations. Having one setup where you just swap out keys and a few vars makes this massively easier.
1. Agentic Automation: For every alert/ticket coming in, the agent does a pre-investigation across relevant APIs, DBs, etc, helping identify FPs and providing more context on real ones. Cuts down on human time and speeds up handling.
2. Vibes Investigation: The same agentic reasoning is used when spelunking, where beyond just text2sql, the LLM will spend 2-10 minutes investigating Splunk, Databricks, etc. for you.
Underneath, the agent has tools like semantic layers over DBs, large log/text/dataframe analysers, etc.
Anything an AI agent does that is not that can be done cheaply and deterministically by some code.
If code can replace humans, it can replace AI.
But, that's just a guess. Maybe the combination of AI and automation adds something special to the mix where a global public ledger becomes more valuable (beyond the hobbyist community) and I'm just not seeing it.
This defines how workflows are used with modern systems in my experience. Workflows are often not predictable, they often execute one of a set of tools based on a response from a previous invocation (e.g. an LLM call).
The only software that we use is Langfuse for observability and that too was breaking down for us. But they launched a new version - V3 - which might still work out for us.
I would suggest just using standard, non-AI-specific Python libraries and building your own systems. If you are migrating from n8n to a self-hosted system, then you can actually use NonBioS to build it out for you directly. If you join our Discord channels, we can get an engineer to help you out as well.