It tends to work better when you give the LLMs some specific narrow subtask to do rather than expecting them to be in the driver's seat.
The move by Cloudflare will totally ruin the AI scraper and the AI agent hype.
Of course "agents" is now a buzzword that means nothing, so there is that.
What Claude Code has taught me is that steering an agent via a test suite is an extremely powerful reinforcement mechanism (the feedback loop leads to success, most of the time) -- and I'm hopeful that new thinking will extend this into the other "soft skills" that an agent needs to become an increasingly effective collaborator.
The way I use AI today is by keeping a pretty tight leash on it, a la Claude Code and Cursor. Not because the models aren't good enough, but because I like to weigh in frequently to provide taste and direction. Giving the AI more agency isn't necessarily desirable, because I want to provide that taste.
Maybe that'll change as I do more and new ergonomics reveal themselves, but right now I don't really want AI that's too agentic. Otherwise, I kind of lose connection to it.
My experience is that, for many workflows, well-done “prompt engineering” is more than enough to make AI models behave more like we’d like without constantly needing us to weigh in.
They’ll just get the agent to operate a browser with vision and it’s over. CAPTCHAs were already obsolete like 2-3 years ago.
If we use a real world analogy, think of someone like an architect designing your house. I'm still going to be heavily involved in the design of my house, regardless of how skilled and tasteful the architect is. It's fundamentally an expression of myself - delegating that basically destroys the point of the exercise. I feel the same for a lot of the stuff I'm building with AI now.
We see these patterns so much that we packaged them up for Airflow (one of the most popular workflow tools)!
It had a lot of moving parts, of which agents were the top 30% that other systems would interact with. Storing, retrieving, and ranking the information was the more important 70% that isn't as glamorous and that no one makes courses about.
I still have no idea why everyone is talking about whatever the hottest decoder-only model is; encoder-only models are a lot more useful for most tasks not directly interfacing with a human.
From your comments, I’d venture a guess that you see your AI-assisted work as a creative endeavor — an expression of your creativity.
I certainly wouldn’t get my hopes up for AI to make innovative jokes, poems and the like. Yet for things that can converge on specific guidelines for matters of taste and preferences, like coding, I’ve been increasingly impressed by how well AI models adapt to our human wishes, even when expressed in ever longer prompts.
For example, a single prompt could tell an LLM to make sure a code change doesn't introduce mutability when the same functionality can be achieved with immutable expressions. Another one could tell it to avoid useless log statements (with my specific description of what that means).
When I want to evaluate a code change, I run all these prompts separately against it, collecting their structured output (via MCP). Of course, I incorporate this in my code-agent to provide automated review iterations.
If something escapes where I feel the need to "manually" provide context, I add a new prompt (or figure out how to extend whichever one failed).
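Roughly, the evaluation loop looks like this (a simplified sketch: the client, model name, and prompt wording are placeholders, and plain JSON output stands in for the MCP plumbing):

```python
# Simplified sketch: run each single-concern review prompt separately against a
# diff and collect structured verdicts. Assumes an OpenAI-compatible client;
# the model name and prompt texts are placeholders, not the real setup.
import json
from openai import OpenAI

client = OpenAI()

REVIEW_PROMPTS = {
    "immutability": (
        "Flag changes that introduce mutability where an immutable expression "
        "would achieve the same functionality. "
        'Reply as JSON: {"violations": [{"line": 0, "reason": ""}]}'
    ),
    "logging": (
        "Flag log statements that add no diagnostic value. "
        'Reply as JSON: {"violations": [{"line": 0, "reason": ""}]}'
    ),
}

def review(diff: str) -> dict:
    """Run every focused review prompt against the diff and gather the verdicts."""
    results = {}
    for name, rule in REVIEW_PROMPTS.items():
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": rule},
                {"role": "user", "content": diff},
            ],
        )
        results[name] = json.loads(resp.choices[0].message.content)
    return results
```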
I suspect a reason so many people are excited about agents is they are used to "chat assistants" as the primary purpose of LLMs, which is also the ideal use case for agents. The solution space in chat assistants is not defined in advance, and more complex interactions do get value from agents. For example, "find my next free Friday night and send a text to Bob asking if he's free to hang out" could theoretically be programmatically solved, but then you'd need to solve for every possible interaction with the assistant; there are a nearly unlimited number of ways of interfacing with an assistant, so agents are a great solution.
In the end the agentic coding bit was garbage, but I appreciated Claude’s help on writing the boilerplate to interface with Stockfish.
I do agree - the models have good taste and often do things that delight me, but there's always room for me to inject my taste. For example, I don't want the AI to choose what state management solution I use for my Flutter app because I have strong opinions about that.
- creating the right context for parallel and recursive tasks;
- removing some steps (e.g., editing its previous response) to show only the corrected output;
- showing it its own output as my comment, when I want a response;
Etc.
Obvious: while the agent can multiply the amount of work I can do, there's a multiplicative reduction in quality, which means I need to account for that (I have to add "time doing curation")
*https://www.slideserve.com/verdi/seng-697-agent-based-softwa...
By the time you've got a nice, well-established context with the right info... just give it to the user.
I like the idea of hallucination-free systems where the LLM merely classifies things at most.
Question -> classifier -> confirm the action to take with the user -> act using no AI
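A minimal sketch of that shape, assuming a fixed set of known actions (the action names and the classifier stub below are just illustrative):

```python
from typing import Callable

# The only things the system can actually do: plain deterministic code, no AI.
ACTIONS: dict[str, Callable[[], None]] = {
    "reset_password": lambda: print("calling internal reset-password API..."),
    "cancel_order":   lambda: print("calling internal cancel-order API..."),
}

def classify(question: str) -> str:
    """Map free text to one label from ACTIONS (or 'none').
    Stub: in practice this is a single constrained LLM completion whose output
    is restricted to the known labels."""
    return "none"

def handle(question: str) -> None:
    label = classify(question)
    if label not in ACTIONS:
        print("No supported action recognised; hand off to a human.")
        return
    # A human confirms before anything happens; past this point no AI is involved.
    if input(f"Run '{label}' for {question!r}? [y/N] ").strip().lower() == "y":
        ACTIONS[label]()
```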
I think there's some truth to using the right orchestration for the job, but I think that there's a lot more jobs that could benefit from agentic orchestration than the article would have you believe.
Hard disagree with most of the narrative. Don't start with models, start with Claude Code. For any use case. Go from there depending on costs.
> When NOT to use agents
> Enterprise Automation
Archive this blog.
The real lesson is: don't let any company other than the providers dictate what an agent is vs isn't.
Computer use agents are here; they are coming for the desktop of non-technical users; they will provide legitimate RPA capability and beyond; and anyone productizing agents will build on top of provider SDKs.
I used to build the way most of his examples work: just functions calling LLMs. I found it almost necessary due to poor tool selection, etc. But I think the leading edge LLMs like Gemini 2.5 Pro and Claude 4 are smart enough and good enough at instruction following and tool selection that it's not necessarily better to create workflows.
I do have a checklist tool and delegate command and may break tasks down into separate agents though. But the advantage of creating instructions and assigning tool commands, especially if you have an environment with a UI where it is easy to assign tool commands to agents and otherwise define them, is that it is more flexible and a level of abstraction above something like a workflow. Even for visual workflows it's still programming which is more brittle and more difficult to dial in.
This was not the case 6-12 months ago and doesn't apply if you insist on using inferior language models (which most of them are). It's really only a handful that are really good at instruction following and tool use. But I think it's worth it to use those and go with agents for most use cases.
The next thing that will happen over the following year or two is going to be a massive trend of browser and computer use agents being deployed. That is again another level of abstraction. They might even incorporate really good memory systems and surely will have demonstration or observation modes that can extract procedures from humans using UIs. They will also learn (record) procedural details for optimization during exploration from verbal or written instructions.
More seriously, yes, it makes sense that LLMs are not going to be able to take humans entirely out of the loop. Think about what it would mean if that were the case: if people, on the basis of a few simple prompts, could let the agents loose and create sophisticated systems without any further input, then there would be nothing to differentiate those systems, and thus they would lose their meaning and value.
If prompting is indeed the new level of abstraction we are working at, then what value is added by asking Claude: make me a note-taking app? A million other people could also issue this same low-effort prompt; thus what is the value added here by the prompter?
The old adage still applies: there is no free lunch.
If you skip the modeling part and rely on something that you don't control being good enough, that's faith, not engineering.
The goal _should_ be to avoid doing traditional software engineering or creating a system that requires typical engineering to maintain.
Agents with leading edge LLMs allow smart users to have flexible systems that they can evolve by modifying instructions and tools. This requires less technical skill than visual programming.
If you are only taking advantage of the LLM to handle a few wrinkles or a little bit of natural language mapping then you aren't really taking advantage of what they can do.
Of course you can build systems with rigid workflows and a sprinkling of LLM integration, but for most use cases it's probably not the right default mindset for mid-2025.
Like I said, I was originally following that approach a little ways back. But things change. Your viewpoint is about a year out of date.
10-15 years ago the challenge in ML/PR was "feature engineering", the careful crafting of rules that would define features in the data which would draw the attention of the ML algorithm.
Then deep learning came along and it solved the issue of feature engineering; just throw massive amounts of data at the problem and the ML algorithms can discern the features automatically, without having to craft them by hand.
Now we've gone as far as we can with massive data, and the problem seems to be that it's difficult to bring out the relevant details when there's so much data. Hence "context engineering": a manual, heuristic-heavy process guided by trial and error and intuition. More an art than a science. Pretty much the same thing that "feature engineering" was.
Although sometimes the difficult part is knowing what to make, and LLMs are great for people who actually know what they want, but don’t know how to do it
You're YOLOing it, and okay that may be fine but may also be a colossal mistake, especially if you remove or never had a human in the loop.
The callout on enterprise automation is interesting b/c it's one of the $T sized opportunities that matters most here, and while I think the article is right in the small, I now think quite differently in the large for what ultimately matters here. Basically, we're crossing the point where one agent written in natural language can easily be worth ~100 python scripts and be much shorter at the same time.
For context, I work with operational enterprise/gov/tech co teams like tier 1+2 security incident response, where most 'alerts' don't get seriously investigated as under-resourced & under-automated teams have to just define them away. Basically ever since GPT-4, it's been pretty insane figuring this stuff out with our partners here. As soon as you get good at prompt templates / plans with Claude Code and the like to make them spin for 10min+ productively, this gets very obvious.
Before agents:
Python workflows and their equivalent. They do not handle variety & evolution because they're hard-coded. Likewise, they only go so far on a task because they're brain dead. Teams can only crank out + maintain so many.
After agents:
You can easily sketch out 1 investigation template in natural language that literally goes 10X wider + 10X deeper than the equiv of Python code, including Python AI workflows. You are now handling much more of the problem.
What's good prompting for one model can be bad for another.
For me, Claude Code completely ignores the instruction to read and follow AGENTS.md, and I have to remind it every time.
The joys of non-deterministic blackboxes.
No.
--- start quote ---
prompt engineering is nothing but an attempt to reverse-engineer a non-deterministic black box for which any of the parameters below are unknown:
- training set
- weights
- constraints on the model
- layers between you and the model that transform both your input and the model's output that can change at any time
- availability of compute for your specific query
- and definitely some more details I haven't thought of
https://dmitriid.com/prompting-llms-is-not-engineering
--- end quote ---
Spamming is not only obnoxious, but a very weak example. Spamming is so error tolerant that if 30% of the output is totally wrong, the sender won't notice. Response rates are usually very low. This is a singularly un-demanding problem.
You don't even need "AI" for this. Just score LinkedIn profiles based on keywords, and if the score is high enough, send a spam. Draft a few form letters, and send the one most appropriate for the keywords. Probably would have about the same reply rate.
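A toy sketch of that non-AI version (the keywords, weights, threshold, and templates are made up for illustration):

```python
# Score a profile on keyword hits, and only send a message when the score clears
# a threshold; the "most appropriate" form letter is picked by keyword as well.
KEYWORDS = {"kubernetes": 3, "platform engineering": 2, "sre": 2, "devops": 1}
TEMPLATES = {
    "infra":   "Hi {name}, saw your infra background...",
    "generic": "Hi {name}, quick question...",
}
THRESHOLD = 4

def score(profile_text: str) -> int:
    text = profile_text.lower()
    return sum(weight for kw, weight in KEYWORDS.items() if kw in text)

def pick_message(profile_text: str, name: str) -> str | None:
    if score(profile_text) < THRESHOLD:
        return None  # below threshold: send nothing
    key = "infra" if "kubernetes" in profile_text.lower() else "generic"
    return TEMPLATES[key].format(name=name)
```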
I have been working on LLMs since 2017, both training some of the biggest and then creating products around them, and I still consider that I have no experience with agents.
GPT-3, while being impressive at the time, was too bad to even let it do that; it would break after 1 or 2 steps, so letting it do anything by itself would have been a waste of time where the human in the loop would always have to re-do everything. Its planning ability was too poor and hallucinations way too frequent to be useful in those scenarios.
https://gist.github.com/artpar/60a3c1edfe752450e21547898e801...
(especially AGENT.knowledge is quite helpful)
It would be helpful to know which models were used in each scenario; otherwise this can largely be ignored.
I'd also be interested in your process for creating these files, such as examples of prompts, tools, and references for your research.
See also https://ai.intellectronica.net/the-case-for-ai-workflows
> Can you provide any form of demonstration of an LLM reading these files and acting accordingly
Claude does update them at the end of the session (I say "wrap up" in the prompt). The ones you are seeing in that gist are the original forms; they evolve with each commit.
Do you know of any kind of write up (by you or someone else) on this topic? Admittedly I never really spent too much time on this since I was working on pre-training, but I did try to do a few smart things with it and it pretty much failed at every thing, in big part because it wasn't even instruction tuned, so was very much still an autocomplete model.
So I'd be curious to learn more about how people got it to succeed at agentic behaviors.
Do you think, just maybe, it might be interesting to play around with these tools without worrying about how productive you're being?
I'd have to do this anyways, if I was writing the code myself, so this is not "time above what I'd normally spend"
The visuals it makes for me I can inspect and easily tell if it is on the right path, or wrong. The test suite is a sharper notion of "this is right, this is wrong" -- more sharp than just visual feedback and my directions.
The basic idea is to setup a feedback loop for the agent, and then keep the agent in the loop, and observe what it is doing. The visuals are absolutely critical -- as a compressed representation of the behavior of the codebase, which I can quickly and easily parse and recognize if there are issues.
I think you’ll find that after 10 years one’ll look back on oneself at 5 years’ experience and realise that one wasn’t an expert back then. The same is probably true of 20 years looking back on 10.
Given a median career of about 40 years, I think it’s fair to estimate that true expertise takes at least 10–15 years.
> most agent systems break down from too much complexity, not too little
...when the platform wasn't made to handle complexity. The main problem is that the "frameworks" are not good enough for agentic workloads, which naturally will scale into complex stateful chaos. This requires another approach, but all that is done is delegating this to LLMs. As the author says "A coordinator agent that managed task delegation", which is the wrong way, an easy exit, like "maybe it will vibe-state itself?".
Agentic systems existed before LLMs (check ABM), and nowadays most ppl confuse what LLMs give us (all-knowing subconscious DBs) with agency, which is the purpose of completing a process. E.g., a bus driver is an agent, but you don't ask a bus driver to play the piano. The agent has predefined behavior, within a certain process.
Another common mistake is considering a prompt (with or without history) an agent. It's just a DB model which you query. A deep research agent has 3 prompts: check if an answer is possible, scrape, and answer. These are NOT 3 agents - these are DB queries. Delegating logical decisions to LLMs without verification is like having a drunk bus driver. A new layer is needed, which all the Python frameworks offer on top of their prompts. That's a mistake, because it splits the control flow, and managing complex state with FSMs or imperative code will soon hit a wall.
Declarative programming to the rescue - this is the only (and also natural) way of handling live and complex systems. It has to be done from the bottom up, and it will change the paradigm of the whole agent. I've worked on this exact approach for a while now, and besides handling complexity, the 2nd challenge is navigating through it easily, to find answers to your questions (what and when, exactly, went wrong). I let LLMs "build" the dynamic parts of the agent (like planning), but keep them under IoC - only the agent layer makes decisions. Another important thing - small prompts, with a single task; 100 focused prompts are better than 1 pasta-prompt. Again, without a proper control flow, synchronizing 100 co-dependent prompts can be tricky (when approached imperatively, with e.g. a simple loop).
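A minimal sketch of what I mean by declaring small, single-task prompts and keeping the control flow in the agent layer (the step names and the call_llm hook are illustrative, not my actual implementation):

```python
# Each prompt is declared with one narrow task and its dependencies; a tiny
# scheduler decides what runs next, so the LLM never owns the control flow (IoC).
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    prompt: str                      # one narrow task per prompt
    deps: list[str] = field(default_factory=list)

STEPS = [
    Step("feasible", "Can this question be answered from the given sources? yes/no"),
    Step("scrape",   "Extract the passages relevant to the question.", deps=["feasible"]),
    Step("answer",   "Answer using only the extracted passages.",      deps=["scrape"]),
]

def run(steps: list[Step], call_llm) -> dict[str, str]:
    """call_llm(prompt, context) is whatever LLM backend you plug in."""
    done: dict[str, str] = {}
    pending = {s.name: s for s in steps}
    while pending:
        # Pick any step whose dependencies are satisfied; the agent layer,
        # not the LLM, decides execution order.
        ready = next(s for s in pending.values() if all(d in done for d in s.deps))
        done[ready.name] = call_llm(ready.prompt, context=done)
        del pending[ready.name]
    return done
```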
There's more to it, and I recommend checking out my agents (research and cook), either as a download, source code, or a video walk-through [0].
PS. Embrace chaos, and the chaos will embrace you.
TL;DR: toy frameworks in Python, ppl avoiding coding, drunk LLMs
My idea of a good time is understanding the system in depth and building it while trusting it does what I expect. This is going away, though.
By saying "you should just use agents", anyone who has read the article will assume that you're talking about the case where there's no human in the loop.
If anything, such tedious obsessions can just cloud a person's mind against creating something interesting that does turn out to also have long-term import. I mean, I assume I'm talking to either a troll or an idiot given the weird rant you replied with, but it's good to remember that value doesn't always come in a specifically molded form.