Just to emphasize your point, below is a workflow I wrote for an LLM recently to do language tagging (e.g., of vocab, grammar structures, etc.). It's very different from what you'd think of as an "agent", where the LLM has tools and can take initiative.
LLMs are amazingly powerful in some ways, but without this kind of "scaffolding" they're simply not reliable enough to make consistent choices.
---
1. Here are: a) a "language schema" describing what kinds of tags I want and why, with examples; b) the text I want you to tag; c) a list of previously defined tags which could potentially be relevant (found by simple string match).
List for yourself which pre-existing tags you plan to use when tagging.
[LLM generates a list of tags]
2. Here are a-c from above, plus d) your own tag list from step 1.
Please write a draft tagging.
[LLM writes a draft]
3. Here are a-d from above, plus e) your first draft, and f) some programmatically generated "linter" warnings, which may or may not be actual violations of the schema.
Please check over your draft to make sure it follows the schema.
[LLM writes a new draft]
The harness then checks the "hard" rules, like making sure there's a 1:1 correspondence between the text and the tags (sketched in code at the end). If no rules are violated, skip to step 5.
4. Here are a-e from above, plus g) your most recent draft, and h) the known rule violations. Please fix the errors.
[LLM writes a new draft]
Repeat step 4 until no hard rules are broken.
5. [and so on]
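
For concreteness, here's roughly what the scaffolding looks like as code. This is a minimal Python sketch, not my actual implementation: `call_llm()`, `lint()`, and `hard_rule_violations()` are hypothetical stand-ins for the real LLM call and programmatic checks.

```python
# Minimal sketch of the loop above. call_llm(), lint(), and
# hard_rule_violations() are hypothetical placeholders.

def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM API you're using."""
    raise NotImplementedError

def lint(draft: str, schema: str) -> list[str]:
    """Soft warnings; may include false positives."""
    return []

def hard_rule_violations(draft: str, text: str) -> list[str]:
    """Hard rules, e.g. every span of text maps to exactly one tag."""
    return []

def tag_text(schema: str, text: str, candidate_tags: list[str]) -> str:
    context = f"Schema:\n{schema}\n\nText:\n{text}\n\nExisting tags: {candidate_tags}\n"

    # Step 1: make the model commit to a tag list before it starts tagging.
    tag_list = call_llm(context + "List which pre-existing tags you plan to use.")

    # Step 2: first draft, with the model's own tag list in context.
    draft = call_llm(context + f"Your tag list: {tag_list}\nWrite a draft tagging.")

    # Step 3: self-review against the schema, with linter warnings attached.
    draft = call_llm(
        context
        + f"Your tag list: {tag_list}\nDraft:\n{draft}\n"
        + f"Linter warnings (possible false positives): {lint(draft, schema)}\n"
        + "Check that your draft follows the schema and revise it."
    )

    # Step 4: loop until the deterministic hard rules pass (capped so a
    # stubborn draft can't spin forever).
    for _ in range(5):
        violations = hard_rule_violations(draft, text)
        if not violations:
            break
        draft = call_llm(
            context + f"Draft:\n{draft}\nRule violations: {violations}\nFix the errors."
        )
    return draft
```

The point is that the LLM never decides what happens next; the deterministic harness does, and the model only ever fills in one well-specified step at a time.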