Just take any example and think about how a human would break it down with a decision tree.
You are building an AI system to respond to your email.
The first agent decides whether the new email should be responded to, yes or no.
If no, it can send it to another LLM call that decides whether to archive it or leave it in the inbox for the human.
If yes, it sends it to a classifier that decides what type of response is required.
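Sketched out in TypeScript, that first part of the tree might look something like this. The `llm()` helper is a made-up stand-in for whatever model API you're actually calling, not a real library:

```typescript
// Made-up helper standing in for your actual model API call.
async function llm(prompt: string): Promise<string> {
  // ...call your provider here...
  return "";
}

// First decision: does this email need a response at all?
async function triage(email: string): Promise<string> {
  const needsReply = await llm(
    `Does this email need a response? Answer only YES or NO.\n\n${email}`
  );

  if (needsReply.trim().toUpperCase() !== "YES") {
    // Second LLM call: archive it, or leave it for the human?
    const verdict = await llm(
      `Should this email be archived or left in the inbox? Answer ARCHIVE or LEAVE.\n\n${email}`
    );
    return verdict.trim().toUpperCase() === "ARCHIVE" ? "archive" : "leave_in_inbox";
  }

  // Classifier call: what type of response is required?
  const kind = await llm(
    `What kind of reply does this email need: brief_ack, sales, or other? One word.\n\n${email}`
  );
  return `respond:${kind.trim().toLowerCase()}`;
}
```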
Maybe some work emails only need something brief, like a “congrats!” reply to all those new-feature-launch announcements you get internally.
Others might be inbound sales emails that need to go out to another system that fetches product-related knowledge to craft a response with the right context, followed by a checker call that makes sure the response follows brand guidelines.
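That sales branch could be a short chain, reusing the made-up `llm()` helper from the sketch above plus a hypothetical `fetchProductDocs()` for the knowledge lookup:

```typescript
// Hypothetical knowledge lookup (e.g. a search over your product docs).
async function fetchProductDocs(query: string): Promise<string> {
  // ...vector search, keyword search, whatever you have...
  return "";
}

// Procedural limit: cap the revision loop so a picky checker can't spin forever.
const MAX_REVISIONS = 2;

async function draftSalesReply(email: string): Promise<string> {
  const context = await fetchProductDocs(email);
  let draft = await llm(
    `Using this product context, draft a reply to the email below.\n\nContext:\n${context}\n\nEmail:\n${email}`
  );

  for (let i = 0; i < MAX_REVISIONS; i++) {
    // Checker call: does the draft follow brand guidelines?
    const verdict = await llm(
      `Does this reply follow our brand guidelines? Answer PASS, or FAIL plus a reason.\n\n${draft}`
    );
    if (verdict.trim().toUpperCase().startsWith("PASS")) break;
    draft = await llm(
      `Revise this reply to address the feedback.\n\nFeedback:\n${verdict}\n\nReply:\n${draft}`
    );
  }
  return draft;
}
```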
The point is that all of these steps are completely hypothetical, but you can imagine how loosely providing a set of instructions, function calls, and procedural limits can classify things reliably and keep the error rate down.
You can do this for any workflow by creatively combining different function calls, recursion, procedural limits, etc. And if you build multiple decision trees/workflows, you can A/B test them and use LLM-as-a-judge to score the performance, especially if you’re working on a task with lots of example outputs.
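The judging loop can stay very simple. A sketch, with the same made-up `llm()` helper; `workflowA` and `workflowB` are whatever trees you've built:

```typescript
// LLM-as-a-judge: score a reply 1-10 given the original email.
async function judge(email: string, reply: string): Promise<number> {
  const raw = await llm(
    `Rate this reply to the email from 1 to 10. Answer with only the number.\n\nEmail:\n${email}\n\nReply:\n${reply}`
  );
  return Number.parseInt(raw.trim(), 10) || 0;
}

// A/B test two workflow variants over a set of example emails.
async function abTest(
  emails: string[],
  workflowA: (email: string) => Promise<string>,
  workflowB: (email: string) => Promise<string>
): Promise<{ a: number; b: number }> {
  let a = 0;
  let b = 0;
  for (const email of emails) {
    a += await judge(email, await workflowA(email));
    b += await judge(email, await workflowB(email));
  }
  // Average score per workflow; higher wins.
  return { a: a / emails.length, b: b / emails.length };
}
```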
As for trusted environments: assume every single LLM call has been hijacked, don’t trust its input/output, and you’ll be good. I put mine in their own Cloudflare Workers where they can’t do any damage beyond giving an odd response to the user.
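In practice, “don’t trust the output” means whitelisting what the rest of the system is allowed to act on. A minimal sketch, again with the made-up `llm()` helper:

```typescript
// Only these actions can ever reach the rest of the system.
const ALLOWED_ACTIONS = new Set(["archive", "leave_in_inbox", "respond"]);

async function safeTriage(email: string): Promise<string> {
  const raw = await llm(
    `Classify this email as archive, leave_in_inbox, or respond. One word.\n\n${email}`
  );
  const action = raw.trim().toLowerCase();
  // If the model was hijacked into saying anything else, the worst case is
  // a harmless default, not an arbitrary action.
  return ALLOWED_ACTIONS.has(action) ? action : "leave_in_inbox";
}
```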