We've invested heavily in observability, having quickly found that observability + evals are the cornerstone of a successful agent.
For example, a few things we measure:
1. Task complexity (assessed by another LLM)
2. Success metrics for the given task(s) (again, assessed by other LLMs)
3. Speed of agent runs & tools
4. Tool errors, including timeouts
5. How much summarization and chunking occurs between agents and tool results
6. Tokens used, cost
7. Reasoning, and the model selected by our dynamic routing
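For a rough idea of how the measurements above could hang together, here is a minimal sketch of a per-run record plus a small rollup for a dashboard. It is only an illustration, not the actual schema: every name in it (`AgentRun`, `ToolCall`, `summarize`, the field names, the 1-5 and 0-1 score ranges) is a hypothetical assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    """One tool invocation inside an agent run (hypothetical shape)."""
    name: str
    duration_s: float
    error: Optional[str] = None  # None means the call succeeded
    timed_out: bool = False


@dataclass
class AgentRun:
    """Metrics for a single agent run (all field names are assumptions)."""
    task_complexity: int   # assumed 1-5 scale, scored by a judge LLM
    success_score: float   # assumed 0.0-1.0 scale, scored by a judge LLM
    duration_s: float
    model: str             # model picked by the router
    routing_reason: str    # why the router picked it
    input_tokens: int
    output_tokens: int
    cost_usd: float
    summarizations: int = 0  # summarization steps between agents/tools
    chunks: int = 0          # chunking steps applied to tool results
    tool_calls: list[ToolCall] = field(default_factory=list)


def summarize(runs: list[AgentRun]) -> dict:
    """Roll a batch of runs up into the numbers a dashboard might show."""
    calls = [c for r in runs for c in r.tool_calls]
    failed = [c for c in calls if c.error is not None or c.timed_out]
    n = len(runs)
    return {
        "runs": n,
        "avg_success": sum(r.success_score for r in runs) / n if n else 0.0,
        "avg_duration_s": sum(r.duration_s for r in runs) / n if n else 0.0,
        "tool_error_rate": len(failed) / len(calls) if calls else 0.0,
        "tool_timeouts": sum(1 for c in calls if c.timed_out),
        "total_tokens": sum(r.input_tokens + r.output_tokens for r in runs),
        "total_cost_usd": sum(r.cost_usd for r in runs),
    }
```

Keeping one flat record per run means the dashboard only needs simple aggregations, and slicing by `model` or `task_complexity` is a group-by away.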
Thank god it's been relatively cheap to build this in-house. Our metrics dashboard is essentially a vibe-coded React admin site, but it has proven absolutely invaluable!
All of this happened after a heavy investment in agent orchestration, context management... it's been quite a ride!