
144 points | pranay01 | 1 comment
olliem36 ◴[] No.45400037[source]
We've built a multi-agent system, designed to run complex tasks and workflows with just a single prompt. Prompts are written by non-technical people, can be 10+ pages long...

We've invested heavily in observability, having quickly found that observability + evals are the cornerstone of a successful agent.

For example, a few things we measure:

1. Task complexity (assessed by another LLM)
2. Success metrics given the task(s) (again, by other LLMs)
3. Speed of agent runs & tools
4. Tool errors, incl. timeouts
5. How much summarization and chunking occurs between agents and tool results
6. Tokens used, cost
7. Reasoning, and the model selected by our dynamic routing
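For anyone curious what the tool-level half of this (items 3, 4, 6) can look like, here's a minimal sketch in Python. All the names (`AgentRunMetrics`, `record_tool_call`) are hypothetical, not from the actual system described above:

```python
import time
from dataclasses import dataclass, field

@dataclass
class AgentRunMetrics:
    """Per-run metrics bag; one instance per agent run."""
    task_id: str
    complexity: int = 0            # 1-5, scored by a judge LLM
    success_score: float = 0.0     # 0-1, scored by a judge LLM
    tool_errors: int = 0
    tool_timeouts: int = 0
    tokens_used: int = 0
    cost_usd: float = 0.0
    durations: dict = field(default_factory=dict)  # tool name -> seconds

def record_tool_call(metrics, tool_name, fn, *args, **kwargs):
    """Run a tool call, timing it and counting errors/timeouts,
    then re-raise so the agent's own error handling still runs."""
    start = time.monotonic()
    try:
        return fn(*args, **kwargs)
    except TimeoutError:
        metrics.tool_timeouts += 1
        raise
    except Exception:
        metrics.tool_errors += 1
        raise
    finally:
        metrics.durations[tool_name] = time.monotonic() - start
```

The point is just that a dumb wrapper around every tool call gets you durations, error counts, and timeout counts for free, which is most of what the dashboard needs.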

Thank god it's been relatively cheap to build this in-house: our metrics dashboard is essentially a vibe-coded React admin site, but it has proven absolutely invaluable!

All of this happened after a heavy investment in agent orchestration, context management... it's been quite a ride!

replies(5): >>45400062 #>>45400266 #>>45402025 #>>45403324 #>>45486138 #
1. debadyutirc ◴[] No.45486138[source]
This is awesome. Love seeing more teams investing early in observability and evals instead of treating them as an afterthought.

Your setup (LLM-assessed complexity, semantic success metrics, tool-level telemetry) hits what a lot of orgs miss: tying evaluation and observability together. Most teams stop at traces and latency, but without semantic evals you can't really explain or improve behavior.

We’ve seen the same pattern across production agent systems: once you layer in LLM-as-judge evals, distributed tracing, and data quality signals, debugging turns from “black box” to “explainable system.” That’s when scaling becomes viable.
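For concreteness, the LLM-as-judge part can be as small as this. A sketch only: `call_judge_llm` is a placeholder for whatever model client you inject, and the prompt/score scale are assumptions, not anyone's production setup:

```python
import json

# Judge prompt; literal JSON braces are doubled because we use str.format().
JUDGE_PROMPT = """Rate how well the ANSWER completes the TASK.
Reply with JSON only: {{"score": <integer 0-10>, "reason": "<one sentence>"}}
TASK: {task}
ANSWER: {answer}"""

def semantic_eval(task, answer, call_judge_llm):
    """Score an agent answer with a judge LLM; returns (score 0-1, reason)."""
    raw = call_judge_llm(JUDGE_PROMPT.format(task=task, answer=answer))
    verdict = json.loads(raw)
    return verdict["score"] / 10.0, verdict["reason"]
```

In practice you'd also guard against the judge returning malformed JSON (retry or fall back to a default score), but the shape is the same.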

Would love to hear how you’re handling drift or regression detection across those metrics. With CoAgent, we’ve been exploring automated L2–L4 eval loops (semantic, behavioral, business-value levels) and it’s been eye-opening.
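One cheap baseline for the drift question, before anything fancier: compare the recent window of any scalar metric (success score, cost, latency) against a baseline window with a z-score test. A sketch, nothing more; threshold and windowing are assumptions:

```python
from statistics import mean, stdev

def detect_drift(baseline, recent, z_threshold=3.0):
    """Flag drift when the recent mean deviates from the baseline mean
    by more than z_threshold baseline standard deviations."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(recent) != mu
    z = abs(mean(recent) - mu) / sigma
    return z > z_threshold
```

Run it per metric on a schedule and alert on any flag; semantic-level regressions then show up as drift in the judge scores even when traces and latency look healthy.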