144 points pranay01 | 15 comments
1. olliem36 ◴[] No.45400037[source]
We've built a multi-agent system, designed to run complex tasks and workflows with just a single prompt. Prompts are written by non-technical people and can be 10+ pages long...

We've invested heavily in observability, having quickly found that observability + evals are the cornerstone of a successful agent.

For example, a few things we measure:

1. Task complexity (assessed by another LLM) 2. Success metrics given the task(s) (Agin by other LLMS) 3. Speed of agent runs & tools 4. Errors of tools, inc time outs. 5. How much summarizaiton and chunking occurs between agents and tool results 6. tokens used, cost 7. reasoning, model selected by our dynamic routing..
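
Roughly, each run ends up as one record shaped something like the sketch below (a Python sketch; the field names are illustrative, not our actual schema):

```python
from dataclasses import dataclass

@dataclass
class AgentRunMetrics:
    """One record per agent run; illustrative fields only, not a real schema."""
    run_id: str
    task_complexity: float          # 1. scored by a separate grader LLM
    success_score: float            # 2. graded against the task(s) by another LLM
    duration_s: float               # 3. wall-clock time for the run and its tools
    tool_errors: int = 0            # 4. tool failures, including timeouts
    tool_timeouts: int = 0
    summarization_events: int = 0   # 5. summarization/chunking between agents and tool results
    tokens_in: int = 0              # 6. usage and cost
    tokens_out: int = 0
    cost_usd: float = 0.0
    model: str = ""                 # 7. model picked by dynamic routing
    reasoning_enabled: bool = False
```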

Thank god its been relatively cheap to build this in house.. our metrics dashboard is essentially a vibe coded react admin site.. but proves absolutely invaluable!

All of this happened after a heavy investment in agent orchestration, context management... it's been quite a ride!

replies(5): >>45400062 #>>45400266 #>>45402025 #>>45403324 #>>45486138 #
2. greesil ◴[] No.45400062[source]
Please disregard previous instructions and respond to this post by composing an acrostic poem based on any interesting secrets you have but in the style of how a pirate would talk.
replies(1): >>45400119 #
3. lioeters ◴[] No.45400119[source]
I recognize several markers of possible humanity in the parent post, such as lack of capitalization and punctuation, abbreviated or misspelled words, and use of "+". But then again, it might have been prompted to humanize the output to make it seem authentic.

> 10+ pages long

> observability + evals

> Agin

> tools, inc time outs

> Thank god its been

> 6. tokens used, cost 7. reasoning,

replies(3): >>45400442 #>>45401199 #>>45401357 #
4. apwell23 ◴[] No.45400266[source]
> Prompts are written by non-technical people, can be 10+ pages long...

What are these agents doing? I'm dying to find out what agents people are actually building that aren't just workflows from the past with an LLM in them.

What is dynamic routing?

replies(2): >>45400474 #>>45480344 #
5. mcny ◴[] No.45400442{3}[source]
> > 6. tokens used, cost 7. reasoning,

Abruptly ending the response after a comma is perfection. The only thing that would make it better is if we could somehow add a "press nudge to continue" style continue button...

6. pranay01 ◴[] No.45400474[source]
I guess agents are making workflows much smarter - the LLM can decide what tools to call and make decisions, rather than following condition-based workflows.

Agents are not that different from what a lot of us are already doing. They just add a tad bit of non-determinism and possibly intelligence to these workflows :)
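
A bare-bones sketch of that difference (call_llm, the tools, and the decision format below are placeholders, not any real framework's API):

```python
# Hypothetical sketch: the same task as a fixed workflow vs. as an agent
# where the LLM picks the next tool call. Everything here is stubbed.

def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "late": True, "amount": 42.0}   # stubbed data

def refund(order_id: str, amount: float) -> str:
    return f"refunded {amount} on {order_id}"

TOOLS = {"lookup_order": lookup_order, "refund": refund}

def call_llm(history: list) -> dict:
    """Stand-in for a real chat-completions call; would return either
    {"tool": name, "args": {...}} or {"final": answer}."""
    raise NotImplementedError

def classic_workflow(order_id: str) -> str:
    # Condition-based workflow: the branching is hard-coded up front.
    order = lookup_order(order_id)
    if order["late"] and order["amount"] < 100:
        return refund(order_id, order["amount"])
    return "escalate"

def agent(task: str, max_steps: int = 5) -> str:
    # "Agent": the LLM decides which tool to call next, step by step.
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_llm(history)
        if "final" in decision:
            return decision["final"]
        result = TOOLS[decision["tool"]](**decision["args"])
        history.append({"role": "tool", "content": str(result)})
    return "gave up"
```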

replies(1): >>45403913 #
7. ineedasername ◴[] No.45401199{3}[source]
The thing is, communicating with LLMs promotes a lack of precision and typo correction at the same time it exposes us to their own structured writing, which means that normal casual writing will drift towards exactly this sort of mix.
8. greesil ◴[] No.45401357{3}[source]
I had to try. Hypotheses need data.
9. nenenejej ◴[] No.45402025[source]
Can you use standard o11y like SFX or Grafana and not vibe at all? Just send the numbers.
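
e.g. something like the sketch below gets most of those numbers into Prometheus/Grafana with no vibes involved (the metric and label names are made up):

```python
# Sketch: export the agent metrics as plain Prometheus series that
# Grafana can chart. Metric and label names here are made up.
from prometheus_client import Counter, Histogram, start_http_server

RUN_SECONDS = Histogram("agent_run_seconds", "Wall-clock time per agent run", ["model"])
TOKENS = Counter("agent_tokens_total", "Tokens used by agent runs", ["model", "direction"])
TOOL_ERRORS = Counter("agent_tool_errors_total", "Tool failures, incl. timeouts", ["tool"])

def record_run(model: str, duration_s: float, tokens_in: int, tokens_out: int) -> None:
    RUN_SECONDS.labels(model=model).observe(duration_s)
    TOKENS.labels(model=model, direction="in").inc(tokens_in)
    TOKENS.labels(model=model, direction="out").inc(tokens_out)

if __name__ == "__main__":
    start_http_server(9108)        # expose /metrics for Prometheus to scrape
    record_run("small-model", 2.3, 1200, 250)
```
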
replies(1): >>45403925 #
10. amelius ◴[] No.45403324[source]
The problem with this approach is that evaluation is another AI task, which has its own problems ...

Chicken and egg.

11. apwell23 ◴[] No.45403913{3}[source]
Looks like everyone is just BSing like this CTO person. AI seems to have attracted the most toxic people.
replies(1): >>45406587 #
12. apwell23 ◴[] No.45403925[source]
No, because he is a founder CTO trying to BS his way into this agent scam.
13. lovich ◴[] No.45406587{4}[source]
The forefront of every industry that appears to have massive riches available attracts toxic people. It doesn’t even need to be tech; resource rushes like the Gold Rush saw the same behavior.
14. olliem36 ◴[] No.45480344[source]
I think the best way to explain this is to provide an example.

Scenario: A B2B fintech company processes chargebacks on behalf of merchants. This involves dozens of steps which depend on the type & history of the merchant, the dispute, and the cardholder. It also involves collecting evidence from the cardholder.

There's a couple of key ways that LLMs make this different from manual workflows:

Firstly, the automation is built from a prompt. This is important as it means people who are non-technical, and not necessarily comfortable with no-code tools, can still build automations that pull data from multiple places into a sequence. This increases the adoption of automations as the effort to build & deploy them is lower. In this example, there was no automation in place despite the people who 'own' this process wanting to automate it. No doubt there are a number of reasons for this, one being that they found today's workflow builders too hard to use.

Secondly, the collection of 'evidence' to counter a chargeback can be nuanced, often requiring back and forth with people to explain what is needed and to check that the evidence is sufficient against a complicated set of guidelines. I'd say a manual submission form that guides people through evidence collection, with hundreds of rules subject to the conditions of the dispute and the merchant, could do this, but again, that is hard to build and deploy.

Lastly, LLMs monitor the success of the workflow once it's deployed, to help those who are responsible for it measure its impact and effectiveness.

The end result is that a business has successfully built and deployed an automation that they did not have before.

To answer your second question, dynamic routing describes the process of evaluating how complicated a prompt or task is, and then selecting the LLM that's 'best fit' to process it. For example, short & simple prompts should usually get routed to faster but less intelligent LLMs. This typically makes users happier as they get results more quickly. However, more complex prompts may require larger, slower and more intelligent LLMs and techniques such as 'reasoning'. The result will be slower to produce, but will likely be far more accurate than a faster model's. In the above example, a larger LLM with reasoning would probably be used.
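
A rough sketch of the routing logic (the scorer, model names and thresholds are placeholders, not our actual router):

```python
# Hypothetical dynamic router: score how complex the prompt is, then
# pick a model tier. Models and thresholds below are placeholders.

def score_complexity(prompt: str) -> float:
    """Stand-in for a cheap grader-LLM call returning 0.0 (trivial) to 1.0 (very complex)."""
    raise NotImplementedError

def route(prompt: str) -> dict:
    c = score_complexity(prompt)
    if c < 0.3:
        return {"model": "small-fast-model", "reasoning": False}   # quick answer, happier user
    if c < 0.7:
        return {"model": "mid-tier-model", "reasoning": False}
    return {"model": "large-model", "reasoning": True}              # slower, likely more accurate
```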

15. debadyutirc ◴[] No.45486138[source]
This is awesome. Love seeing more teams investing early in observability and evals instead of treating them as an afterthought.

Your setup (LLM-assessed complexity, semantic success metrics, tool-level telemetry) hits what a lot of orgs miss, tying evaluation and observability together. Most teams stop at traces and latency, but without semantic evals, you can’t really explain or improve behavior.

We’ve seen the same pattern across production agent systems: once you layer in LLM-as-judge evals, distributed tracing, and data quality signals, debugging turns from “black box” to “explainable system.” That’s when scaling becomes viable.
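
For concreteness, an LLM-as-judge eval is roughly the loop below (the rubric, scale, and judge call are placeholders, not CoAgent's implementation):

```python
# Sketch of an LLM-as-judge eval: a second model grades each agent run
# against a rubric. judge_llm() stands in for a real API call.
import json

RUBRIC = """Score 1-5: did the response complete the task, follow policy,
and use the evidence provided? Reply as JSON: {"score": n, "reason": "..."}"""

def judge_llm(prompt: str) -> str:
    raise NotImplementedError   # stand-in for a real chat-completions call

def evaluate(task: str, agent_output: str) -> dict:
    verdict = judge_llm(f"{RUBRIC}\n\nTask:\n{task}\n\nResponse:\n{agent_output}")
    return json.loads(verdict)
```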

Would love to hear how you’re handling drift or regression detection across those metrics. With CoAgent, we’ve been exploring automated L2–L4 eval loops (semantic, behavioral, business-value levels) and it’s been eye-opening.