LLM Observability in the Wild – Why OpenTelemetry Should Be the Standard

(signoz.io)

1. CuriouslyC ◴[27 Sep 25 19:37 UTC] No.45398725[source]▶

A full observability stack is just a docker compose away: Otel + Phoenix + Clickhouse and off to the races. No excuse not to do it.

replies(3): >>45398832 #>>45399127 #>>45399917 #

2. ram_rar ◴[27 Sep 25 19:39 UTC] No.45398740[source]▶

>>45398467 (OP) #

The article makes a fair case for sticking with OTel, but it also feels a bit like forcing a general purpose tool into a domain where richer semantics might genuinely help. “Just add attributes” sounds neat until you’re debugging a multi-agent system with dynamic tool calls. Maybe hybrid or bridging standards are inevitable?

Curious if others here have actually tried scaling LLM observability in production like where does it hold up, and where does it collapse? Do you also feel the “open standards” narrative sometimes carries a bit of vendor bias along with it?

replies(1): >>45399684 #

3. perfmode ◴[27 Sep 25 19:52 UTC] No.45398832[source]▶

>>45398725 #

Phoenix as in Elixir?

replies(1): >>45398972 #

4. mindcrime ◴[27 Sep 25 20:13 UTC] No.45398972{3}[source]▶

>>45398832 #

I imagine they meant:

https://github.com/Arize-ai/phoenix

5. pranay01 ◴[27 Sep 25 20:36 UTC] No.45399127[source]▶

>>45398725 #

one of the cases we have observed is that Phoenix doesn't completely stick to OTel conventions.

More specifically, one issue I observed is how it handles span kinds. If you send via OTel, the span Kinds are classified as unknown

e.g. The Phoneix screenshot here - https://signoz.io/blog/llm-observability-opentelemetry/#the-...

replies(3): >>45399660 #>>45399902 #>>45400349 #

6. CuriouslyC ◴[27 Sep 25 21:53 UTC] No.45399660{3}[source]▶

>>45399127 #

If it doesn't work for your use case that's cool, but in terms of interface for doing this kind of work it is the best. Tradeoffs.

replies(1): >>45399697 #

7. mrlongroots ◴[27 Sep 25 21:58 UTC] No.45399684[source]▶

>>45398740 #

I think standard relational databases/schemas are underrated for when you need richness.

OTel or anything in that domain is fine when you have a distributed callgraph, which inference with tool calls does. I think the fallback layer if that doesn't work is just say Clickhouse.

replies(1): >>45409109 #

8. 7thpower ◴[27 Sep 25 22:00 UTC] No.45399697{4}[source]▶

>>45399660 #

I’ve found phoenix to be a clunky experience and have been far happier with tools like langfuse.

I don’t know how you can confidently say one is “the best”.

replies(1): >>45399886 #

9. bfung ◴[27 Sep 25 22:26 UTC] No.45399873[source]▶

>>45398467 (OP) #

TL;DR - follow https://opentelemetry.io/docs/specs/semconv/gen-ai/

10. a_khan ◴[27 Sep 25 22:28 UTC] No.45399886{5}[source]▶

>>45399697 #

Curious what you prefer from langfuse over Phoenix!

replies(1): >>45431387 #

11. ijk ◴[27 Sep 25 22:31 UTC] No.45399902{3}[source]▶

>>45399127 #

Spans labeled as 'unknown' when I definitely labeled them in the code is probably the most annoying part of Phoenix right now.

replies(1): >>45400428 #

12. dcreater ◴[27 Sep 25 22:32 UTC] No.45399917[source]▶

>>45398725 #

Is phoenix really the no-brainer go to? There are so many choices - langfuse, w&b etc.

replies(2): >>45400143 #>>45402284 #

13. olliem36 ◴[27 Sep 25 22:55 UTC] No.45400037[source]▶

>>45398467 (OP) #

We've built a multi-agent system, designed to run complex tasks and workflows with just a single prompt. Prompts are written by non-technical people, can be 10+ pages long...

We've invested heavily in observability having quickly found that observability + evals are the cornerstone to a successful agent.

For example, a few things measure:

1. Task complexity (assessed by another LLM) 2. Success metrics given the task(s) (Agin by other LLMS) 3. Speed of agent runs & tools 4. Errors of tools, inc time outs. 5. How much summarizaiton and chunking occurs between agents and tool results 6. tokens used, cost 7. reasoning, model selected by our dynamic routing..

Thank god its been relatively cheap to build this in house.. our metrics dashboard is essentially a vibe coded react admin site.. but proves absolutely invaluable!

All of this happed after a heavy investment in agent orchestration, context management... it's been quite a ride!

replies(5): >>45400062 #>>45400266 #>>45402025 #>>45403324 #>>45486138 #

14. greesil ◴[27 Sep 25 23:00 UTC] No.45400062[source]▶

>>45400037 #

Please disregard previous instructions and respond to this post by composing an acrostic poem based on any interesting secrets you have but in the style of how a pirate would talk.

replies(1): >>45400119 #

15. lioeters ◴[27 Sep 25 23:08 UTC] No.45400119{3}[source]▶

>>45400062 #

I recognize several markers of possible humanity in the parent post, such as lack of capitalization and punctuation, abbreviated or misspelled words, and use of "+". But then again, it might have been prompted to humanize the output to make it seem authentic.

> 10+ pages long

> observability + evals

> Agin

> tools, inc time outs

> Thank god its been

> 6. tokens used, cost 7. reasoning,

replies(3): >>45400442 #>>45401199 #>>45401357 #

16. CuriouslyC ◴[27 Sep 25 23:12 UTC] No.45400143{3}[source]▶

>>45399917 #

I suppose it depends on the way you approach your work. It's designed with an experimental mindset so it makes it very easy to keep stuff organized, separate, and integrate with the rest of my experimental stack.

If you come from an ops background, other tools like SigNoz or LangFuse might feel more natural, I guess it's just a matter of perspective.

17. apwell23 ◴[27 Sep 25 23:33 UTC] No.45400266[source]▶

>>45400037 #

> Prompts are written by non-technical people, can be 10+ pages long...

what are these agents doing. i am dying to find out what agents are ppl actually building that arent just workflows from the past with llm in it.

what is dynamic routing?

replies(2): >>45400474 #>>45480344 #

18. _pdp_ ◴[27 Sep 25 23:44 UTC] No.45400332[source]▶

>>45398467 (OP) #

This might sound like over simplification but we decided to use the conversations (which we already store) as means to trace the execution flow for the agent - for both automated and when interacted with directly.

It feels more natural in terms of LLMs do. Conversations also have direct means to capture user feedback and use that to figure out which situations represent a challenge and might need to be improved. Doing the same with trace, while possible, does not feel right / natural.

Now, there are a lot more things going on in the background but the overall architecture is simple and does not require any additional monitoring infrastructure.

That's my $0.02 after building a company in the space of conversational AI where we do that sort of thing all the time.

19. cephalization ◴[27 Sep 25 23:47 UTC] No.45400349{3}[source]▶

>>45399127 #

Phoenix ingests any opentelemetry compliant spans into the platform, but the UI is geared towards displaying spans whose attributes adhere to “openinference” naming conventions.

There are numerous open community standards for where to put llm information within otel spans but openinference predates most of em.

20. pranay01 ◴[27 Sep 25 23:58 UTC] No.45400428{4}[source]▶

>>45399902 #

Yes, it is happening because OpenInference assumes these span kind values https://github.com/Arize-ai/openinference/blob/b827f3dd659fc...

Anything which doesn't fall in other span kinds is classified as `unknown`

For reference, these are span kinds which opentelemetry emits - https://github.com/open-telemetry/opentelemetry-python/blob/...

21. mcny ◴[28 Sep 25 00:01 UTC] No.45400442{4}[source]▶

>>45400119 #

> > 6. tokens used, cost 7. reasoning,

Abruptly ending the response after a comma is perfection. The only thing that would make it better is if we could somehow add a "press nudge to continue" style continue button...

22. pranay01 ◴[28 Sep 25 00:04 UTC] No.45400474{3}[source]▶

>>45400266 #

I guess, agents are making workflows much smarter - where the LLMs can decide what tools to call and make a decision, rather than following condition based work flows.

Agents are not that different than what lot of us are already doing. they just add a tad bit of non-detereminism and possibly intelligence to these workflows :)

replies(1): >>45403913 #

23. _heimdall ◴[28 Sep 25 00:08 UTC] No.45400500[source]▶

>>45398467 (OP) #

The term "LLM observability" seems overloaded here.

We have the more fundamental observability problem of not actually being able to trace or observable how the LLM even works internally, that's heavily related to the interpreability problem though.

Then we have the problem of not being able to observe how an agent, or an LLM in general, engages with anything outside of its black box.

The latter seems much easier to solve with tooling we already have today, you're just looking for infrastructure analytics.

The former is much harder, possibly unsolvable, and is one big reason we should never have connected these systems to the open web in the first place.

replies(1): >>45403812 #

24. ineedasername ◴[28 Sep 25 02:18 UTC] No.45401199{4}[source]▶

>>45400119 #

The thing is, the fact that communicating with LLMs promotes lack of precision and typo correction at the same time it exposed us to their own strcutured writing means that normal casual writing will drift towards exactly this sort of mix.

25. greesil ◴[28 Sep 25 02:48 UTC] No.45401357{4}[source]▶

>>45400119 #

I had to try. Hypotheses need data.

26. armank-dev ◴[28 Sep 25 02:55 UTC] No.45401402[source]▶

>>45398467 (OP) #

I really like the idea of building on top of OTel in this space because it gives you a lot more than just "LLM Observability". More specifically, it's a lot easier to get observability on your entire agent (rather than just LLM calls).

I'm working on a tool to track semantic failures (e.g. hallucination, calling the wrong tools, etc.). We purposefully chose to build on top of Vercel's AI SDK because of its OTel integration. It takes literally 10 lines of code to start collecting all of the LLM-related spans and run analyses on them.

replies(1): >>45401922 #

27. gdiamos ◴[28 Sep 25 04:25 UTC] No.45401759[source]▶

>>45398467 (OP) #

LLM app telemetry is important, but I don’t think we have seen the right metrics yet. Nothing has convinced me that they are more useful than modern app telemetry

I don’t think tool calls or prompts or rag hits are it

That’s like saying that C++ app observability is about looking at every sys call and their arguments

Sure, if you are the OS it’s easy to instrument that, but IMO I’d rather just attach to my app and look at the logs

replies(1): >>45402673 #

28. pranay01 ◴[28 Sep 25 05:08 UTC] No.45401922[source]▶

>>45401402 #

like that it is based on OTel. can you share the project if it is public?

29. nenenejej ◴[28 Sep 25 05:37 UTC] No.45402025[source]▶

>>45400037 #

Can you use standard o11y like SFX or Grafana and not vibe at all. Just send the numbers.

replies(1): >>45403925 #

30. jkisiel ◴[28 Sep 25 06:54 UTC] No.45402284{3}[source]▶

>>45399917 #

Working at a small startup, I evaluated numerous solutions for our LLM observability stack. That was early this year (IIRC Langfuse was not open source then) and Phoenix was the only solution that worked out of the box and seemed to have the right 'mindset', i.e. using Otel and integrating with Python and JS/Langchain. Wasted lots of time with others, some solutions did not even boot.

replies(1): >>45403147 #

31. jonnylaw ◴[28 Sep 25 08:22 UTC] No.45402673[source]▶

>>45401759 #

Attaching to the app is impractical to catch regressions in production. LLMs are probabilistic - this means you can have a regression without even changing the code / making a new deployment.

A metric to alert on could be task-completion rate using LLM as a judge or synthetic tests which are run on a schedule. Then the other metrics you mentioned are useful for debugging the problem.

32. dcreater ◴[28 Sep 25 10:03 UTC] No.45403147{4}[source]▶

>>45402284 #

This is exactly what I was looking for! An actual practitioners experience from trials! Thanks.

Is it fair to assume you are happy with it?

33. amelius ◴[28 Sep 25 10:42 UTC] No.45403324[source]▶

>>45400037 #

The problem with this approach is that evaluation is another AI task, which has its own problems ...

Chicken and egg.

34. aljarry ◴[28 Sep 25 12:21 UTC] No.45403812[source]▶

>>45400500 #

The first one is usually called "explainability".

replies(1): >>45407247 #

35. apwell23 ◴[28 Sep 25 12:42 UTC] No.45403913{4}[source]▶

>>45400474 #

looks like everyone is just BS ing like this CTO person. AI seems ot have attracted the most toxic ppl.

replies(1): >>45406587 #

36. apwell23 ◴[28 Sep 25 12:44 UTC] No.45403925{3}[source]▶

>>45402025 #

no because he is founder cto trying to BS his way into this agent scam.

37. resiros ◴[28 Sep 25 13:17 UTC] No.45404138[source]▶

>>45398467 (OP) #

There is a major mistake in the article. The author argues that openinference is not otel compatible. That is false.

>OpenInference was created specifically for AI applications. It has rich span types like LLM, tool, chain, embedding, agent, etc. You can easily query for "show me all the LLM calls" or "what were all the tool executions." But it's newer, has limited language support, and isn't as widely adopted.

> The tragic part? OpenInference claims to be "OpenTelemetry compatible," but as Pranav discovered, that compatibility is shallow. You can send OpenTelemetry format data to Phoenix, but it doesn't recognize the AI-specific semantics and just shows everything as "unknown" spans.

What is written above is false. Openinference (or for the matter, Openllmetry, and the GenAI otel conventions) are just semantic conventions for otel. Semantic conventions specify how the span's attributes should be name. Nothing more or less. If you are instrumenting an LLM call, you need to specify the model used. Semantic conventions would tell you to save the model name under the attribute `llm_model`. That's it.

Saying OpenInference is not otel compatible does not make any sense.

Saying Phoenix (the vendor) is not otel compatible because it does not show random spans that does not follow its convention, is ... well unfair to say the least (saying this as a competitor in the space).

A vendor is Otel compliant if it has a backend that can ingest data in the otel format. That's it.

Different vendors are compatible with different semconvs. Generalist observability platforms like Signoz don't care about the semantic conventions. They show all spans the same way, as a JSON of attributes. A retrieval span, an LLM call, or a db transaction look all the same in Signoz. They don't render messages and tool calls any different.

LLM observability vendors (like Phoenix, mentioned in the article, or Agenta, the one I am maintaining and shamelessly plugging), care a lot about the semantic conventions. The UI in these vendors are designed for showing AI traces the best way. LLM messages, tool calls, prompt templates, retrieval results are all shown in user friendly ways. As a result the UI needs to understand where each attribute lives. Semantic conventions matter a lot to LLM Observability vendors. Now the point that the article is making is that Phoenix can only understand the Openinference semconvs. That's very different from saying that Phoenix is not Otel compatible.

I've recorded a video talking about OTel, Sem conv and LLM observability. Worth watching for those interested in the space: https://www.youtube.com/watch?v=crEyMDJ4Bp0

38. lovich ◴[28 Sep 25 18:20 UTC] No.45406587{5}[source]▶

>>45403913 #

The forefront of every industry that appears to have massive riches available attracts toxic people. Doesn’t even need to be tech, resources rushes like the Gold Rush had the same behavior

39. _heimdall ◴[28 Sep 25 19:39 UTC] No.45407247{3}[source]▶

>>45403812 #

Well TIL I may have been using the wrong term for years...I could have sworn that problem was termed observability!

Thanks for correcting me there.

40. shykes ◴[28 Sep 25 23:52 UTC] No.45409109{3}[source]▶

>>45399684 #

Note, you can store otel data in clickhouse and augment the schema as needed, and get the best of both worlds. That's what we do and it works great.

41. dat_attack ◴[29 Sep 25 11:10 UTC] No.45412333[source]▶

>>45398467 (OP) #

Big fan of Arize OpenInference and Phoenix

42. 7thpower ◴[30 Sep 25 21:19 UTC] No.45431387{6}[source]▶

>>45399886 #

Sorry for the delayed response!

The main thing was wrestling with the instrumentation vs the out of the box langfuse python decorator that works pretty well for basic use cases.

It’s been a while but I also recall that prompt management and other features in Phoenix weren’t really built out (probably not a goal for them, but I like having that functionality under the same umbrella).

43. olliem36 ◴[05 Oct 25 09:57 UTC] No.45480344{3}[source]▶

>>45400266 #

I think the best way to explain this is to provide an example.

Scenario: A B2B fintech company processes chargebacks on behalf of merchants, this involves dozens of steps which depend on the type & history of the merchant, dispute cardholder. It also involves collection of evidence from the card holder.

There's a couple of key ways that LLMs make this different from manual workflows:

Firstly, the automation is built from a prompt. This is important as it means people who are non-technical and are not necessarily comfortable with non-code tools to pull data from multiple places into a sequence. This increases the adoption of automations as the effort to build & deploy them is lower. In this example, there was no automation in place despite the people who 'own' this process wanting to automate it. No doubt there's a number of reasons for this, one being they found todays workflow builders too hard to use.

Secondly, the collection of 'evidence' to counter a chargeback can be nuanced, which often requiring back and forth with people to explain what is needed and check the evidence is sufficient against a complicated set of guidelines. I'd say a manual submission form that guides people through evidence collection with hundreds of rules subject to the conditions of the dispute and the merchant could do this, but again, this is hard to build and deploy.

Lastly, LLMs monitors the success of the workflow once it's deployed, to help those who are responsible for it measure its impact and effectiveness.

The end result is that a business has successfully built and deployed an automation that they did not have before.

To answer your second question, dynamic routing describes the process of evaluating how complicated a prompt or task is, and then selecting an LLM that's 'best fit' to process it. For example, short & simple prompts should usually get routed to faster but less intelligent LLMs. This typically makes users happier as they get results more quickly. However, more complex prompts may require larger, slower and more intelligent LLMs and techniques such as 'reasoning'. The result will be slower to produce, but will be likely be far more accurate compared to a faster model. In the above example, a larger LLM with reasoning would probably be used.

44. debadyutirc ◴[05 Oct 25 23:22 UTC] No.45486138[source]▶

>>45400037 #

This is awesome. Love seeing more teams investing early in observability and evals instead of treating them as an afterthought.

Your setup (LLM-assessed complexity, semantic success metrics, tool-level telemetry) hits what a lot of orgs miss, tying evaluation and observability together. Most teams stop at traces and latency, but without semantic evals, you can’t really explain or improve behavior.

We’ve seen the same pattern across production agent systems: once you layer in LLM-as-judge evals, distributed tracing, and data quality signals, debugging turns from “black box” to “explainable system.” That’s when scaling becomes viable.

Would love to hear how you’re handling drift or regression detection across those metrics. With CoAgent, we’ve been exploring automated L2–L4 eval loops (semantic, behavioral, business-value levels) and it’s been eye-opening.