What Are Traces and Spans in OpenTelemetry?

(oneuptime.com)

1. lucketone ◴[31 Aug 25 10:00 UTC] No.45081941[source]▶

>>45038570 (OP) #

Nice summary at the start.

2. N_Lens ◴[31 Aug 25 10:41 UTC] No.45082128[source]▶

>>45038570 (OP) #

OTEL as a set of standards is admirable and ambitious, though in my experience actual implementation differs significantly between different vendors and they all seem to overcomplicate it.

replies(1): >>45082689 #

3. geoffbp ◴[31 Aug 25 10:46 UTC] No.45082143[source]▶

>>45038570 (OP) #

Good article. thanks for sharing

4. psnehanshu ◴[31 Aug 25 12:01 UTC] No.45082515[source]▶

>>45038570 (OP) #

The amount of additional code that it needs is horrible. We will now have to spend more brain juice on telemetry when working on a feature.

replies(6): >>45082539 #>>45082656 #>>45082860 #>>45083156 #>>45084271 #>>45084719 #

5. bavell ◴[31 Aug 25 12:05 UTC] No.45082539[source]▶

>>45082515 #

Nah, if you have an important application this is very low cost for adding tons of insight into how your app is running.

6. SirHackalot ◴[31 Aug 25 12:26 UTC] No.45082656[source]▶

>>45082515 #

It’s really not that bad; integrating it with dashboards is where I found most of the difficulty (due to bad documentation). I spent 4 days implementing observability for this new backend project I’m working on. OTEL logging, tracing, and metric emission took less than a day to implement, and instrumentation was very well documented. When I tried to integrate with Grafana dashboards, that’s when things started getting pretty frustrating…

7. eurekin ◴[31 Aug 25 12:31 UTC] No.45082689[source]▶

>>45082128 #

Plus the tens of terabytes of data you have to store for a week's worth of traces

replies(1): >>45083511 #

8. zug_zug ◴[31 Aug 25 13:03 UTC] No.45082843[source]▶

>>45038570 (OP) #

This is sort of all just a reframing of existing technologies.

Span = an event (which is basically just a log with an associated trace), and some data fields. Trace = a log for a request with a unique ID.

A useful thing about opentelemetry is that there's auto-instrumentation so you can get this all out-of-the-box for most JVM apps. Of course you could probably log your queries instead, so it's not necessarily a game-changer but a nice-to-have.

Also the standardization is nice.

replies(3): >>45083789 #>>45083847 #>>45084887 #

9. matsemann ◴[31 Aug 25 13:05 UTC] No.45082860[source]▶

>>45082515 #

I don't really agree. It's mostly setup done once: configuring it and, for example, attaching some span generator to the library you use to talk to the database. Then future queries get it "for free". And it's just a single line if you want something custom, for instance an annotation in Java or a "with" block in Python.

10. rednafi ◴[31 Aug 25 13:23 UTC] No.45082979[source]▶

>>45038570 (OP) #

What clicked for me is:

A span is a key-value attribute about some point in time event

A trace is a DAG of spans that tells you a story about some related events

replies(1): >>45083131 #

11. mihaitodor ◴[31 Aug 25 13:35 UTC] No.45083059[source]▶

>>45038570 (OP) #

While I do like the comprehensive writeup, there's something about the style which triggers my "it's AI-generated" reflex...

12. eterm ◴[31 Aug 25 13:48 UTC] No.45083131[source]▶

>>45082979 #

What do you mean exactly by "point in time event"?

As I understand it, a metric is information at a point in time.

A span however has a start timestamp and end timestamp, and is about a single operation that happens across that time.

https://opentelemetry.io/docs/specs/otel/metrics/

https://opentelemetry.io/docs/specs/otel/trace/api/#span
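
To make the start/end distinction concrete, here is a minimal sketch using only the standard library. It is a toy model, not the OpenTelemetry SDK: the names ToySpan and toy_span are made up for illustration. The point is just that a span brackets one operation with two timestamps, where a point-in-time record would carry only one.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class ToySpan:
    """Hypothetical stand-in for a span: one operation with a start and an end."""
    name: str
    start_ns: int = 0
    end_ns: int = 0
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ns(self) -> int:
        return self.end_ns - self.start_ns


@contextmanager
def toy_span(name: str, **attributes):
    # Record the start when the block is entered...
    span = ToySpan(name=name, start_ns=time.monotonic_ns(), attributes=dict(attributes))
    try:
        yield span
    finally:
        # ...and the end when it exits, even if the block raised.
        span.end_ns = time.monotonic_ns()


with toy_span("load-user", user_id=42) as span:
    time.sleep(0.01)  # the operation being timed

print(span.name, span.duration_ns)
```

The duration falls out of the two timestamps, which is why "where was time spent" is a question spans can answer directly.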

13. gazpacho ◴[31 Aug 25 13:52 UTC] No.45083156[source]▶

>>45082515 #

I work for Pydantic. We make Logfire, a commercial OTEL backend. But we’ve made wrappers around the OTEL SDKs in various languages that simplify configuration and usage. They can be used with any OTEL compatible backend (although we’d love if you try our SaaS offering):

- JavaScript / TypeScript: https://github.com/pydantic/logfire-js
- Rust: https://github.com/pydantic/logfire-rust
- Python: https://github.com/pydantic/logfire

14. c2h5oh ◴[31 Aug 25 14:42 UTC] No.45083511{3}[source]▶

>>45082689 #

That's why you sample just enough instead of storing everything

replies(3): >>45083636 #>>45083681 #>>45083814 #

15. voidfunc ◴[31 Aug 25 14:56 UTC] No.45083636{4}[source]▶

>>45083511 #

That sounds great until you have a massive issue that costs the company real money and leadership asks why you weren't logging everything in full fidelity?

We run with Debug logging on in prod for that reason too. We also ingest insane amounts of data but it does seem to be worth it for a sufficiently complex and important enough system to really have it all.

replies(3): >>45084058 #>>45084941 #>>45094434 #

16. eurekin ◴[31 Aug 25 15:00 UTC] No.45083681{4}[source]▶

>>45083511 #

We do. 0.5%

17. jeffbee ◴[31 Aug 25 15:12 UTC] No.45083789[source]▶

>>45082843 #

I always preach the isomorphism between traces and logs, but you left out the key thing. A span is a log entry associated with a trace, but the other key attributes of the span are its own unique identifier and a reference to the other event that caused the event. With those three attributes you can interpret the trace as a causal graph.
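
A rough sketch of that idea, assuming flat records that carry only those three attributes plus a name. SpanRecord and render are hypothetical names, and every record is assumed to belong to the same trace:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root of the trace
    name: str


def children_by_parent(spans):
    """Index spans by the id of the span that caused them."""
    index = defaultdict(list)
    for span in spans:
        index[span.parent_id].append(span)
    return index


def render(spans, parent_id=None, depth=0, index=None):
    """Return the trace as indented lines, parents before their children."""
    index = children_by_parent(spans) if index is None else index
    lines = []
    for span in index.get(parent_id, []):
        lines.append("  " * depth + span.name)
        lines.extend(render(spans, span.span_id, depth + 1, index))
    return lines


spans = [
    SpanRecord("t1", "a", None, "GET /checkout"),
    SpanRecord("t1", "b", "a", "load cart"),
    SpanRecord("t1", "c", "a", "charge card"),
    SpanRecord("t1", "d", "c", "POST payment-provider"),
]
print("\n".join(render(spans)))
```

Nothing here needs the records to arrive in order or from one process; the parent reference alone is enough to rebuild the graph.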

replies(1): >>45084086 #

18. vlovich123 ◴[31 Aug 25 15:15 UTC] No.45083814{4}[source]▶

>>45083511 #

Sampling unconditionally at the start of the request is worth less than sampling at the end (so that you can sample 1% of successful traces and keep 100% of traces with issues).
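
As a sketch of what deciding at the end could look like, assuming each finished trace is reduced to an id and an error flag. The keep_trace function and the 1% default are illustrative, not any collector's actual configuration:

```python
import hashlib


def keep_trace(trace_id: str, had_error: bool, success_rate: float = 0.01) -> bool:
    """Decide, after the trace has finished, whether to store it."""
    if had_error:
        return True  # always keep traces with issues
    # Hash the trace id into a 64-bit bucket so the decision is stable
    # no matter which process evaluates it.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < success_rate * 2**64


# 10,000 hypothetical finished traces, one in fifty of them with an error.
finished = [(f"trace-{i}", i % 50 == 0) for i in range(10_000)]
kept = [tid for tid, had_error in finished if keep_trace(tid, had_error)]
print(len(kept), "of", len(finished), "traces kept")
```

The cost is that every span has to be buffered somewhere until its trace is complete, which is the part head sampling avoids.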

19. tnolet ◴[31 Aug 25 15:20 UTC] No.45083847[source]▶

>>45082843 #

yeah, but spans can have events!

replies(1): >>45090892 #

20. alkonaut ◴[31 Aug 25 15:40 UTC] No.45084033[source]▶

>>45038570 (OP) #

Trying to use OTel in any scenario outside of web backends, such as desktop, is a frustrating exercise in trying to find exactly what small subset you should use. I wish they had more examples of other types of software.

replies(1): >>45084120 #

21. evidencetamper ◴[31 Aug 25 15:42 UTC] No.45084058{5}[source]▶

>>45083636 #

> and leadership asks why you weren't logging everything in full fidelity?

I haven't been asked this question ever. In a way, I wish I was. I wish leadership was engaged in the details of the capabilities of the systems they lead.

But I don't expect anyone to ask me this question any time soon either.

replies(1): >>45085449 #

22. zug_zug ◴[31 Aug 25 15:46 UTC] No.45084086{3}[source]▶

>>45083789 #

True. I think I’m emphasizing their similarities because what I’m seeing is companies treating them as unrelated (e.g. Splunk and SignalFx making entirely different query languages and visualization tools for logs vs spans)

Imo spans and logs should be understood as the same and displayed and queried the same (it’s trivial to add span id to each log), it almost feels like people are trying to make something trivially simple seem more substantial or complex

replies(1): >>45084268 #

23. sdedovic ◴[31 Aug 25 15:51 UTC] No.45084120[source]▶

>>45084033 #

I agree. An anecdote:

A while ago I was working on some CUDA kernels for n-body physics simulations. It wasn’t too complicated and the end result was generative art. The problem was that it was quite slow and I didn’t know why. Well the core of the application was written in Clojure so I wrote a simple macro to wrap every function in a ns with a span and then ship all the data to jaeger. This ended up being exactly what I needed - I found out that the two slowest functions were data transfer between the GPU memory and writing out a frame (image) to my disk.

In many other places I see the usefulness of this approach but OTel is too often too geared towards HTTP services. Even simple async/queue processing is not as simple. Though, there have been improvements (like span links and trace links).

24. BoiledCabbage ◴[31 Aug 25 16:11 UTC] No.45084268{4}[source]▶

>>45084086 #

Traces and spans can be extended from or added to existing logging, but they aren't the same.

Logs are point in time, spans are a duration. Logs are flat, spans have a hierarchy.

It's the difference between logging a message in a function, and logging the beginning and end of a function while noting the specific instance of the fn caller.

If you have many threads or callers to the same function that difference is critical in tracing causality of failures or any other type of action of note.

replies(1): >>45105816 #

25. pm90 ◴[31 Aug 25 16:11 UTC] No.45084271[source]▶

>>45082515 #

There’s certainly some overhead, nothing is free. But the tradeoff is better insight into your system and better tools to validate issues when they arise. It can be very powerful in those scenarios.

I’ve spent countless hours on issues where customers complain about performance or a bug and it just can’t be reproduced. Telemetry allows us to get more information to locate and fix these issues.

26. andoando ◴[31 Aug 25 16:50 UTC] No.45084667[source]▶

>>45038570 (OP) #

Is there anything that wraps multiple requests?

replies(1): >>45085733 #

27. diegojromero ◴[31 Aug 25 16:55 UTC] No.45084719[source]▶

>>45082515 #

Thanks for your comment! It has given me an idea for a project: a simple library that provides a Python decorator that can be used to include basic telemetry for functions in Python code (think spans with the input parameters as attributes): https://github.com/diegojromerolopez/otelize

Feedback welcome!
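
For readers wondering what such a decorator might look like, here is a standard-library sketch of the general idea. It is not the otelize API: traced and RECORDED_SPANS are hypothetical names, and the "span" is a plain dict appended to a list rather than something exported.

```python
import functools
import inspect
import time

RECORDED_SPANS = []  # hypothetical in-memory sink; a real setup would export spans


def traced(func):
    """Record one span-like dict per call, with the call arguments as attributes."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Map positional and keyword arguments onto parameter names.
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        record = {
            "name": func.__name__,
            "attributes": {key: repr(value) for key, value in bound.arguments.items()},
            "start_ns": time.monotonic_ns(),
            "error": None,
        }
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            record["error"] = type(exc).__name__
            raise
        finally:
            record["end_ns"] = time.monotonic_ns()
            RECORDED_SPANS.append(record)

    return wrapper


@traced
def resize(width, height, keep_ratio=True):
    return (width // 2, height // 2)


resize(800, 600)
print(RECORDED_SPANS[-1]["attributes"])
```

One design caveat with recording every argument: passwords, tokens, and large payloads end up in telemetry too, so an allowlist of parameter names is probably worth having.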

28. bboreham ◴[31 Aug 25 17:12 UTC] No.45084887[source]▶

>>45082843 #

Span has a beginning and an end time. Event typically just has a time when it happened.

29. majormajor ◴[31 Aug 25 17:18 UTC] No.45084941{5}[source]▶

>>45083636 #

> That sounds great until you have a massive issue that costs the company real money and leadership asks why you weren't logging everything in full fidelity?

You should have an answer, right? Like, in your case, you run a lot of logging, and you know why. So if it's off, you say "because it would cost X/million dollars a year and we decided not to do it."

Course, if you're the one who set it up, you should have the receipts on when that decision was made. This can be tricky sometimes because a lot of software dev ICs are strangely insulated from direct budgets, but if you're presented with an option that would be helpful but would cost a ton of money, it's generally a good thing to at least quickly run by someone higher up to confirm the desired direction.

30. drivenextfunc ◴[31 Aug 25 18:08 UTC] No.45085422[source]▶

>>45038570 (OP) #

Has anyone used OpenTelemetry for long-running batch jobs? OTel seems designed for web apps where spans last seconds/minutes, but batch jobs run for hours or days. Since spans are only submitted after completion, there's no way to track progress during execution, making OTel nearly unusable for batch workloads.

I have a similar issue with Prometheus -- not great for batch job metrics either. It's frustrating how many otherwise excellent OSS tools are optimized for web applications but fall short for batch processing use cases.

replies(5): >>45086098 #>>45087174 #>>45092745 #>>45103972 #>>45107467 #

31. no_wizard ◴[31 Aug 25 18:10 UTC] No.45085449{6}[source]▶

>>45084058 #

Have you ever been asked “why didn’t we catch this sooner?”. I feel like it’s the same question worded differently

replies(1): >>45089399 #

33. mdaniel ◴[31 Aug 25 18:37 UTC] No.45085733[source]▶

>>45084667 #

I doubt "wraps" but almost certainly what you're shopping for is a correlation identifier on the (logs, traces, metrics) that would enable you to group the related requests. Sometimes just the session id can get you where you want to go, but in more complicated setups you may have to annotate from the client side to indicate "I'm doing these 5 things as part of this one logical operation"
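
One way to sketch the correlation-identifier idea with just the standard library: set an id once per logical operation and stamp it on every log line. The logger name, the step names, and run_logical_operation are made up for the example.

```python
import contextvars
import io
import logging
import uuid

# Holds the id of the logical operation the current code is running under.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation id onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


stream = io.StringIO()  # stands in for wherever the logs are shipped
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(correlation_id)s %(message)s"))
handler.addFilter(CorrelationFilter())
logger = logging.getLogger("checkout-demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def run_logical_operation(steps):
    """Run several requests under one shared correlation id and return that id."""
    token = correlation_id.set(uuid.uuid4().hex)
    try:
        for step in steps:
            logger.info("request %s", step)
        return correlation_id.get()
    finally:
        correlation_id.reset(token)


op_id = run_logical_operation(["create-cart", "add-item", "pay"])
lines = stream.getvalue().splitlines()
print("\n".join(lines))
```

Grouping the related requests afterwards is then a filter on that one field, whatever backend the logs end up in.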

34. nickzelei ◴[31 Aug 25 19:12 UTC] No.45086098[source]▶

>>45085422 #

Hm from what I’ve seen it emits metrics at a regular interval just like Prometheus. Maybe I’m thinking of something else though.

35. ekzy ◴[31 Aug 25 21:16 UTC] No.45087174[source]▶

>>45085422 #

I’ve implemented OTEL for background jobs, so async jobs that get picked up from the DB where I store the trace context in the DB and pass it along to multiple async jobs. For some jobs that fail and retry with a backoff strategy, they can take many hours and we can see the traces fine in Grafana. Each job creates its own span but they are all within the same trace.

Works well for us, I’m not sure I understand the issue you’re facing?
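
Roughly, storing the trace context next to the job could look like the sketch below. It assumes the context is kept as a W3C traceparent string (version, trace id, span id, flags) in a hypothetical jobs table; new_traceparent and trace_id_of are illustrative helpers, and in a real setup the SDK's propagators would do the encoding and decoding.

```python
import secrets
import sqlite3


def new_traceparent(trace_id=None):
    """Build a W3C traceparent value: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)                # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"          # 01 = sampled


def trace_id_of(traceparent):
    return traceparent.split("-")[1]


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT, traceparent TEXT)")

# Producer: the request that enqueues work saves its context next to each job.
request_ctx = new_traceparent()
db.executemany(
    "INSERT INTO jobs (payload, traceparent) VALUES (?, ?)",
    [("send-email", request_ctx), ("build-report", request_ctx)],
)

# Worker: possibly hours later, each job opens a new span inside the same trace.
worker_contexts = []
for job_id, payload, stored in db.execute("SELECT id, payload, traceparent FROM jobs"):
    worker_contexts.append(new_traceparent(trace_id=trace_id_of(stored)))

print(request_ctx)
print(worker_contexts)
```

Each worker span gets a fresh span id but reuses the stored trace id, which is what lets the backend stitch retries and fan-out into one trace.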

replies(1): >>45087236 #

36. ekzy ◴[31 Aug 25 21:24 UTC] No.45087236{3}[source]▶

>>45087174 #

OK, after rereading, I think your issue is with long-running spans; you should break your spans down into smaller chunks. A trace, though, can take many hours or days and be analysed even when it’s not finished

37. voidfunc ◴[01 Sep 25 04:17 UTC] No.45089399{7}[source]▶

>>45085449 #

It's really two questions:

1. Why didn't we catch this sooner

2. Why did it take so long to mitigate

Without the debug logging, #2 can be really tricky sometimes, since you can be flying blind to some deep internal conditional branch firing off.

38. digianarchist ◴[01 Sep 25 08:00 UTC] No.45090585[source]▶

>>45038570 (OP) #

I've been tasked with adding telemetry to an AWS based service at work:

CLI -> Web API Gateway -> Lambda returning a signed S3 URL -> S3 upload -> SQS -> Lambda which writes to S3 and updates a Dynamo record -> CLI polls for changes

This flow isn't only over HTTP and relies on AWS to fire events. I worked around this by embedding the trace ID into the signed URL metadata. It doesn't look like this is possible with all AWS services.

I wonder if X-Ray can help here?

It can also be tedious to initialize spans everywhere. Aspects could help a lot here and orchestrion [0] is a good example of how it could be done in Go. I haven't found an OTEL equivalent yet (though haven't looked hard).

[0] - https://datadoghq.dev/orchestrion/docs/architecture/#code-in...

replies(1): >>45092756 #

39. freetonik ◴[01 Sep 25 08:59 UTC] No.45090892{3}[source]▶

>>45083847 #

So, events are recursive?

40. QuiCasseRien ◴[01 Sep 25 10:02 UTC] No.45091268[source]▶

>>45038570 (OP) #

> Metrics tell you what changed. Logs tell you why something happened. Traces tell you where time was spent and how a request moved across your system.

maybe the first time i read a crystal clear difference between metrics, logs and traces.

nice post.

41. scottgg ◴[01 Sep 25 14:01 UTC] No.45092745[source]▶

>>45085422 #

You could use span links for this. The idea is you have a bunch of discrete traces that indicate they are downstream or upstream of some other trace. You’d just have to bend it a bit to work in your probably single process batch executor !

42. scottgg ◴[01 Sep 25 14:02 UTC] No.45092756[source]▶

>>45090585 #

There’s an OTel SIG to do something similar / based on orchestrion and some other prior art - so just a matter of time !

43. TYPE_FASTER ◴[01 Sep 25 16:59 UTC] No.45094434{5}[source]▶

>>45083636 #

I’ve used feature flags to manage logging verbosity and sample rate. It’s really nice to be able to go from logging very little to incrementally pump up the volume when there’s an incident.

44. nucleardog ◴[02 Sep 25 14:58 UTC] No.45103972[source]▶

>>45085422 #

Nothing running for days, but sometimes a half hour or so. When the process kicks off it starts a trace, but individual steps of the process create separate spans within that trace (and sometimes further nested spans) that don't run the entire length of the job. As the job progresses, the spans and their related events, logs, etc all appear.

I think this does highlight, to me, the biggest weakness of OTel--the actual documentation and examples for "how to solve problems with this" really suck.

45. zug_zug ◴[02 Sep 25 17:01 UTC] No.45105816{5}[source]▶

>>45084268 #

Logs can represent point in time, or spans, or anything you choose. Logs can have the span-id attached to them (and normally should), so they are hierarchical.

In short, logs can do everything spans can do, depending on how you use them. So really there isn't much distinction.

I'd say that spans and logs are about 95% similar, whereas metrics are wildly dissimilar to both.

Most tools would be better if they treated spans and logs as two nearly-identical things:

- logs should be viewable in a hierarchy if they have a span id associated
- spans should be queryable and countable and alarmable and dashboardable with the same tools as logs

46. dvfjsdhgfv ◴[02 Sep 25 18:55 UTC] No.45107467[source]▶

>>45085422 #

> I have a similar issue with Prometheus -- not great for batch job metrics either.

How do you mean? The metrics are available for 15 days by default. What exactly are you missing?
