Span = an event (which is basically just a log with an associated trace), plus some data fields. Trace = a log for a request with a unique ID.
A useful thing about OpenTelemetry is that there's auto-instrumentation, so you can get all of this out of the box for most JVM apps. Of course, you could probably log your queries instead, so it's not necessarily a game-changer, but it's a nice-to-have.
Also the standardization is nice.
As I understand it, a metric is information at a point in time.
A span however has a start timestamp and end timestamp, and is about a single operation that happens across that time.
https://opentelemetry.io/docs/specs/otel/metrics/
vs
We run with debug logging on in prod for that reason too. We also ingest insane amounts of data, but for a sufficiently complex and important system it does seem to be worth it to really have it all.
I haven't been asked this question ever. In a way, I wish I was. I wish leadership was engaged in the details of the capabilities of the systems they lead.
But I don't see anyone asking me this question any time soon either.
Imo spans and logs should be understood as the same thing and displayed and queried the same way (it's trivial to add a span ID to each log). It almost feels like people are trying to make something trivially simple seem more substantial or complex.
A while ago I was working on some CUDA kernels for n-body physics simulations. It wasn't too complicated and the end result was generative art. The problem was that it was quite slow and I didn't know why. The core of the application was written in Clojure, so I wrote a simple macro to wrap every function in a ns with a span and then ship all the data to Jaeger. This ended up being exactly what I needed: I found out that the two slowest functions were the data transfer to and from GPU memory and writing out a frame (image) to disk.
In many other places I see the usefulness of this approach, but OTel is too often geared towards HTTP services. Even simple async/queue processing is not as simple as it should be. There have been improvements, though (like span links and trace links).
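For anyone who wants to try the same wrap-everything trick outside Clojure, here is a rough Python sketch of the idea. It is not the original macro: `traced`, `instrument_namespace` and the in-memory `SPANS` list are made-up names for illustration, and a real setup would hand the spans to an exporter instead of a list.

```python
import functools
import time
import types

SPANS = []  # hypothetical in-memory sink standing in for a tracing backend


def traced(fn):
    """Wrap fn so every call records a span: name, start time, end time."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Record the span even if fn raised.
            SPANS.append({"name": fn.__name__, "start": start, "end": time.perf_counter()})
    return wrapper


def instrument_namespace(ns):
    """Replace every public plain function in the dict `ns` with a traced version."""
    for name, value in list(ns.items()):
        if isinstance(value, types.FunctionType) and not name.startswith("_"):
            ns[name] = traced(value)


# Two toy functions standing in for the real simulation steps.
def transfer(data):
    time.sleep(0.01)  # pretend this is a slow copy
    return list(data)


def write_frame(data):
    return len(data)


sim = {"transfer": transfer, "write_frame": write_frame}
instrument_namespace(sim)
frame = sim["write_frame"](sim["transfer"]([1.0, 2.0, 3.0]))

# Rank the recorded spans by duration to see where the time went.
by_duration = sorted(SPANS, key=lambda s: s["end"] - s["start"], reverse=True)
```

Sorting the recorded spans by duration is the same "which function is slow" question that the Jaeger view answered.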
Logs are point in time, spans are a duration. Logs are flat, spans have a hierarchy.
It's the difference between logging a message in a function, and logging the beginning and end of a function while noting the specific instance of the fn caller.
If you have many threads or callers to the same function that difference is critical in tracing causality of failures or any other type of action of note.
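To make that concrete, here is a minimal stdlib-only sketch (not the OTel API; `span`, `log` and `records` are names invented for this example). A log gets one timestamp, while a span gets a start, an end and a pointer to the span that was active when it was opened.

```python
import contextvars
import itertools
import time
from contextlib import contextmanager

_current = contextvars.ContextVar("current_span", default=None)
_ids = itertools.count(1)
records = []  # flat list of everything emitted, in emission order


def log(message):
    """A log: a single point in time, tagged with the active span (if any)."""
    records.append({"kind": "log", "ts": time.time(),
                    "span_id": _current.get(), "msg": message})


@contextmanager
def span(name):
    """A span: start and end timestamps plus a link to the calling span."""
    span_id = next(_ids)
    parent_id = _current.get()
    token = _current.set(span_id)
    start = time.time()
    try:
        yield span_id
    finally:
        _current.reset(token)
        records.append({"kind": "span", "name": name, "span_id": span_id,
                        "parent_id": parent_id, "start": start, "end": time.time()})


def fetch(user):
    with span("fetch"):
        log(f"fetching {user}")


with span("handle_request") as root_id:
    fetch("alice")
    fetch("bob")

spans = [r for r in records if r["kind"] == "span"]
logs = [r for r in records if r["kind"] == "log"]
```

Both `fetch` calls run the same code, but each gets its own span ID with `handle_request` as parent, which is what lets you tell the two callers apart afterwards.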
I've spent countless hours on issues where customers complain about performance or a bug and it just can't be reproduced. Telemetry allows us to get more information to locate and fix these issues.
Feedback welcome!
You should have an answer, right? Like, in your case, you run a lot of logging, and you know why. So if it's off, you say "because it would cost X/million dollars a year and we decided not to do it."
Course, if you're the one who set it up, you should have the receipts on when that decision was made. This can be tricky sometimes because a lot of software dev ICs are strangely insulated from direct budgets, but if you're presented with an option that would be helpful but would cost a ton of money, it's generally a good thing to at least quickly run by someone higher up to confirm the desired direction.
I have a similar issue with Prometheus -- not great for batch job metrics either. It's frustrating how many otherwise excellent OSS tools are optimized for web applications but fall short for batch processing use cases.
Works well for us, I’m not sure I understand the issue you’re facing?
1. Why didn't we catch this sooner
2. Why did it take so long to mitigate
Without the debug logging, #2 can be really tricky sometimes, since you can be flying blind to some deep internal conditional branch firing off.
CLI -> Web API Gateway -> Lambda returning a signed S3 URL -> S3 upload -> SQS -> Lambda which writes to S3 and updates a Dynamo record -> CLI polls for changes
This flow isn't only over HTTP and relies on AWS to fire events. I worked around this by embedding the trace ID into the signed URL metadata. It doesn't look like this is possible with all AWS services.
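A sketch of what carrying the trace context by hand can look like, using the W3C `traceparent` string format (version, trace ID, parent span ID, flags). The helper names and the `"traceparent"` metadata key are assumptions for illustration; how the value actually gets attached to a signed URL or a queue message depends on the service and is not shown.

```python
import re
import secrets

# W3C Trace Context, version 00: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a traceparent string, generating random IDs where none are given."""
    trace_id = trace_id or secrets.token_hex(16)
    span_id = span_id or secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value):
    """Return the parsed fields, or None if the value is malformed."""
    match = _TRACEPARENT.match(value)
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid per the spec
    return {"trace_id": trace_id, "span_id": span_id,
            "sampled": bool(int(flags, 16) & 1)}


# Producer side: stash the context in whatever metadata the hop allows.
metadata = {"traceparent": make_traceparent()}  # hypothetical metadata key

# Consumer side: keep the trace ID, start a new span under the producer's span.
parent = parse_traceparent(metadata["traceparent"])
child = make_traceparent(trace_id=parent["trace_id"])
```

The point is only that the context is a short string, so any hop that can carry a bit of metadata can continue the trace.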
I wonder if X-Ray can help here?
It can also be tedious to initialize spans everywhere. Aspects could help a lot here and orchestrion [0] is a good example of how it could be done in Go. I haven't found an OTEL equivalent yet (though haven't looked hard).
[0] - https://datadoghq.dev/orchestrion/docs/architecture/#code-in...
Maybe the first time I've read a crystal-clear explanation of the difference between metrics, logs and traces.
nice post.
I think this does highlight, to me, the biggest weakness of OTel--the actual documentation and examples for "how to solve problems with this" really suck.
In short, logs can do everything spans can do, depending on how you use them. So really there isn't much distinction.
I'd say that spans and logs are about 95% similar, whereas metrics are wildly dissimilar to both.
Most tools would be better if they treated spans and logs as two nearly-identical things:
- logs should be viewable in a hierarchy if they have a span id associated
- spans should be queryable and countable and alarmable and dashboardable with the same tools as logs
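As a small illustration of the first point, here is a sketch that takes flat log records carrying a span ID and a parent ID and renders them as a hierarchy with per-span counts. The record shape and field names are assumptions made up for this example, not any tool's actual schema.

```python
from collections import defaultdict

# Hypothetical flat log records, each tagged with its span and that span's parent.
logs = [
    {"span_id": "a", "parent_id": None, "msg": "request received"},
    {"span_id": "b", "parent_id": "a", "msg": "query users"},
    {"span_id": "b", "parent_id": "a", "msg": "query done"},
    {"span_id": "c", "parent_id": "a", "msg": "render"},
]


def build_tree(records):
    """Group messages by span and index spans by their parent."""
    messages = defaultdict(list)
    parent_of = {}
    for record in records:
        messages[record["span_id"]].append(record["msg"])
        parent_of.setdefault(record["span_id"], record["parent_id"])
    children = defaultdict(list)
    for span_id, parent_id in parent_of.items():
        children[parent_id].append(span_id)
    return children, messages


def render(children, messages, span_id=None, depth=0):
    """Return indented lines, one per span, with the number of logs in it."""
    lines = []
    for child in children.get(span_id, []):
        lines.append("  " * depth + f"{child}: {len(messages[child])} log(s)")
        lines.extend(render(children, messages, child, depth + 1))
    return lines


children, messages = build_tree(logs)
tree_lines = render(children, messages)
```

The same grouped structure supports the second point too: counting or alerting per span is just a query over `messages`.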
How do you mean? The metrics are available for 15 days by default. What exactly are you missing?