
116 points ndhandala | 7 comments
1. drivenextfunc ◴[] No.45085422[source]
Has anyone used OpenTelemetry for long-running batch jobs? OTel seems designed for web apps where spans last seconds/minutes, but batch jobs run for hours or days. Since spans are only submitted after completion, there's no way to track progress during execution, making OTel nearly unusable for batch workloads.

I have a similar issue with Prometheus -- not great for batch job metrics either. It's frustrating how many otherwise excellent OSS tools are optimized for web applications but fall short for batch processing use cases.

replies(5): >>45086098 #>>45087174 #>>45092745 #>>45103972 #>>45107467 #
2. nickzelei ◴[] No.45086098[source]
Hm, from what I’ve seen, OTel emits metrics at a regular interval, just like Prometheus. Maybe I’m thinking of something else though.
3. ekzy ◴[] No.45087174[source]
I’ve implemented OTEL for background jobs: async jobs that get picked up from the DB, where I store the trace context and pass it along to multiple async jobs. Some jobs that fail and retry with a backoff strategy can take many hours, and we can see the traces fine in Grafana. Each job creates its own span, but they are all within the same trace.

Works well for us, I’m not sure I understand the issue you’re facing?
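
For anyone wondering what "store the trace context in the DB" could look like, here is a minimal sketch using only the Python standard library. It is not the OTel SDK: the `jobs` table, `enqueue`, and `run_next_job` are hypothetical names made up for illustration, and the context is serialized in the W3C `traceparent` format (`version-traceid-spanid-flags`). In a real setup the SDK's propagators would presumably do the serializing and parsing.

```python
import secrets
import sqlite3

def new_traceparent():
    """Build a W3C traceparent value: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"

def parse_traceparent(value):
    """Split a traceparent string into its trace id and parent span id."""
    _version, trace_id, span_id, _flags = value.split("-")
    return {"trace_id": trace_id, "parent_span_id": span_id}

# Hypothetical job queue table; the trace context is just another column.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT, traceparent TEXT)")

def enqueue(name, traceparent):
    db.execute("INSERT INTO jobs (name, traceparent) VALUES (?, ?)", (name, traceparent))

def run_next_job():
    """Pick up the oldest job and start a new span inside the stored trace."""
    row = db.execute("SELECT id, name, traceparent FROM jobs ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    job_id, name, traceparent = row
    db.execute("DELETE FROM jobs WHERE id = ?", (job_id,))
    parent = parse_traceparent(traceparent)
    # Each job gets its own span id but reuses the trace id from the DB.
    return {
        "name": name,
        "trace_id": parent["trace_id"],
        "parent_span_id": parent["parent_span_id"],
        "span_id": secrets.token_hex(8),
    }

root = new_traceparent()
enqueue("extract", root)
enqueue("load", root)
first = run_next_job()
second = run_next_job()
```

Both jobs end up with different span ids under the same trace id, which is what lets a backend stitch them into one trace even if they run hours apart.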

replies(1): >>45087236 #
4. ekzy ◴[] No.45087236[source]
OK, after rereading I think your issue is with long-running spans. I think you should break your spans down into smaller chunks. A trace, though, can take many hours or days and still be analysed before it’s finished.
5. scottgg ◴[] No.45092745[source]
You could use span links for this. The idea is you have a bunch of discrete traces that indicate they are downstream or upstream of some other trace. You’d just have to bend it a bit to work in your probably single-process batch executor!
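
A rough sketch of the shape of that idea, in plain Python rather than the OTel SDK. `SpanContext`, `Span`, `start_linked_trace`, and `run_batch` are hypothetical stand-ins written for this example; the only assumption carried over from OTel is that a link is a reference to another span's context, attached when the span is created.

```python
import secrets
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str

@dataclass
class Span:
    name: str
    context: SpanContext
    links: list = field(default_factory=list)  # contexts of related spans

def start_linked_trace(name, link_to=None):
    """Start a root span in a brand-new trace, optionally linked to another span."""
    context = SpanContext(secrets.token_hex(16), secrets.token_hex(8))
    links = [link_to] if link_to is not None else []
    return Span(name, context, links)

def run_batch(chunks):
    """One short trace per chunk, each linked back to the previous chunk's trace."""
    spans = []
    previous = None
    for chunk in chunks:
        span = start_linked_trace(f"process {chunk}", link_to=previous)
        spans.append(span)  # in a real setup the span would end and be exported here
        previous = span.context
    return spans

spans = run_batch(["chunk-0", "chunk-1", "chunk-2"])
```

Each chunk is its own short trace that finishes quickly, and the links let you walk the chain back through the whole batch run.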
6. nucleardog ◴[] No.45103972[source]
Nothing running for days, but sometimes a half hour or so. When the process kicks off it starts a trace, but individual steps of the process create separate spans within that trace (and sometimes further nested spans) that don't run the entire length of the job. As the job progresses, the spans and their related events, logs, etc all appear.

I think this does highlight, to me, the biggest weakness of OTel--the actual documentation and examples for "how to solve problems with this" really suck.

7. dvfjsdhgfv ◴[] No.45107467[source]
> I have a similar issue with Prometheus -- not great for batch job metrics either.

How do you mean? The metrics are available for 15 days by default. What exactly are you missing?