
116 points ndhandala | 1 comments
drivenextfunc ◴[] No.45085422[source]
Has anyone used OpenTelemetry for long-running batch jobs? OTel seems designed for web apps where spans last seconds/minutes, but batch jobs run for hours or days. Since spans are only submitted after completion, there's no way to track progress during execution, making OTel nearly unusable for batch workloads.

I have a similar issue with Prometheus -- not great for batch job metrics either. It's frustrating how many otherwise excellent OSS tools are optimized for web applications but fall short for batch processing use cases.

replies(5): >>45086098 #>>45087174 #>>45092745 #>>45103972 #>>45107467 #
ekzy ◴[] No.45087174[source]
I’ve implemented OTel for background jobs, i.e. async jobs that get picked up from the DB. I store the trace context in the DB and pass it along to multiple async jobs. Some jobs fail and retry with a backoff strategy, so they can take many hours, and we can still see the traces fine in Grafana. Each job creates its own span, but they are all within the same trace.
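To make the "trace context in the DB" idea concrete, here is a minimal stdlib-only sketch. It does not use the OTel SDK: it builds W3C `traceparent` strings by hand, and the `jobs` table, `new_traceparent`, and `child_of` are hypothetical names for illustration. In a real setup the SDK's propagators would presumably produce and parse this value for you.

```python
import secrets
import sqlite3

def new_traceparent(trace_id=None):
    """Build a W3C traceparent value: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)                # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"

def child_of(traceparent):
    """New span in the same trace: keep the trace id, mint a new span id."""
    _version, trace_id, _parent_span_id, _flags = traceparent.split("-")
    return new_traceparent(trace_id)

# Hypothetical job table; the only point is the extra traceparent column.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT, traceparent TEXT)")

# Enqueue: persist the caller's context next to each job.
root = new_traceparent()
for name in ("extract", "transform", "load"):
    db.execute("INSERT INTO jobs (name, traceparent) VALUES (?, ?)", (name, root))
db.commit()

# Worker: possibly hours later, restore the stored context and start a span under it.
job_spans = []
for name, stored in db.execute("SELECT name, traceparent FROM jobs ORDER BY id"):
    job_spans.append((name, child_of(stored)))

trace_ids = {tp.split("-")[1] for _, tp in job_spans}
span_ids = {tp.split("-")[2] for _, tp in job_spans}
```

Every job ends up with its own span id while sharing one trace id, which is what lets a backend stitch the jobs into a single trace regardless of how far apart they run.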

Works well for us. I’m not sure I understand the issue you’re facing?

replies(1): >>45087236 #
1. ekzy ◴[] No.45087236[source]
OK, after rereading, I think your issue is with long-running spans. You should break your spans down into smaller chunks. A trace, on the other hand, can run for many hours or days and still be analysed before it has finished.
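A rough sketch of what "smaller chunks" could look like, again stdlib-only rather than the OTel SDK. The `span` context manager, the `exported` list (a stand-in for an exporter), and `run_batch` are all hypothetical; the assumption is simply that a span becomes visible once it ends.

```python
import secrets
import time
from contextlib import contextmanager

exported = []  # stand-in for an exporter; a real one would send spans to a backend

@contextmanager
def span(name, trace_id, **attributes):
    """Record a span and hand it to the 'exporter' as soon as the block exits."""
    record = {
        "name": name,
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "attributes": attributes,
        "start": time.time(),
    }
    try:
        yield record
    finally:
        record["end"] = time.time()
        exported.append(record)

def run_batch(items, chunk_size):
    trace_id = secrets.token_hex(16)  # one trace for the whole job
    for offset in range(0, len(items), chunk_size):
        chunk = items[offset:offset + chunk_size]
        # One short span per chunk instead of one span around the whole job.
        with span("process_chunk", trace_id, offset=offset, size=len(chunk)):
            for item in chunk:
                pass  # real work would go here
        # The chunk's span is already exported here, while the job keeps running.
    return trace_id

trace_id = run_batch(list(range(10)), chunk_size=4)
```

With ten items and a chunk size of four, three spans are exported one after another, so progress is observable long before the job completes.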