Show HN: Coroot – eBPF-based, open source observability with actionable insights

(github.com)

A common open source approach to observability will begin with databases and visualizations for telemetry - Grafana, Prometheus, Jaeger. But observability doesn’t begin and end here: these tools require configuration, dashboard customization, and may not actually pinpoint the data you need to mitigate system risks.

Coroot was designed to solve the problem of manual, time-consuming observability analysis: it handles the full observability journey, from collecting telemetry to turning it into actionable insights. We also strongly believe that simple observability should be an innovation everyone can benefit from, which is why our software is open source.

Features:

- Cost monitoring to track and minimise your cloud expenses (AWS, GCP, Azure).

- SLO tracking with alerts to detect anomalies and compare them to your system’s baseline behaviour.

- 1-click application profiling: see the exact line of code that caused an anomaly.

- Mapped timeframes (stop digging through Grafana to find when the incident occurred).

- eBPF automatically gathers logs, metrics, traces, and profiles for you.

- Service map to grasp a complete at-a-glance picture of your system.

- Automatic discovery and monitoring of every application deployment in your Kubernetes cluster.

We welcome any feedback and hope the tool can improve your workflow!

1. Conasg ◴[08 Apr 25 19:44 UTC] No.43625722[source]▶

>>43623820 (OP) #

I took a cursory look and I like what I see – the service maps are really good, I love the level of detail. I will say, one thing I'm looking for with this kind of software, to maximise value, is structured logging support, and from what I could see, each log line just has the raw payload currently. Is that something you have on your roadmap?

replies(2): >>43625770 #>>43651589 #

2. toobulkeh ◴[08 Apr 25 19:45 UTC] No.43625734[source]▶

>>43623820 (OP) #

We're on Sentry today, but have been waiting for a fully OSS solution like this.

replies(1): >>43625819 #

3. nikolay_sivko ◴[08 Apr 25 19:49 UTC] No.43625770[source]▶

>>43625722 #

In addition to raw logs, Coroot can extract recurring patterns to generate log-based metrics [1].

We also plan to convert structured logs into OpenTelemetry attributes [2].

[1] https://demo.coroot.com/p/tbuzvelk/applications/default:Depl... [2] https://github.com/coroot/coroot/issues/490
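
To illustrate what "extracting recurring patterns to generate log-based metrics" can mean in general, here is a minimal sketch. It is not Coroot's algorithm: the masking rules and the function names `to_pattern` and `pattern_counts` are assumptions made up for this example. Variable tokens (IP addresses, hex values, numbers) are replaced with placeholders, and each resulting pattern is counted so the count can be exported as a metric.

```python
import re
from collections import Counter

# Hypothetical masking rules (an assumption for this sketch, not Coroot's):
# order matters, so IP addresses are masked before bare numbers.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<hex>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def to_pattern(line: str) -> str:
    """Collapse the variable parts of a raw log line into placeholders."""
    for regex, placeholder in _MASKS:
        line = regex.sub(placeholder, line)
    return line


def pattern_counts(lines):
    """Count how often each pattern occurs; each count is a candidate metric."""
    return Counter(to_pattern(line) for line in lines)


if __name__ == "__main__":
    sample = [
        "connection from 10.0.0.1 timed out after 30 ms",
        "connection from 10.0.0.2 timed out after 45 ms",
        "cache miss for key 17",
    ]
    for pattern, count in pattern_counts(sample).most_common():
        print(count, pattern)
```

The two timeout lines collapse into one pattern, so a spike in that pattern's count becomes visible even though no two raw lines are identical.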

4. nikolay_sivko ◴[08 Apr 25 19:56 UTC] No.43625819[source]▶

>>43625734 #

(I'm a co-founder). At Coroot, we're strong believers in open source, especially when it comes to observability. Agents often require significant privileges, and the cost of switching solutions is high, so being open source is the only way to provide real guarantees for businesses.

5. IOT_Apprentice ◴[08 Apr 25 20:06 UTC] No.43625926[source]▶

>>43623820 (OP) #

Can this also be used in a non-cloud environment? Or even, say, in a local Proxmox-based setup?

replies(1): >>43626056 #

6. esafak ◴[08 Apr 25 20:11 UTC] No.43625974[source]▶

>>43623820 (OP) #

What's the data transformation story for ML on metrics?

replies(1): >>43626078 #

7. nikolay_sivko ◴[08 Apr 25 20:20 UTC] No.43626056[source]▶

>>43625926 #

It only requires a modern Linux kernel. Note: The agent does not support Docker-in-Docker environments, such as KinD or Minikube (D-in-D plugin).

8. nikolay_sivko ◴[08 Apr 25 20:23 UTC] No.43626078[source]▶

>>43625974 #

Coroot builds a model of each system, allowing it to traverse the dependency graph and identify correlations between metrics. On top of that, we're experimenting with LLMs for summarization — here are a few examples: https://oopsdb.coroot.com/failures/cpu-noisy-neighbor/
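
As a rough illustration of "traverse the dependency graph and identify correlations between metrics", the sketch below walks downstream from one service and flags dependencies whose metric moves together with it. This is not Coroot's implementation: the graph shape, the metric series, the function names, and the 0.8 threshold are all assumptions for the example, and plain Pearson correlation stands in for whatever Coroot actually uses.

```python
import math
from collections import deque


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlated_dependencies(graph, metrics, start, threshold=0.8):
    """Breadth-first walk of everything `start` depends on, directly or
    transitively, keeping services whose series correlates with `start`."""
    seen = {start}
    queue = deque([start])
    suspects = []
    while queue:
        service = queue.popleft()
        for dep in graph.get(service, []):
            if dep in seen:
                continue
            seen.add(dep)
            queue.append(dep)
            r = pearson(metrics[start], metrics[dep])
            if r >= threshold:
                suspects.append((dep, r))
    # Strongest correlation first.
    return sorted(suspects, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical topology and latency series (ms), one value per interval.
    graph = {"frontend": ["cart", "catalog"], "cart": ["db"]}
    metrics = {
        "frontend": [10, 11, 30, 12],
        "cart": [5, 5, 20, 6],
        "catalog": [7, 7, 7, 7],
        "db": [2, 2, 9, 2],
    }
    for name, r in correlated_dependencies(graph, metrics, "frontend"):
        print(f"{name}: r={r:.3f}")
```

Correlation alone only narrows the list of suspects; a real root cause analysis would also need to consider direction of causality and timing.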

replies(1): >>43626528 #

9. akdor1154 ◴[08 Apr 25 20:50 UTC] No.43626319[source]▶

>>43623820 (OP) #

I already have OpenTelemetry traces and logs going to ClickHouse with the ClickHouse OTel exporter.

Can I use Coroot to show my existing data, without it taking control of my DDL?

replies(1): >>43626348 #

10. nikolay_sivko ◴[08 Apr 25 20:53 UTC] No.43626348[source]▶

>>43626319 #

Initially, we relied on the ClickHouse OTEL exporter and its schema, but for performance optimization, we decided to modify our ClickHouse schema, and they are no longer compatible :(

replies(1): >>43626454 #

11. akdor1154 ◴[08 Apr 25 21:04 UTC] No.43626454{3}[source]▶

>>43626348 #

Bummer, it'd be awesome if I could point it at data I already have, even if that meant a reduced feature set.

replies(1): >>43632132 #

12. esafak ◴[08 Apr 25 21:12 UTC] No.43626528{3}[source]▶

>>43626078 #

That looks like a built-in feature. I'm asking about extensibility. How do we use custom metrics transformations (libraries), for example?

replies(1): >>43626570 #

13. nikolay_sivko ◴[08 Apr 25 21:18 UTC] No.43626570{4}[source]▶

>>43626528 #

Currently, you can define custom SLIs (Service Level Indicators, such as service latency or error rate) for each service using PromQL queries. In the future, you'll be able to define custom metrics for each application, including explanations of their meaning, so they can be leveraged in Root Cause Analysis.
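
For readers unfamiliar with SLIs defined as queries, an availability SLI is typically the ratio of two PromQL expressions, compared against an objective. The sketch below shows that arithmetic. The metric name `http_requests_total`, the `service` and `status` labels, and the 99% objective are assumptions for illustration; the thread does not show how Coroot expects these queries to be entered.

```python
# Hypothetical PromQL expressions for an availability SLI.
# Metric and label names are assumptions, not taken from Coroot.
TOTAL_REQUESTS = 'sum(rate(http_requests_total{service="checkout"}[5m]))'
FAILED_REQUESTS = (
    'sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))'
)


def availability(total: float, failed: float) -> float:
    """Fraction of successful requests; no traffic counts as fully available."""
    if total <= 0:
        return 1.0
    return 1.0 - failed / total


def meets_objective(total: float, failed: float, objective: float = 0.99) -> bool:
    """True when the measured SLI is at or above the SLO objective."""
    return availability(total, failed) >= objective


if __name__ == "__main__":
    # Values as they might come back from evaluating the two queries above.
    total_rps, failed_rps = 200.0, 4.0
    print(f"availability={availability(total_rps, failed_rps):.3f}")
    print("objective met:", meets_objective(total_rps, failed_rps))
```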

14. bryancoxwell ◴[08 Apr 25 23:49 UTC] No.43627584[source]▶

>>43623820 (OP) #

This is somewhat off topic, but are there any common uses for eBPF outside of observability/monitoring? Or is that kind of its whole thing?

replies(2): >>43628235 #>>43628300 #

15. mrbluecoat ◴[09 Apr 25 00:33 UTC] No.43627809[source]▶

>>43623820 (OP) #

Can it parse Zeek logs to identify long-running TCP connections and/or identify user attempts to access a DNS-blocked domain?

replies(1): >>43632363 #

16. benjamin_mahler ◴[09 Apr 25 01:57 UTC] No.43628235[source]▶

>>43627584 #

Yes, one example: network bandwidth isolation is done more efficiently using eBPF: https://netdevconf.info/0x14/pub/papers/55/0x14-paper55-talk...

17. mmckeen ◴[09 Apr 25 02:11 UTC] No.43628300[source]▶

>>43627584 #

Also commonly used for high-performance networking and security use cases, for example https://isovalent.com/blog/post/cilium-netkit-a-new-containe....

Basically anywhere you'd previously need to write a kernel module but now can have user space run arbitrary kernel code that's secure and won't crash the kernel.

You can also now write custom schedulers in eBPF with sched_ext.

18. tureg ◴[09 Apr 25 02:12 UTC] No.43628308[source]▶

>>43623820 (OP) #

Thanks for sharing! If the connections are TLS-enabled, can Coroot still display the associated telemetry?

replies(1): >>43628888 #

19. nikolay_sivko ◴[09 Apr 25 04:31 UTC] No.43628888[source]▶

>>43628308 #

Yes, it captures traffic before encryption and after decryption using eBPF uprobes on OpenSSL and Go’s TLS library calls.

20. fjwuafasd ◴[09 Apr 25 04:38 UTC] No.43628910[source]▶

>>43623820 (OP) #

I like what I see. What are the differences between the enterprise and community editions?

replies(1): >>43629096 #

21. nikolay_sivko ◴[09 Apr 25 05:16 UTC] No.43629096[source]▶

>>43628910 #

Enterprise Edition = Community Edition + Support + AI-based Root Cause Analysis + SSO + RBAC

replies(1): >>43632295 #

22. emmanueloga_ ◴[09 Apr 25 05:54 UTC] No.43629274[source]▶

>>43623820 (OP) #

I looked into eBPF-based observability tools for k8s some time ago and found at least four tools that look incredibly similar: Pixie, Parca, Coroot, and Odigos. There are probably others I missed too. Do you have any thoughts about this?

From a user perspective, having several tools that overlap heavily but differ in subtle ways makes evaluation and adoption harder. It feels like if any two of these projects consolidated, they’d have a good shot at becoming the "default" eBPF observability solution.

replies(2): >>43629402 #>>43630035 #

23. nikolay_sivko ◴[09 Apr 25 06:18 UTC] No.43629402[source]▶

>>43629274 #

From a user’s perspective, it doesn’t really matter how the data is collected. What actually matters is whether the tool helps you answer questions about your system and figure out what’s going wrong.

At Coroot, we use eBPF for a couple of reasons:

1. To get the data we actually need, not just whatever happens to be exposed by the app or OS.

2. To make integration fast and automatic for users.

And let’s be real, if all the right data were already available, we wouldn’t be writing all this complicated eBPF code in the first place :)

24. maknee ◴[09 Apr 25 07:15 UTC] No.43629627[source]▶

>>43623820 (OP) #

Great work! It's nice seeing another observability tool. Demo is neat and easy to navigate.

Couple of questions:

What's the overhead of tracing + logging observed by users? I see many tools being built on top of the OpenTelemetry eBPF tracer, which is nice to see.

The OpenTelemetry eBPF tracer uses sampling to capture traces. Do other types of logging in the tool use sampling as well (HTTP traces)?

When finding SLO violations, can this tool find the bug if the latency spikes do not happen frequently (i.e., latency spikes happen once every 5 minutes to 1 hour)? I'm curious whether the team has experienced such events, and whether those pmax latencies even matter to customers, since they may not happen frequently.

I see that the flamegraph is a CPU flamegraph - does off-CPU sampling matter (disk, network, etc.)? Or does the CPU flamegraph provide enough for developers to solve the issue?

replies(1): >>43630014 #

25. nikolay_sivko ◴[09 Apr 25 08:24 UTC] No.43630014[source]▶

>>43629627 #

1. Regarding overhead — we ran a benchmark focused on performance impact rather than raw overhead [1]. TL;DR: we didn’t observe any noticeable impact at 10K RPS. CPU usage stayed around 200 millicores (about 20% of a single core).

2. Coroot’s agent captures pseudo-traces (individual spans) and sends them to a collector via OTLP. This stream can be sampled at the collector level. In high-load environments, you can disable span capturing entirely and rely solely on eBPF-based metrics for analysis.

3. We’ve built automated root cause analysis to help users explain even the slightest anomalies, whether or not SLOs are violated. Under the hood, it traverses the service dependency graph and correlates metrics — for example, linking increased service latency to CPU delay or network latency to a database. [2]

4. Currently, Coroot doesn’t support off-CPU profiling. The profiler we use under the hood is based on Grafana Pyroscope’s eBPF implementation, which focuses on CPU time.

[1]: https://docs.coroot.com/installation/performance-impact [2]: https://demo.coroot.com/p/tbuzvelk/anomalies/default:Deploym...
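
Point 2 above mentions sampling the span stream at the collector. The thread does not say which sampling method is used, so the following is only a generic sketch of deterministic, trace-ID-based sampling; the function name `keep_span` and the use of SHA-256 are assumptions for this example, not the OpenTelemetry Collector's or Coroot's implementation. Hashing the trace ID means every span of a given trace gets the same keep/drop decision.

```python
import hashlib


def keep_span(trace_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of all traces.

    The first 8 bytes of the hash are read as an integer in [0, 2**64)
    and compared against a threshold proportional to the sample rate.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    threshold = int(sample_rate * 2**64)
    return value < threshold


if __name__ == "__main__":
    trace_ids = [f"trace-{i}" for i in range(10_000)]
    kept = sum(keep_span(t, 0.25) for t in trace_ids)
    print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, several collector instances can sample independently and still agree on which traces to keep.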

26. edenfed ◴[09 Apr 25 08:28 UTC] No.43630035[source]▶

>>43629274 #

Speaking for Odigos (disclosure: I’m the creator), here are two significant differences between us and the other mentioned players:

- Accurate distributed traces with eBPF, including context propagation. Without going into other tools, I highly recommend trying to generate distributed traces using any other eBPF solution and observing the results firsthand.

- We are agent-only. Our data is produced in OpenTelemetry format, allowing you to integrate it seamlessly with your existing observability system.

I hope this clarifies the differences.

replies(1): >>43632053 #

27. PeterZaitsev ◴[09 Apr 25 13:47 UTC] No.43632053{3}[source]▶

>>43630035 #

I wonder if anyone tried to integrate Odigos with Coroot - looks like it could be really powerful!

28. PeterZaitsev ◴[09 Apr 25 13:53 UTC] No.43632132{4}[source]▶

>>43626454 #

How are you using this data right now? If you plan to use Coroot for visualization, why not convert it to the more efficient format Coroot uses?

29. fjwuafasd ◴[09 Apr 25 14:05 UTC] No.43632295{3}[source]▶

>>43629096 #

Thank you!

30. nikolay_sivko ◴[09 Apr 25 14:11 UTC] No.43632363[source]▶

>>43627809 #

We could totally add that, but no one's asked for it so far.

31. valyala ◴[11 Apr 25 08:29 UTC] No.43651589[source]▶

>>43625722 #

It would be great to use VictoriaLogs as storage for structured logs in Coroot, since it is optimized for structured logs with arbitrary sets of labels. See https://docs.victoriametrics.com/victorialogs/keyconcepts/
