
162 points | openWrangler | 1 comment

A common open source approach to observability begins with databases and visualizations for telemetry: Grafana, Prometheus, Jaeger. But observability doesn’t begin and end here: these tools require configuration and dashboard customization, and may still not pinpoint the data you need to mitigate system risks.

Coroot was designed to solve the problem of manual, time-consuming observability analysis: it handles the full observability journey, from collecting telemetry to turning it into actionable insights. We also strongly believe that simple observability should be an innovation everyone can benefit from, which is why our software is open source.

Features:

- Cost monitoring to track and minimise your cloud expenses (AWS, GCP, Azure).

- SLO tracking with alerts to detect anomalies and compare them to your system’s baseline behaviour.

- 1-click application profiling: see the exact line of code that caused an anomaly.

- Mapped timeframes (stop digging through Grafana to find when the incident occurred).

- eBPF automatically gathers logs, metrics, traces, and profiles for you.

- Service map to grasp a complete at-a-glance picture of your system.

- Automatic discovery and monitoring of every application deployment in your Kubernetes cluster.

We welcome any feedback and hope the tool can improve your workflow!

maknee No.43629627

Great work! It's nice seeing another observability tool. Demo is neat and easy to navigate.

Couple of questions:

What's the overhead of tracing + logging observed by users? I see many tools being built on top of the OpenTelemetry eBPF tracer, which is nice to see.

The OpenTelemetry eBPF tracer uses sampling to capture traces. Do other types of logging in the tool (e.g., HTTP traces) use sampling as well?

When finding SLO violations, can this tool find the bug if the latency spikes do not happen frequently (i.e., a spike happens only once every 5 minutes to 1 hour)? I'm curious whether the team has experienced such events, and whether those pmax latencies even matter to customers, since they may not happen frequently.

I see that the flamegraph is a CPU flamegraph. Does off-CPU sampling matter (disk, network, etc.)? Or does the CPU flamegraph provide enough for developers to solve the issue?

1. nikolay_sivko No.43630014

1. Regarding overhead — we ran a benchmark focused on performance impact rather than raw overhead [1]. TL;DR: we didn’t observe any noticeable impact at 10K RPS. CPU usage stayed around 200 millicores (about 20% of a single core).

2. Coroot’s agent captures pseudo-traces (individual spans) and sends them to a collector via OTLP. This stream can be sampled at the collector level. In high-load environments, you can disable span capturing entirely and rely solely on eBPF-based metrics for analysis.

3. We’ve built automated root cause analysis to help users explain even the slightest anomalies, whether or not SLOs are violated. Under the hood, it traverses the service dependency graph and correlates metrics — for example, linking increased service latency to CPU delay or network latency to a database. [2]

4. Currently, Coroot doesn’t support off-CPU profiling. The profiler we use under the hood is based on Grafana Pyroscope’s eBPF implementation, which focuses on CPU time.
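
To make the collector-level sampling mentioned in point 2 concrete, here is a minimal Python sketch of one common way to sample a span stream: hash each trace ID and keep the span if the hash falls below a threshold. This is an illustration only, not Coroot's or the OpenTelemetry Collector's implementation; the function names `keep_span` and `sample_spans`, and the dict-shaped spans with a `trace_id` key, are assumptions made for the example.

```python
import hashlib


def keep_span(trace_id: str, sample_ratio: float) -> bool:
    """Decide deterministically whether to keep a span, keyed by its trace ID.

    Hypothetical helper: the same trace ID always yields the same decision,
    and roughly `sample_ratio` of all trace IDs are kept.
    """
    if not 0.0 <= sample_ratio <= 1.0:
        raise ValueError("sample_ratio must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the span if its bucket falls below the scaled threshold.
    return bucket < int(sample_ratio * 2**64)


def sample_spans(spans, sample_ratio):
    """Filter a list of span dicts (assumed to carry a 'trace_id' key)."""
    return [s for s in spans if keep_span(s["trace_id"], sample_ratio)]


if __name__ == "__main__":
    spans = [{"trace_id": f"trace-{i}", "name": "GET /"} for i in range(10_000)]
    kept = sample_spans(spans, 0.1)
    print(f"kept {len(kept)} of {len(spans)} spans")
```

Hashing the trace ID rather than drawing a random number means every collector instance makes the same decision for the same ID, so spans that share an ID are either all kept or all dropped.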

[1]: https://docs.coroot.com/installation/performance-impact

[2]: https://demo.coroot.com/p/tbuzvelk/anomalies/default:Deploym...
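
The idea in point 3, walking the service dependency graph and correlating metrics, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Coroot's actual algorithm: the graph and metrics are plain dicts, the names `pearson` and `rank_suspects` are hypothetical, and Pearson correlation stands in for whatever correlation method the product really uses.

```python
from collections import deque
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_suspects(graph, metrics, service, target_metric="latency"):
    """Breadth-first walk from `service` through its dependencies, scoring every
    metric by the absolute correlation with the service's target metric."""
    target = metrics[service][target_metric]
    suspects = []
    seen = {service}
    queue = deque([service])
    while queue:
        node = queue.popleft()
        for name, series in metrics.get(node, {}).items():
            if node == service and name == target_metric:
                continue  # do not correlate the target with itself
            suspects.append((abs(pearson(target, series)), node, name))
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    # Strongest correlation first.
    return sorted(suspects, reverse=True)


if __name__ == "__main__":
    # Made-up example data: checkout latency spikes together with
    # network latency to postgres.
    graph = {"checkout": ["postgres", "cache"], "postgres": [], "cache": []}
    metrics = {
        "checkout": {"latency": [10, 11, 10, 40, 42, 11], "cpu_delay": [1, 1, 1, 1, 1, 1]},
        "postgres": {"network_latency": [2, 2, 2, 9, 10, 2]},
        "cache": {"network_latency": [1, 2, 1, 2, 1, 2]},
    }
    for score, node, name in rank_suspects(graph, metrics, "checkout"):
        print(f"{score:.2f}  {node}.{name}")
```

In this made-up data set the postgres network latency ranks first, which mirrors the "network latency to a database" example in the reply. Correlation alone does not prove causation, so a ranking like this is a starting point for investigation rather than a verdict.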