I won’t single out Datadog on this because the exact same thing happens with cloud spend, and it’s very literally burning money.
It is not hard to spin up Grafana and VictoriaMetrics (and now VictoriaLogs) and keep them running. It is not hard to build a Grafana dashboard that correlates data across both metrics and logs sources, and alerting functionality is pretty good now.
The "heavy lift" is instrumenting your applications and infrastructure to provide valuable metrics and logs without exceeding a performance budget. I'm skeptical that Datadog actually does much of that heavy lifting, or that it is worth the money. You can probably cut the bill by 10x with the same or better outcomes by paying for managed Grafana, managed DBs, and a couple of FTEs as observability experts.
2. Management doesn’t get recognized for working on undifferentiated features.
3. Engineers working on undifferentiated features aren’t recognized when looking for new jobs.
Saving money makes sense, but getting people to actually prioritize it is hard.
Even from a pure zero-sum mathematical perspective, it can make sense to invest as much as 2 or 3 months of engineer time in cloud cost-saving measures. If the engineer is making $200K, that's roughly a $33,000 to $50,000 investment. When you see the eye-watering cloud bills many startups have, you realize that investment is peanuts compared to the potential savings over the next several years.
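To make the arithmetic concrete, here is a rough sketch of the break-even calculation. The $200K salary and the 2 to 3 month range come from the comment above; the $100K monthly cloud bill and the 10% savings rate are purely hypothetical numbers picked for illustration, and the cost uses raw salary only (no benefits or overhead).

```python
def engineer_cost(annual_salary: float, months: float) -> float:
    """Salary cost of dedicating an engineer for a number of months."""
    return annual_salary * months / 12


def break_even_months(investment: float, monthly_savings: float) -> float:
    """Months of savings needed to pay back the up-front investment."""
    return investment / monthly_savings


salary = 200_000                    # from the comment above
low = engineer_cost(salary, 2)      # about 33,333
high = engineer_cost(salary, 3)     # 50,000

# Hypothetical assumptions, not figures from this thread:
monthly_bill = 100_000              # assumed monthly cloud spend
savings_rate = 0.10                 # assumed fraction of the bill saved
monthly_savings = monthly_bill * savings_rate

print(f"Investment: ${low:,.0f} to ${high:,.0f}")
print(f"Break-even: {break_even_months(low, monthly_savings):.1f} to "
      f"{break_even_months(high, monthly_savings):.1f} months")
```

Under those assumed numbers the project pays for itself in well under a year. Plug in your own bill and a savings rate you think is realistic to see how the payback period moves.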
And then you also have to keep in mind that these things are usually not actually zero-sum. The engineer could be new, and working on the efficiency project helps them onboard to your stack. It could be the case that customers are complaining (or could start complaining in the future) about how slow your product is, so you actually improve the product by improving the infrastructure. Or it could just be the very common case that there isn't actually a higher-value thing for that engineer to be working on at that time.
If Jira has taught me anything, it's that ignoring customers when they complain it's too slow makes financial sense.
These days I'd suggest you just suck it up, spin up a Grafana box, and wire up OpenTelemetry.