We recently did the same, and our Datadog bill was only five figures. We're finding the new stack isn't a poor man's anything; it's more flexible, complete, and manageable than yet another SaaS. With just a little extra learning curve, observability is a domain where open source trounces proprietary, and not just if you don't have money to set on fire.
Crazy that they spent so much on observability. Even with Datadog they could've optimized that spend. Datadog does a lot of questionable things with billing: by default, and especially with on-demand instances, you get charged significantly more than you should because their counting of instance hours and instances is (was?) pretty deficient.
For example, rather than run the agent (which counts as an instance even if it's only up for a minute), you can send logs, metrics, etc. directly to their ingestion endpoints, so those instances aren't counted toward usage beyond the logs and metrics themselves.
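For reference, a rough sketch of what that direct submission looks like in Python (the metric, tag, and service names are made up, and the payload shapes are from memory, so check the current API docs):

```python
import time
import requests

DD_API_KEY = "..."  # your Datadog API key

# Submit a metric point straight to the intake API instead of via the agent.
requests.post(
    "https://api.datadoghq.com/api/v1/series",
    headers={"DD-API-KEY": DD_API_KEY},
    json={
        "series": [{
            "metric": "myapp.jobs.processed",   # made-up metric name
            "type": "count",
            "points": [[int(time.time()), 42]],
            "tags": ["env:prod", "service:worker"],
        }]
    },
).raise_for_status()

# Logs can go straight to the HTTP intake as well.
requests.post(
    "https://http-intake.logs.datadoghq.com/api/v2/logs",
    headers={"DD-API-KEY": DD_API_KEY},
    json=[{"message": "job finished", "service": "worker", "ddsource": "python"}],
).raise_for_status()
```

No host ever runs the agent, so nothing gets counted as an infrastructure instance; you only pay for the logs and custom metrics you actually ship.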
Maybe at that level they don't even bill by actual usage anymore, and just negotiate arbitrary amounts for some absurd quota of use.
I won’t single out Datadog on this because the exact same thing happens with cloud spend, and it’s very literally burning money.
It is not hard to spin up Grafana and VictoriaMetrics (and now VictoriaLogs) and keep them running. It is not hard to build a Grafana dashboard that correlates data across both metrics and logs sources, and alerting functionality is pretty good now.
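Both of those expose plain HTTP query APIs, so the "correlate metrics and logs" part doesn't need anything exotic. A minimal sketch, assuming default ports and the documented endpoints (the LogsQL query syntax here is from memory, so verify against the docs):

```python
import requests

# VictoriaMetrics speaks the Prometheus query API.
metrics = requests.get(
    "http://victoriametrics:8428/api/v1/query",
    params={"query": 'sum(rate(http_requests_total{job="api"}[5m]))'},
).json()

# VictoriaLogs exposes a LogsQL query endpoint.
logs = requests.get(
    "http://victorialogs:9428/select/logsql/query",
    params={"query": "error _time:5m", "limit": 20},
)

print(metrics["data"]["result"])
print(logs.text)
```

This is essentially what a Grafana datasource does behind the panels, which is why a mixed metrics-plus-logs dashboard is straightforward.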
The "heavy lift" is instrumenting your applications and infrastructure to provide valuable metrics and logs without exceeding a performance budget. I'm skeptical that Datadog actually does much of that heavy-lifting and that they are actually worth the money. You can probably save 10x with same/better outcomes by paying for managed Grafana + managed DBs and a couple FTEs as observability experts.
Those are some pretty heroic assumptions. In particular, they assume the only options are Datadog or nothing, when there are far cheaper alternatives like the Prometheus/Grafana/Clickhouse stack mentioned in the article itself.
I saw this a lot at a previous company. Being able to just "have more Lambdas scale up to handle it" got some very mediocre engineers past challenges they encountered. But it did so at the cost of wasting VAST amounts of money and saddling themselves with tech debt that completely hobbled the company's ability to scale.
It was very frustrating to be too junior to be able to change minds, even on basic things like "I know it worked for you with old on-prem NFS designs, but we shouldn't be storing our data in 100KB files in S3 and firing off thousands of Lambda invocations to process workloads; we should be storing it in 100MB files and using industry-leading ETL frameworks on it." They were old-school guys who hadn't adjusted to best practices for object storage and modern large-scale data loads (this was a 1M events per second system), and so the company never really succeeded despite thousands of customers and loads of revenue.
I consider cost awareness and profiling to be essential skills for any engineer working in cloud-style environments, but it's especially important that a staff engineer, or someone in a similar position, have this skill set and be ready to grill people who come up with wasteful solutions.
Does anyone have such an experience with Datadog? A few million wasn't enough to get them to talk about anything; we always paid list price, and there was no negotiating when they restructured their pricing either.
Am I misunderstanding, or is the author saying it's better to spend $10m than $9m?
Most startups are not going to have anywhere near the scale to generate anything approaching this bill.
> I won’t single out Datadog on this because the exact same thing happens with cloud spend, and it’s very literally burning money.
Unless you're in the business of deploying and maintaining production-ready datacenters at scale, it very literally isn't.
I have found Datadog to be, hands down, the best developer experience from the get-go; the way it glues its mostly decent products together is unparalleled compared to other offerings (Grafana Cloud/LGTM). I usually say that if you're a small to medium-scale business it just makes sense, IF you understand the product and configure it correctly, which is reasonably easy. The seamless integration between tracing, logging, and metrics in the platform, which you can then easily combine with alerts, is great. However, it's easy to misconfigure it and spend a lot of money on seemingly nothing. If you do not implement tracing and structured logs (at the right volume and level) with trace/span IDs carried all the way through your services, it's hard to see the value and it seems expensive. It requires some good knowledge and configuration of the product to make it pay off. The rest of the product features are generally good; for example, their security suite is a good entry level to cloud security monitoring and SIEM too.
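For what it's worth, the trace/span ID plumbing described above doesn't have to be mysterious; here's a rough sketch of stamping IDs onto structured log lines using the OpenTelemetry API (field names are illustrative, and Datadog's own libraries can inject their equivalents for you when configured):

```python
import json
import logging
from opentelemetry import trace

class TraceJsonFormatter(logging.Formatter):
    """Emit JSON log lines carrying the current trace/span IDs."""

    def format(self, record: logging.LogRecord) -> str:
        ctx = trace.get_current_span().get_span_context()
        return json.dumps({
            "message": record.getMessage(),
            "level": record.levelname,
            "logger": record.name,
            # Hex-encode the IDs so the log and trace backends can join on them.
            "trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x"),
        })

handler = logging.StreamHandler()
handler.setFormatter(TraceJsonFormatter())
logging.getLogger().addHandler(handler)
```

Once every service does this consistently, the "click from a trace to the exact log lines" experience works whether the backend is Datadog or something self-hosted.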
However, when you get to a certain scale, the cost of APM and infrastructure hosts in Datadog can become somewhat prohibitive. Datadog's custom metrics pricing is also expensive, and its query language doesn't quite match the power of PromQL, yet you start finding yourself needing that power to debug issues. At that point the self-hosted LGTM stack starts to make sense, but it involves a lot more education for end users, both in integration (a little less now that OTel is popular) and in querying and building dashboards, plus running it yourself. The Grafana Cloud platform is more attractive on that front, though.
2. Management doesn’t get recognized for working on undifferentiated features.
3. Engineers working on undifferentiated features aren’t recognized when looking for new jobs.
Saving money “makes” sense but getting people to actually prioritize it is hard.
Even from a pure zero-sum mathematical perspective, it can make sense to invest as much as 2 or 3 months of engineer time on cloud cost-saving measures. If the engineer is making $200K, that's a $30,000 to $50,000 investment. When you see the eye-watering cloud bills many startups have, you realize that investment is peanuts compared to the potential savings over the next several years.
And then you also have to keep in mind that these things are usually not actually zero-sum. The engineer could be new, and working on the efficiency project helps them onboard to your stack. It could be that customers are complaining (or could start complaining in the future) about how slow your product is, so you actually improve the product by improving the infrastructure. Or it could just be the very common case that there isn't a higher-value thing for that engineer to be working on at the time.
I don't know if I would call them mediocre, but without a feedback loop it's hard to get engineers to agree on whether it's worth the time to review the code and make it faster versus just making the DB one size larger.
If Jira has taught me anything, it's that ignoring customers when they complain it's too slow makes financial sense.
These days I'd suggest to just suck it up, spin up a Grafana box, and wire up OpenTelemetry.
I believe a much more useful question to ask is just “is this the highest and best use of my finite attention and time?” It is much easier to find $100,000 a year of budget than it is to find an additional $50,000 worth of skilled[1] developer time.
[1] This skilled part is critical because if you have some flunky create your “SaaS alternative” you are in for an even worse time.
Yeah, one of my big pet peeves is when engineering teams build platforms to run things on that obscure the cost. There have been times where they said, "Hey, we made this big platform for analytics; just ship your stuff as configuration changes and it's deployed!" Then when I used it for very simple, small cases, some unoptimized stuff on their end (a lot of what I talked about before) resulted in runaway costs that they, of course, tagged to my team.
Ultimately, you can only control what's in your scope; for anything else, you have to hope that management can take that runaway-cost feedback and make the right team optimize it away.
> I don't know if I would call them mediocre, but without a feedback loop it's hard to get engineers to agree on whether it's worth the time to review the code and make it faster versus just making the DB one size larger.
This started in the mid-2010s, by which point they should have understood that you don't put terabytes of data into S3 as 100KB files. And if not, they should have been willing to take some very simple steps to address it (literally just bundling everything into 100MB files with an index file containing the byte offsets of the individual records would have solved a lot of their problems; see the sketch below). There was a feedback loop; there just happened to be big egos more interested in their next fun project of reinventing another solution to another solved problem. I learned there that engineering-driven companies sometimes wind up in situations where the staff engineers love fun new database and infrastructure projects more than they enjoy improving their existing product.
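To make the bundling idea concrete, here's a rough sketch with boto3; the bucket name, key scheme, and index format are made up for illustration:

```python
import io
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"  # placeholder

def write_bundle(bundle_key: str, records: list[tuple[str, bytes]]) -> None:
    """Pack many small payloads into one large object plus a byte-offset index."""
    body = io.BytesIO()
    index = {}
    for record_id, payload in records:
        start = body.tell()
        body.write(payload)
        index[record_id] = [start, body.tell() - start]  # offset, length
    s3.put_object(Bucket=BUCKET, Key=bundle_key, Body=body.getvalue())
    s3.put_object(Bucket=BUCKET, Key=bundle_key + ".index.json",
                  Body=json.dumps(index).encode())

def read_record(bundle_key: str, record_id: str) -> bytes:
    """Fetch one record with a ranged GET instead of one S3 object per record."""
    index = json.loads(
        s3.get_object(Bucket=BUCKET, Key=bundle_key + ".index.json")["Body"].read())
    start, length = index[record_id]
    byte_range = f"bytes={start}-{start + length - 1}"
    return s3.get_object(Bucket=BUCKET, Key=bundle_key, Range=byte_range)["Body"].read()
```

One 100MB object plus one small index is two PUTs instead of a thousand, and downstream jobs can stream the whole bundle sequentially instead of firing a Lambda per 100KB file.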
But what about the other open source solutions that are already trying hard to become an out-of-the-box solution for observability? Things like Netdata, HyperDX, Coroot, etc. are already platforms for all telemetry signals, with fancy UIs and lots of presets. Why don't people use them instead of Datadog?
Grafana isn't quite as featureful as Datadog, though there's nothing stopping you from getting the job done.
> But, it requires engineering time to set up these tools
At some price point, you have to wonder if it doesn't make more sense to hire engineers to get it just right for your use case. I'd bet that price point is less than $65MM. Hell, you could have people full-time on Grafana to add features you want.