How We Found 7 TiB of Memory Just Sitting Around

1. Aeolun ◴[01 Nov 25 02:54 UTC] No.45778883[source]▶

I read this and I have to wonder: did anyone ever think it was reasonable that a cluster that apparently needed only 120GB of memory was consuming 1.2TB just for logging (or whatever Vector does)?

replies(4): >>45779398 #>>45779624 #>>45779726 #>>45780688 #

2. bstack ◴[01 Nov 25 05:08 UTC] No.45779398[source]▶

>>45778883 (TP) #

Author here: You’d be surprised what you don’t notice given enough nodes and slow enough resource growth over time! Even at the high-water mark for this daemonset, it was still a small portion of the total resource usage in these clusters.

replies(2): >>45779736 #>>45779879 #

3. devjab ◴[01 Nov 25 06:21 UTC] No.45779624[source]▶

>>45778883 (TP) #

We're a much smaller-scale company and the money we lose on these things is insignificant compared to what's in this story. Yesterday I was improving the process for creating databases in our Azure and I stumbled upon a subscription which was running 7 MSSQL servers for 12 databases. These weren't elastic and they were each paying a license that we don't have to pay because we qualify for the base cost through our contract with our Microsoft partner. This company has some of the tightest control over their cloud infrastructure out of any organisation I've worked with.

This is anecdotal, but if my experiences aren't unique then there is a real lack of reasonableness in DevOps.

replies(1): >>45780582 #

4. fock ◴[01 Nov 25 06:53 UTC] No.45779726[source]▶

>>45778883 (TP) #

we have on-prem with heavy spikes (our batch workload can utilize the 20TB of memory in the cluster easily) and we just don't care much and add 10% every year to the hardware requested. Compared to employing people or paying other vendors (relational databases with many TB-sized tables...) this is just irrelevant.

Sadly devs are incentivized by that and going towards the cloud might be a fun story. Given the environment I hope they scrap the effort sooner rather than later, buy some Oxide systems for the people who need to iterate faster than the usual process of getting a VM and replace/reuse the 10% of the company occupied with the cloud (mind you: no real workload runs there yet...) to actually improve local processes...

replies(1): >>45783714 #

5. fock ◴[01 Nov 25 06:58 UTC] No.45779736[source]▶

>>45779398 #

how large are the clusters then?

6. Aeolun ◴[01 Nov 25 07:39 UTC] No.45779879[source]▶

>>45779398 #

I’m not sure if that makes it better or worse.

replies(2): >>45780858 #>>45785752 #

7. ffsm8 ◴[01 Nov 25 10:22 UTC] No.45780582[source]▶

>>45779624 #

Isn't that mostly down to the fact the vast majority of devs explicitly don't want to do anything wrt Ops?

DevOps has - ever since its originally well-meaning inception (by Netflix iirc?) - been implemented across our industry as an effective cost-cutting measure, forcing devs that didn't see it as their job to also handle it.

Which consequently means they're not interfacing with it whatsoever. They do as little as they can get away with, which inevitably means things are being done with borderline malicious compliance... Or just complete incompetence.

I'm not even sure I'd blame these devs in particular. The devs just saw it as a quick bonus generator for the MBA in charge of this rebranding while offloading more responsibilities onto their shoulders.

DevOps made total sense in the work culture where this concept was conceived - Netflix was well known at that point to only ever employ senior devs. However, in the context of the average 9-5 dev, who often knows a lot less than even some enthusiastic Jrs... Let's just say that it's incredibly dicey whether it's successful in practice.

replies(2): >>45780998 #>>45784145 #

8. formerly_proven ◴[01 Nov 25 10:51 UTC] No.45780688[source]▶

>>45778883 (TP) #

It probably doesn't help that the first line of treatment for any error is to blindly increase memory request/limit and claim it's fixed (preferably without looking at the logs once).

9. embedding-shape ◴[01 Nov 25 11:30 UTC] No.45780858{3}[source]▶

>>45779879 #

I didn't know what Render was when I skimmed the article at first, but after reading these comments, I had to check out what they do.

And they're a "Cloud Application Platform", meaning they manage deploys and infrastructure for other people. Their website says "Click, click, done." which is cool and quick and all, but to me it's kind of crazy that an organization that should be really engineering-focused and mature doesn't immediately notice 1.2TB being used and try to figure out why, when 120GB ended up being sufficient.

It gives much more of a "We're a startup, we're learning as we're running" vibe which again, cool and all, but hardly what people should use for hosting their own stuff.

replies(1): >>45791424 #

10. mustyoshi ◴[01 Nov 25 12:03 UTC] No.45780998{3}[source]▶

>>45780582 #

I politely disagree. I spent maybe 8 hours over a week rightsizing a handful of heavy deployments from a previous team and reduced their peak resource usage by implementing better scaling policies. Before the new scaling policy, the service would scale out and, quite frequently, new pods would remain idle and ultimately get terminated without ever responding to a request.

The service dashboards already existed, all I had to do was a bit of load testing and read the graphs.

It's not too much extra work to make sure you're scaling efficiently.

replies(1): >>45781223 #

11. ffsm8 ◴[01 Nov 25 12:46 UTC] No.45781223{4}[source]▶

>>45780998 #

You disagree but then cite another example of low-hanging fruit that nobody took action on until you came along?

Did you accidentally respond to the wrong comment? Because if anything you're giving another example of "most devs not wanting to interface with ops, hence letting it slide until someone bothers to pick up their slack"...

12. g-mork ◴[01 Nov 25 17:52 UTC] No.45783714[source]▶

>>45779726 #

Somewhat unrelated, but you just tied wasteful software design to high IT salaries, and also suggested a reason why Russian programmers might seem, on the whole, to be far more effective than we are.

I wonder: if MSFT had simply cut dev salaries by 50% in the 90s, would it have had any measurable effect on Windows quality by today?

13. FroshKiller ◴[01 Nov 25 18:39 UTC] No.45784145{3}[source]▶

>>45780582 #

The first time my director asked me if I'd ever heard of DevOps, I said, "Sure, doing two jobs for one paycheck." I'm a software developer, buddy. I write the programs. Leave me out of running them.

replies(1): >>45787072 #

14. antoniojtorres ◴[01 Nov 25 21:53 UTC] No.45785752{3}[source]▶

>>45779879 #

It seems realistic to me, commonplace even. Lots to do in a company like this one.

15. jiggawatts ◴[02 Nov 25 01:18 UTC] No.45787072{4}[source]▶

>>45784145 #

> Leave me out of running them.

This is how customers end up with too-expensive Rube Goldberg machines.

You have to take some interest in how your code will run in production, even if you don't personally "operate" it.

16. Anonbrit ◴[02 Nov 25 16:25 UTC] No.45791424{4}[source]▶

>>45780858 #

If your report for the month is "I saved a terabyte of RAM usage across our cluster estate!" and I as a manager do some quick maths and say great, that's our income from 2 median customers. We lost 8 customers because we didn't launch feature foo in time, which is what you were supposed to be working on, so your contribution for the month is a massive loss to the company...

Does that frame things differently? There are times in your product lifecycle when you don't want your developers looking at things like this, and times when you do.