
205 points by anurag | 2 comments
1. liampulles No.45781071
I'm a little surprised that it got to the point where pods which should consume a couple of MB of RAM were consuming 4GB before action was taken. But I can also kind of understand it, because the way k8s operators (apps running in k8s that manipulate k8s resources) are meant to run is essentially a loop of listing resources, comparing them to spec, and making moves to bring the state of the cluster closer to spec. This reconciliation loop is simple to understand (and I think that simplicity has led to the creation of a wide array of excellent open source and proprietary operators that can be added to clusters), but it's also a recipe for cascading explosions in resource usage.
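
To make that loop concrete, here is a minimal sketch in Go (using client-go) of the naive "list, compare, act" pattern described above. It is an illustration only: the 30-second interval, the choice of Deployments, and the elided "make moves" step are assumptions, not taken from any particular operator.

    // Naive reconciliation loop: re-list everything, diff against spec, act.
    package main

    import (
        "context"
        "log"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }

        ctx := context.Background()
        for {
            // Every pass re-fetches the full object list from the API server;
            // this is where memory and API load grow with cluster size.
            deploys, err := client.AppsV1().Deployments("").List(ctx, metav1.ListOptions{})
            if err != nil {
                log.Printf("list failed: %v", err)
            } else {
                for _, d := range deploys.Items {
                    // Compare observed state to desired spec and issue
                    // create/update/delete calls (elided) to converge.
                    _ = d
                }
            }
            time.Sleep(30 * time.Second)
        }
    }

Real operators usually avoid re-listing like this by using informers (see the reply below), but even cached objects add up once an operator watches many large resource types.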

These kinds of resource explosions are something I see all the time in k8s clusters. The general advice is to keep pressure off the k8s API, and the consequence is that one must be very minimal and tactical with the operators one installs, and then spend many hours fine-tuning each operator to run efficiently (e.g. Grafana, whose default Helm settings do not use the recommended log indexing algorithm, and which needs to be tweaked to get an appropriate split of read vs. write pods for your situation).
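
One small, hedged illustration of the "keep pressure off the API" advice: a client can ask the API server for less data per pass by filtering with a label selector and paginating the list instead of pulling every object at once. The selector and page size below are made-up values.

    // Filtered, paginated listing to reduce API server and client memory load.
    package main

    import (
        "context"
        "log"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }

        ctx := context.Background()
        opts := metav1.ListOptions{
            // Hypothetical label: only fetch objects this operator manages.
            LabelSelector: "app.kubernetes.io/managed-by=my-operator",
            Limit:         500, // page size, so no single response is huge
        }
        for {
            pods, err := client.CoreV1().Pods("").List(ctx, opts)
            if err != nil {
                log.Fatal(err)
            }
            log.Printf("got a page of %d pods", len(pods.Items))
            if pods.Continue == "" {
                break // last page
            }
            opts.Continue = pods.Continue
        }
    }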

Again, I recognize there is a tradeoff here - the simplicity and openness of the k8s API is what has led to a proliferation of new operators, which really has allowed one to run "their own cloud". But there is definitely a cost. I don't know what the solution is, and I'm curious to hear from people who have other views on it, or who use alternatives to k8s that offer a different set of tradeoffs.

replies(1): >>45783091 #
2. never_inline No.45783091
> are meant to run is essentially a loop of listing resources, comparing them to spec, and making moves to bring the state of the cluster closer to spec.

Aren't they supposed to use watch/long polling?
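
For reference, the common client-go/controller-runtime pattern is indeed an initial list followed by a watch (long polling / streaming), wrapped up in shared informers that keep a local cache and hand the controller events to react to. A rough sketch, with the 10-minute resync period and the pod logging chosen only for illustration:

    // Watch-based pattern via client-go shared informers (list once, then watch).
    package main

    import (
        "log"
        "time"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/client-go/tools/cache"
    )

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }

        factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
        podInformer := factory.Core().V1().Pods().Informer()
        podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            AddFunc: func(obj interface{}) {
                pod := obj.(*corev1.Pod)
                log.Printf("observed pod %s/%s", pod.Namespace, pod.Name)
            },
            UpdateFunc: func(oldObj, newObj interface{}) {
                // A real controller would enqueue reconcile work here
                // instead of re-listing the whole cluster.
            },
        })

        stop := make(chan struct{})
        defer close(stop)
        factory.Start(stop)
        cache.WaitForCacheSync(stop, podInformer.HasSynced)
        select {} // block; a real controller would run worker goroutines here
    }

Note that the informer's cache still holds every watched object in memory, so watch-based controllers can also balloon if they watch large or numerous resource types - which is consistent with the parent's observation about resource explosions.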