I wonder if an upgrade like this would be less painful if the db layer were containerized?
The migration process they described would be less painful with k8s, especially with 2100+ nodes/VMs.
replies(5):
The current bottleneck appears to be etcd; boltdb is just a crappy data store. I would really like to try replacing boltdb with something like sqlite or rocksdb as the data persistence layer in etcd, but that is non-trivial.
You also start seeing issues where certain k8s components do not scale either. For example, cilium cannot currently scale past 5k nodes. There are fundamental design issues where the cilium daemonset's memory usage scales with the number of pods/endpoints in the cluster, so in large clusters the cilium daemonset can be using multiple gigabytes of RAM on every node. https://docs.cilium.io/en/stable/operations/performance/scal...
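To see why that design hurts, here's a rough back-of-envelope sketch. The per-endpoint overhead figure below is purely a hypothetical number for illustration, not from the Cilium docs; the point is just that per-node agent memory grows with cluster-wide endpoint count, not with what runs on that node:

```python
# Sketch: if every daemonset agent tracks every endpoint in the cluster,
# its memory is O(total endpoints), replicated on EVERY node.
BYTES_PER_ENDPOINT = 20 * 1024  # hypothetical ~20 KiB of state per endpoint

def agent_memory_gib(total_endpoints: int) -> float:
    """Approximate per-node agent memory for a given cluster-wide endpoint count."""
    return total_endpoints * BYTES_PER_ENDPOINT / 2**30

# A 5k-node cluster at ~30 pods/node is ~150k endpoints:
print(f"{agent_memory_gib(150_000):.1f} GiB per node")  # ~2.9 GiB, on every node
```

Whatever the real per-endpoint constant is, the shape of the curve is the problem: doubling the cluster doubles the agent footprint on each individual node.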
Anyways, the TL;DR is that at this scale (16k nodes) it is hard to run k8s.