The migration process they described would be less painful with k8s, especially with 2100+ nodes/VMs.
At that point, if you're reaching in and scripting your pods to do what you want, you lose a lot of the benefits of convention and reusability that k8s promotes.
The current bottleneck appears to be etcd; boltdb is just a crappy data store. I would really like to try replacing boltdb with something like sqlite or rocksdb as the data persistence layer in etcd, but that is non-trivial.
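To make "non-trivial" a bit more concrete: etcd talks to its storage through a transactional key-value layer, so the experiment starts with standing up the same shape of interface over sqlite. A minimal sketch of that idea in Go (the KV interface and table layout here are hypothetical illustrations, not etcd's actual backend API; the driver is the usual mattn/go-sqlite3):

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/mattn/go-sqlite3" // sqlite driver
    )

    // KV is a minimal key-value surface, loosely the shape a storage
    // backend needs to offer. Hypothetical; not etcd's real interface.
    type KV interface {
        Put(key, value []byte) error
        Get(key []byte) ([]byte, error)
    }

    type sqliteKV struct{ db *sql.DB }

    func openSQLiteKV(path string) (*sqliteKV, error) {
        db, err := sql.Open("sqlite3", path)
        if err != nil {
            return nil, err
        }
        // One table standing in for a bolt bucket; key is the primary key.
        _, err = db.Exec(`CREATE TABLE IF NOT EXISTS kv (k BLOB PRIMARY KEY, v BLOB NOT NULL)`)
        if err != nil {
            return nil, err
        }
        return &sqliteKV{db: db}, nil
    }

    func (s *sqliteKV) Put(key, value []byte) error {
        // Upsert so a rewrite of the same key overwrites the old value.
        _, err := s.db.Exec(`INSERT INTO kv (k, v) VALUES (?, ?)
            ON CONFLICT(k) DO UPDATE SET v = excluded.v`, key, value)
        return err
    }

    func (s *sqliteKV) Get(key []byte) ([]byte, error) {
        var v []byte
        err := s.db.QueryRow(`SELECT v FROM kv WHERE k = ?`, key).Scan(&v)
        return v, err
    }

    func main() {
        kv, err := openSQLiteKV("demo.db")
        if err != nil {
            log.Fatal(err)
        }
        if err := kv.Put([]byte("foo"), []byte("bar")); err != nil {
            log.Fatal(err)
        }
        v, err := kv.Get([]byte("foo"))
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%s\n", v) // bar
    }

The sketch is the easy part; the real work is reproducing bolt's snapshot and transaction semantics that etcd's MVCC layer depends on, which is exactly why the swap is non-trivial.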
You also start seeing issues where certain k8s operators do not scale either; for example, cilium currently cannot scale past 5k nodes. There are fundamental design issues where the cilium daemonset's memory usage scales with the number of pods/endpoints in the cluster, so in large clusters the cilium daemonset can be using multiple gigabytes of RAM on every node. https://docs.cilium.io/en/stable/operations/performance/scal...
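To put rough numbers on why per-agent memory that scales with cluster-wide endpoints hurts (the node count, pods per node, and per-endpoint byte count below are illustrative assumptions, not measured cilium figures):

    package main

    import "fmt"

    func main() {
        // Illustrative assumptions, not measured cilium numbers:
        const nodes = 5000                 // around cilium's stated limit
        const podsPerNode = 30
        const bytesPerEndpoint = 16 * 1024 // assumed per-endpoint state in each agent

        endpoints := int64(nodes * podsPerNode) // cluster-wide endpoint count
        perNode := endpoints * bytesPerEndpoint // every agent tracks every endpoint
        clusterWide := perNode * nodes          // and every node runs an agent

        fmt.Printf("per-node agent state:  ~%.1f GiB\n", float64(perNode)/(1<<30))
        fmt.Printf("cluster-wide overhead: ~%.1f TiB\n", float64(clusterWide)/(1<<40))
    }

Under those assumptions each agent holds ~2.3 GiB just for endpoint state, and since every node pays that cost, the cluster-wide total grows quadratically as the cluster grows.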
Anyways, the TL;DR is that at this scale (16k nodes) it is hard to run k8s.
Care to elaborate at all? Were they more like missing edge cases or absent core functionality? Not to imply that missing edge cases aren’t important when it comes to DB ops.