804 points jryio | 6 comments
speedgoose No.45661785
Looking at the htop screenshot, I notice the lack of swap. You may want to enable earlyoom, so your whole server doesn't go down when a service goes bananas. The Linux kernel's OOM killer often triggers a bit too late.

You can also enable zram to compress RAM, so you can over-provision like the pros do. A lot of long-running software leaks memory that compresses pretty well.
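One way to set this up on a systemd-based distro is zram-generator; a sketch, assuming the package is installed (the size and compression algorithm shown are illustrative defaults, not requirements):

```shell
# Create a compressed swap device in RAM sized at half of physical memory.
cat > /etc/systemd/zram-generator.conf <<'EOF'
[zram0]
zram-size = ram / 2
compression-algorithm = zstd
EOF

# Bring the device up without rebooting, then confirm it appears.
systemctl daemon-reload
systemctl start systemd-zram-setup@zram0.service
swapon --show
```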

Here is how I do it on my Hetzner bare-metal servers using Ansible: https://gist.github.com/fungiboletus/794a265cc186e79cd5eb2fe... It also works on VMs.

1. Bender No.45662841
Another option would be to have more memory than required (over-engineer) and to adjust the OOM score per app, adding early-kill weight to non-critical apps and negative weight to important apps. OpenSSH, for example, already sets oom_score_adj to -1000.

    for pid in $(pgrep -x nsd); do echo -378 > /proc/"${pid}"/oom_score_adj; done
Another useful thing is to effectively disable over-commit on all staging and production servers (ratio 0 rather than overcommit_memory 2 to fully disable; these do different things, and memory mode 0 still uses the heuristic formula):

    vm.overcommit_memory = 0
    vm.overcommit_ratio = 0
Also set min_free and reserved memory based on installed memory, using a formula from Red Hat that I do not have handy. min_free can vary from 512KB to 16GB depending on installed memory.

    vm.admin_reserve_kbytes = 262144
    vm.user_reserve_kbytes = 262144
    vm.min_free_kbytes = 1024000
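However you derive the values, the usual way to persist settings like these across reboots is a drop-in under /etc/sysctl.d (the file name below is arbitrary):

```shell
# Persist the memory tunables, then apply and verify without rebooting.
cat > /etc/sysctl.d/90-memory.conf <<'EOF'
vm.overcommit_memory = 0
vm.overcommit_ratio = 0
vm.admin_reserve_kbytes = 262144
vm.user_reserve_kbytes = 262144
vm.min_free_kbytes = 1024000
EOF

sysctl --system
sysctl vm.min_free_kbytes
```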
At least that worked for me on about 50,000 physical servers for over a decade; they were not permitted to have swap, and installed memory varied from 144GB to 4TB of RAM. OOM would only occur when the people configuring and pushing code massively over-committed and did not account for memory required by the kernel, or did not follow the best practices defined by Java, and that's a much longer story.

Another option is to limit memory per application with cgroups, but that requires more explaining than I am putting in an HN comment.
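That said, systemd makes a simple per-service cap fairly painless via cgroup v2; a sketch (the unit name and binary below are hypothetical):

```shell
# Hard-cap one service at 512 MiB. If it exceeds the limit, the kernel
# OOM-kills within this cgroup only, instead of taking down the host.
systemd-run --unit=capped-demo \
  -p MemoryMax=512M -p MemoryHigh=448M \
  /usr/bin/some-leaky-daemon
```

MemoryHigh throttles and reclaims before MemoryMax is hit, so the hard limit is a backstop rather than the first line of defense.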

Another useful thing is to never OOM kill in the first place on servers that only do things in memory and need not commit anything to disk. So don't do this on a disk-backed database; this is for ephemeral nodes that should self-heal. Wait 60 seconds so DRAC/iLO can capture the crash message, and then earth-shattering kaboom...

    # cattle vs kittens, mooooo...
    kernel.panic = 60
    vm.panic_on_oom = 2
As a funny side note, those options can also be used as a holy hand grenade to intentionally (unsafely) reboot NFS diskless farms when failing over to entirely different NFS server clusters: set panic to 15 minutes, trigger an OOM panic by setting min_free to 16TB at the command line via Ansible (not in sysctl.conf), swap clusters, ride out the ARP storm, and reconverge.
2. liqilin1567 No.45664799
Thanks for sharing, I think these are very useful suggestions.
3. benterix No.45666014
The lengths people will go to avoid k8s... (very easy on Hetzner Cloud BTW).
4. Bender No.45667154
That's a more complex path I avoided discussing when I referenced cgroups. When I started doing these things, kube clusters did not exist. These tips were for people using bare metal who have not decided as a company to go the k3s/k8s route. Some of these settings still apply to k8s physical nodes. The good people of Hetzner would be managing these settings on the bare metal that Kubernetes runs on, and would not likely want their k8s nodes getting all broken, sticky, and confused after a k8s daemon update results in memory leakage, billions of orphaned processes, etc...

Companies that use k3s/k8s may still have bare-metal nodes dedicated to a role such as databases, Ceph storage nodes, DMZ SFTP servers, or PCI hosts deemed out of scope for kube clusters, and of course any "kittens" such as Linux nodes turned into proprietary appliances by installing some proprietary application that will blow chunks if shimmed into k8s or any other type of abstraction layer.

5. carlhjerpe No.45669059
Every ClusterAPI infrastructure provider is similarly easy? Or what makes Hetzner Kubernetes extra easy?
6. benterix No.45669941
I mentioned Hetzner only because the original article mentions it. To be fair, it is currently harder to use than any managed k8s offering because you need to deploy the control plane yourself (but fortunately there are several projects that make it as easy as it can be, and that is what I was referring to).