speedgoose ◴[] No.45661785[source]
Looking at the htop screenshot, I notice the lack of swap. You may want to enable earlyoom, so your whole server doesn't go down when a service goes bananas. The Linux Kernel OOM killer is often a bit too late to trigger.
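
If it helps, a minimal manual setup looks roughly like this (a sketch assuming a Debian/Ubuntu-style box; package name, unit name, and the defaults file can differ on other distros):

    apt install earlyoom
    systemctl enable --now earlyoom
    # thresholds live in /etc/default/earlyoom, e.g. start killing below 5% free RAM:
    # EARLYOOM_ARGS="-m 5 -s 100 --avoid '^(sshd|systemd)$'"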

You can also enable zram to compress RAM, so you can over-provision like the pros. A lot of long-running software leaks memory that compresses pretty well.
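
For zram, one quick route (a sketch using the Debian zram-tools package; systemd's zram-generator is another option) is:

    apt install zram-tools
    # size and algorithm are set in /etc/default/zramswap, e.g. ALGO=zstd, PERCENT=50
    systemctl enable --now zramswap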

Here is how I do it on my Hetzner bare-metal servers using Ansible: https://gist.github.com/fungiboletus/794a265cc186e79cd5eb2fe... It also works on VMs.

Bender ◴[] No.45662841[source]
Another option would be to have more memory than required (over-engineer) and to adjust the OOM score per app, adding early-kill weight to non-critical apps and negative weight to important apps. oom_score_adj is already set to -1000 by OpenSSH, for example.

    NSDJUST=$(pgrep -x nsd); echo -en '-378' > /proc/"${NSDJUST}"/oom_score_adj
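If nsd runs under systemd (assuming a unit named nsd.service), the same weight can also be persisted with a drop-in instead of poking /proc after every restart; a sketch:

    mkdir -p /etc/systemd/system/nsd.service.d
    printf '[Service]\nOOMScoreAdjust=-378\n' > /etc/systemd/system/nsd.service.d/oom.conf
    systemctl daemon-reload && systemctl restart nsd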
Another useful thing to do is effectively disable over-commit on all staging and production servers (a ratio of 0 rather than overcommit_memory = 2 to fully disable it; these do different things, and with overcommit_memory = 0 the kernel still uses its heuristic formula):

    vm.overcommit_memory = 0
    vm.overcommit_ratio = 0
Also set min_free and the reserved-memory values based on installed memory, using a formula from Red Hat that I do not have handy. min_free can vary from 512KB to 16GB depending on installed memory.

    vm.admin_reserve_kbytes = 262144
    vm.user_reserve_kbytes = 262144
    vm.min_free_kbytes = 1024000
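To keep all of those across reboots, one option (the file name is arbitrary) is to drop the same lines into /etc/sysctl.d and reload:

    printf '%s\n' \
        'vm.overcommit_memory = 0' \
        'vm.overcommit_ratio = 0' \
        'vm.admin_reserve_kbytes = 262144' \
        'vm.user_reserve_kbytes = 262144' \
        'vm.min_free_kbytes = 1024000' > /etc/sysctl.d/90-memory.conf
    sysctl --system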
At least that worked for me on about 50,000 physical servers for over a decade; they were not permitted to have swap, and installed memory varied from 144GB to 4TB of RAM. OOM would only occur when the people configuring and pushing code massively over-committed and did not account for memory required by the kernel, or did not follow the best practices defined by Java, but that's a much longer story.

Another option is to limit memory per application in cgroups but that requires more explaining than I am putting in an HN comment.
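
For what it's worth, the short version of the cgroup route on a systemd host (the unit name is just an example, and this skips all the nuance) is a one-liner per service:

    # hard cap plus an earlier reclaim-pressure threshold (cgroup v2)
    systemctl set-property myapp.service MemoryMax=4G MemoryHigh=3G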

Another useful thing is to never OOM kill in the first place on servers that are only doing things in memory and need not commit anything to disk. So don't do this on a disk-backed database; this is for ephemeral nodes that should self-heal. Wait 60 seconds so the DRAC/iLO can capture the crash message, and then earth-shattering kaboom...

    # cattle vs kittens, mooooo...
    kernel.panic = 60
    vm.panic_on_oom = 2
As a funny side note, those options can also be used as a holy hand grenade to intentionally (and unsafely) reboot NFS diskless farms when failing over to entirely different NFS server clusters: set panic to 15 minutes, trigger an OOM panic by setting min_free to 16TB at the command line via Ansible (not in sysctl.conf), swap clusters, ride out the ARP storm, and reconverge.
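
A rough sketch of that trigger as an ad-hoc run (host pattern and timings are illustrative; the values are only set at runtime, not written to sysctl.conf):

    ansible nfs_diskless -b -m ansible.builtin.shell -a \
        'sysctl -w kernel.panic=900 vm.panic_on_oom=2 vm.min_free_kbytes=17179869184'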
benterix ◴[] No.45666014[source]
The lengths people will go to to avoid k8s... (it's very easy on Hetzner Cloud, BTW).
1. carlhjerpe ◴[] No.45669059[source]
Every ClusterAPI infrastructure provider is similarly easy? Or what makes Hetzner Kubernetes extra easy?
2. benterix ◴[] No.45669941[source]
I mentioned Hetzner only because the original article mentions it. To be fair, it is currently harder to use than any managed k8s offering because you need to deploy the control plane yourself (but fortunately there are several projects that make it as easy as it can be, and this is what I was referring to).