    59 points hui-zheng | 22 comments
    1. kayodelycaon ◴[] No.40214769[source]
    Crash-only is really hard to implement if another system is involved that isn't crash-only. If you crash in the middle of a network request, you may not know what state the other system is in.

    I've had to deal with buggy mainframe software whose error messages bore no relation to whether, or how far, an operation had succeeded. (And no way to ask it after the fact...) Welcome to the special hell.

    replies(4): >>40215739 #>>40218514 #>>40219895 #>>40220818 #
    3. ashleyn ◴[] No.40215064[source]
    Crash recovery should probably be treated as a second line of defense against data loss, not the primary one. What are the stats on how often crash recovery fails?
    replies(2): >>40215261 #>>40215924 #
    4. cryptonector ◴[] No.40215261[source]
    Crashing more often would test that better.
    5. TheDudeMan ◴[] No.40215739[source]
    Idempotent APIs + sane timeouts + retries.
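
    Something like this, say, for a payments-style call (the api.charge client and its idempotency_key parameter are hypothetical; the point is that the key is generated once and reused verbatim on every retry):

        import time
        import uuid

        def charge_with_retries(api, amount_cents, max_attempts=5, timeout_s=2.0):
            # One idempotency key per logical operation: if we crash or time
            # out mid-request, retrying with the same key lets the server
            # deduplicate, so the charge is applied at most once.
            key = str(uuid.uuid4())
            for attempt in range(max_attempts):
                try:
                    return api.charge(amount_cents, idempotency_key=key,
                                      timeout=timeout_s)
                except TimeoutError:
                    # Outcome unknown: the request may or may not have landed.
                    # Safe to retry only because the API is idempotent.
                    time.sleep(min(2 ** attempt, 30))
            raise RuntimeError("outcome still unknown after retries")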
    6. eslaught ◴[] No.40215924[source]
    The entire point is that crash recovery fails because you rarely test it. By making it the one and only code path, by definition, you will be testing it all the time, so it is much more likely to work in the first place.

    (The obvious counterargument being that if there are different ways in which the software can crash, this is still not an adequate defense.)

    replies(1): >>40218595 #
    7. mjb ◴[] No.40216256[source]
    > Crash-only software is actually more reliable because it takes into account from the beginning an unavoidable fact of computing - unexpected crashes.

    This is a critical point for reliable single-machine systems, and for reliable distributed systems. Distributed systems avoid many classes of crashes through redundancy, allowing the overall system to recover (often with no impact) from the failure or crash of a single node. This provides an additional path to crash recovery: recovering from peers or replicas rather than from local state. In turn, this can simplify the tracking of local state (especially the kind of per-replica WAL or redo log accounting that database systems have to do), leading to improved performance and avoiding bugs.

    But, as with single-system crashes, distributed systems need to deal with their own reality: correlated failures. These can be caused by correlated infrastructure failures (power, cooling, etc), by operations (e.g. deploying buggy software), or by the very data they're processing (e.g. a "poison pill" that crashes all the redundant nodes at once). And so, like the crash-only case with single-system software, reliable distributed systems need to be designed to recover from these correlated failure cases.

    The constants are interestingly different, though. Single-system annual interrupt rates (AIR) are typically in the 1-10% range, while systems spread over multiple datacenters can feasibly see correlated failure rates several orders of magnitude lower. This could argue that having a "bad day" recovery path that's more expensive than regular node recovery is OK. Or, it could argue that the only feasible way of making sure that "bad day" recovery works is to exercise it often (which goes back to the crash-only argument).

    8. Liftyee ◴[] No.40217316[source]
    In embedded systems, watchdog timers are often used as a crash mechanism outside the software itself: if the program doesn't reset the timer in time, the watchdog resets the system. I found this concept of crash-only software pretty neat - time to see if I can apply it.
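
    A minimal sketch of the pattern on Linux, using the kernel's /dev/watchdog interface (the loop body here is a stand-in): once the device is open, the watchdog fires and resets the machine if the loop wedges and stops petting it.

        import os
        import time

        def do_one_unit_of_work():
            time.sleep(0.1)  # stand-in for the real main-loop body

        # Opening /dev/watchdog arms the watchdog; from here on, the kernel
        # (or hardware) resets the system unless we keep writing to it.
        fd = os.open("/dev/watchdog", os.O_WRONLY)
        while True:
            do_one_unit_of_work()
            os.write(fd, b"\0")  # "pet" the watchdog: still alive
            time.sleep(1)
        # Deliberately no magic-close ('V' write) and no os.close(fd): in a
        # crash-only design every exit path leads to a reset and recovery.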
    replies(1): >>40218518 #
    9. dinvlad ◴[] No.40218479[source]
    Man, I love Elixir even though I'm still trying to learn it
    10. 01HNNWZ0MV43FF ◴[] No.40218514[source]
    Tbf, isn't that equivalent to a network partition followed by rebooting or replacing one node? The network can always go down at any point in the middle of an operation.
    11. eternityforest ◴[] No.40218518[source]
    There's an ongoing discussion on the Systemd issue tracker about making better use of HW watchdogs: https://github.com/systemd/systemd/issues/21083
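
    For the software side, systemd already has a per-service watchdog: set WatchdogSec= in the unit and the service must ping systemd within that interval or get killed and restarted. A sketch of the ping, speaking the sd_notify datagram protocol directly (no library needed; a Type=notify service would also send b"READY=1" at startup):

        import os
        import socket
        import time

        def sd_notify(message: bytes) -> None:
            # systemd passes the notification socket path in $NOTIFY_SOCKET.
            path = os.environ.get("NOTIFY_SOCKET")
            if not path:
                return  # not running under systemd
            if path.startswith("@"):       # abstract-namespace socket
                path = "\0" + path[1:]
            with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
                sock.sendto(message, path)

        # With WatchdogSec=10 in the unit file, systemd expects a ping at
        # least every 10s; pinging at half the interval is the usual rule.
        while True:
            sd_notify(b"WATCHDOG=1")
            time.sleep(5)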
    12. watersb ◴[] No.40218595{3}[source]
    I always tell clients that backups are boring but restores are very exciting.

    If you make restores boring, then you're closer to resilience.

    If you have a backup procedure for mass storage, practice restoring that data.

    13. TheDudeMan ◴[] No.40219895[source]
    Your comment suggests that you believe crash-only software to be inherently less reliable than the alternative. But that is the opposite of the stated goal and supposed benefits.
    14. WhyNotHugo ◴[] No.40220818[source]
    Regular software can crash in the middle of a network request too (e.g. someone accidentally unplugged the wrong network cable, a power outage, etc.).

    Crash-only software is more likely to have tested recovery from such situations.

    15. WhyNotHugo ◴[] No.40220829[source]
    The time it takes for some systems to shut down doesn't make sense to me.

    My Alpine laptop takes about 3 seconds to shut down (which, honestly, seems like a lot of time). A systemd-based system will give daemons 90000ms (90 seconds) to shut down by default, which is an absurdly high amount of time (what kind of service can't exit in a few seconds?).

    Honestly, I think the kernel mostly needs to flush its caches, SIGTERM all processes, and then halt. There's no reason for this to take more than 1s on a modern system, and if something takes too long to handle SIGTERM, it'll just go through recovery next time.
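
    In a crash-only design the SIGTERM handler can be nearly empty, e.g.:

        import signal
        import sys

        # Crash-only stance: no long cleanup on SIGTERM. Anything durable is
        # already on disk (or will be rebuilt by recovery on next start).
        signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))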

    replies(1): >>40231125 #
    16. Izkata ◴[] No.40231125[source]
    > 90000ms to shut down, which is an absurdly high amount of time (what kind of service can't exit in a few seconds?)

    I think I remember reading that the biggest culprit nowadays (when specifically looking into why Ubuntu takes so long to shut down) is snapd: it unmounts all its squashfs filesystems one by one before exiting.

    replies(1): >>40235880 #
    17. fl0ki ◴[] No.40235836[source]
    Intentional crashing can be fine. Unintentional crashing with telemetry can be fine because you're going to fix it.

    Unintentional crashing without telemetry is terrible. I've seen too many systems built to "just panic because it'll restart and retry" that never converge because the retry hits the same conditions and no thought was put into how to monitor what is going wrong.

    As you all know, such systems tend to also neglect jitter and backoff so the retrying clients also hot-loop slamming every dependency, even ones that weren't erroring prior to the crash.
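
    The usual fix is capped exponential backoff with full jitter, so a fleet of restarting clients spreads out instead of hot-looping in lockstep. A sketch:

        import random
        import time

        def backoff_with_jitter(attempt, base=0.5, cap=30.0):
            # Full jitter: sleep a uniformly random time up to the capped
            # exponential, so synchronized crash-restart cycles drift apart.
            return random.uniform(0, min(cap, base * (2 ** attempt)))

        def call_with_retries(op, max_attempts=8):
            for attempt in range(max_attempts):
                try:
                    return op()
                except Exception:
                    # Record telemetry *before* sleeping, so a loop that
                    # never converges is visible to monitoring.
                    time.sleep(backoff_with_jitter(attempt))
            raise RuntimeError("gave up; escalate instead of hot-looping")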

    I've seen people shell into k8s pods and poke around at files manually for an all-nighter because they didn't invest even one hour in telemetry beforehand. Even that was a second penance for the first crime: finding out about an outage because of a user escalation rather than an automated alert.

    Ironically, at times, some attempt at monitoring was made but undermined by the crash, e.g. Prometheus metrics were exported but lost before they could be scraped.

    We have a long way to go educating most developers about production maturity before it's safe to endorse crashing without accounting for the downsides.

    This was written in 2006 when monitoring was barely on anyone's radar. It's understandable in that context. People reading it in a modern context have to BYO production maturity.

    18. fl0ki ◴[] No.40235880{3}[source]
    Looking on the bright side, I would like to thank Ubuntu for self-destructing so comprehensively that it's exposed the Linux user base to other distributions that still respect their users.
    19. mpweiher ◴[] No.40238185[source]
    macOS implemented this concept with "sudden termination".

    Applications that opt in and announce themselves as "clean" (via a flag in a shared page, last I checked) can be killed at any time by the system via kill -9.
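
    The native calls are -[NSProcessInfo enableSuddenTermination] and disableSuddenTermination (apps can also opt in at launch with the NSSupportsSuddenTermination Info.plist key). A sketch via the PyObjC bridge, with a hypothetical flush step:

        from Foundation import NSProcessInfo  # PyObjC bridge to Cocoa

        def write_state_to_disk():
            pass  # stand-in: flush whatever must survive a SIGKILL

        info = NSProcessInfo.processInfo()
        # The flag is counted, not boolean: balance every disable with enable.
        info.disableSuddenTermination()  # entering a dirty section
        write_state_to_disk()
        info.enableSuddenTermination()   # clean again; OS may SIGKILL us now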

    replies(1): >>40245503 #
    20. exikyut ◴[] No.40245503[source]
    How does this work in terms of UI?

    Obviously, if there are windows on the screen, that doesn't make sense.

    Does this just apply to applications that are technically running but don't have any windows open?

    Kinda makes sense I guess.

    Would be even cooler if the feature screenshotted minimized windows and worked in that case too, but no one's doing that already, so uptake would likely not be too high.

    replies(1): >>40252641 #
    21. sillywalk ◴[] No.40252641{3}[source]
    Not sure how much this has changed since 2011...

    From one of John Siracusa's Mac OS X reviews at Ars Technica [0]:

    "Sudden Termination, a feature that was introduced in Snow Leopard, allows applications to indicate to the system that it's safe to kill them "impolitely" (i.e., by sending them SIGKILL, causing them to terminate immediately, with no chance for potentially time-consuming clean-up operations to execute). Applications are expected to set this bit when they're sure they're not in the middle of doing something, have no open files, no unflushed buffers, and so on.

    This feature enables Snow Leopard to log out, shut down, and restart more quickly than earlier versions of Mac OS X. When it can, the OS simply kills processes instead of politely asking them to exit. (When Snow Leopard was released, Apple made sure its own applications and daemon processes supported Sudden Termination, even if third-party applications didn't.)

    Lion includes a new feature called Automatic Termination. Whereas Sudden Termination lets an application tell the system when it's okay to terminate it with extreme prejudice, Automatic Termination lets an application tell the system that it's okay to politely ask the program to exit.

    But wait, isn't it always okay for the OS to politely ask an application to exit? Isn't that what's always happened in Mac OS X on logout, shutdown, or restart? Yes, but what makes Automatic Termination different is when and why this might happen. In Lion, the OS may terminate applications that are not in use in order to reclaim resources—primarily memory, but also things like file descriptors, CPU cycles, and processes.

    You read that right. Lion will quit your running applications behind your back if it decides it needs the resources, and if you don't appear to be using them. The heuristic for determining whether an application is "in use" is very conservative: it must not be the active application, it must have no visible, non-minimized windows—and, of course, it must explicitly support Automatic Termination.

    Automatic Termination works hand-in-hand with autosave. Any application that supports Automatic Termination should also support autosave and document restore. Since only applications with no visible windows are eligible for Automatic Termination, and since by default the Dock does not indicate whether or not an application is running, the user might not even notice when an application is automatically terminated by the system. No dialog boxes will ask about unsaved changes, and when the user clicks on the application in the Dock to reactivate it, it should relaunch and appear exactly as it did before it was terminated."

    [0] https://arstechnica.com/gadgets/2011/07/mac-os-x-10-7/8/