I've had to deal with buggy mainframe software whose error messages had no relation to how much an operation succeeded. (And no way to ask it after the fact...) Welcome to the special hell.
(The obvious counterargument being that if there are different ways in which the software can crash, this is still not an adequate defense.)
This is a critical point for reliable single-machine systems, and for reliable distributed systems. Distributed systems avoid many classes of crashes through redundancy, allowing the overall system to recover (often with no impact) from the failure or crash of a single node. This provides an additional path to crash recovery: recovering from peers or replicas rather than from local state. In turn, this can simplify the tracking of local state (especially the kind of per-replica WAL or redo log accounting that database systems have to do), leading to improved performance and avoiding bugs.
But, as with single-system crashes, distributed systems need to deal with their own reality: correlated failures. These can be caused by correlated infrastructure failures (power, cooling, etc), by operations (e.g. deploying buggy software), or by the very data they're processing (e.g. a "poison pill" that crashes all the redundant nodes at once). And so, like the crash-only case with single-system software, reliable distributed systems need to be designed to recover from these correlated failure cases.
The constants are interestingly different, though. Single-system annual interrupt rates (AIR) are typically in the 1-10% range, while systems spread over multiple datacenters can feasibly see correlated failure rates several orders of magnitude lower. This could argue that having a "bad day" recovery path that's more expensive than regular node recovery is OK. Or, it could argue that the only feasible way of making sure that "bad day" recovery works is to exercise it often (which goes back to the crash-only argument).
Crash-only software is likely to test recovery from such situations.
My Alpine laptop takes about 3 seconds to shut down (which, honestly, seems like a lot of time). systemd-based systems will give daemons 90 seconds (90000ms) to shut down, which is an absurdly long time (what kind of service can't exit in a few seconds?).
Honestly, I think that mostly the kernel needs to flush its caches, SIGTERM all processes and then halt. There's no reason for this to take more than 1s on a modern system, and if something takes too long to handle SIGTERM, then it'll go through recovery next time.
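That stance translates naturally into how a daemon handles SIGTERM. A minimal sketch in Python (the function names are mine, purely illustrative): do only the quick, essential cleanup, exit immediately, and let anything slower fall through to crash recovery on the next start.

```python
import signal
import sys

def install_fast_sigterm_handler(quick_cleanup):
    """Register a SIGTERM handler that does only fast, essential work
    (e.g. flushing a small buffer) and exits right away. Anything that
    would take longer is deliberately skipped: crash-only software
    relies on its recovery path to repair state on the next startup."""
    def handler(signum, frame):
        quick_cleanup()   # must be cheap; no lengthy teardown here
        sys.exit(0)       # exit promptly so shutdown stays under ~1s
    signal.signal(signal.SIGTERM, handler)
```

The point is that the handler's budget is tiny by design; if the process can't meet it, being killed and going through recovery is the intended behavior, not a failure mode.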
I think I remember reading that the biggest culprit nowadays (when specifically looking into why Ubuntu takes so long to shut down) is snapd, which unmounts all its squashfs filesystems one by one before exiting.
Unintentional crashing without telemetry is terrible. I've seen too many systems built to "just panic because it'll restart and retry" that never converge because the retry hits the same conditions and no thought was put into how to monitor what is going wrong.
As you all know, such systems tend to also neglect jitter and backoff so the retrying clients also hot-loop slamming every dependency, even ones that weren't erroring prior to the crash.
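For anyone who hasn't seen it spelled out, a minimal sketch of what those systems are missing ("full jitter" capped exponential backoff; the function names here are mine, not from any particular library):

```python
import random
import time

def backoff_with_jitter(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """'Full jitter' backoff: sleep a random amount between 0 and
    min(cap, base * 2**attempt), so a fleet of restarting clients
    spreads out instead of hot-looping against every dependency in
    lockstep after a crash."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(op, max_attempts: int = 5):
    """Retry op() with capped, jittered exponential backoff; re-raise
    the last error once attempts are exhausted so it can be alerted on
    instead of silently looping forever."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error for monitoring
            time.sleep(backoff_with_jitter(attempt))
```

The jitter matters as much as the backoff: without it, every client that crashed at the same moment retries at the same moment too.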
I've seen people shell into k8s pods and poke around at files manually for an all-nighter because they didn't invest even one hour in telemetry beforehand. Even that was a second penance for the first crime: finding out about an outage because of a user escalation rather than an automated alert.
Ironically, at times, some attempt at monitoring was made but undermined by the crash, e.g. Prometheus metrics were exported but lost before they could be scraped.
We have a long way to go educating most developers about production maturity before it's safe to endorse crashing without accounting for the downsides.
This was written in 2006 when monitoring was barely on anyone's radar. It's understandable in that context. People reading it in a modern context have to BYO production maturity.
Obviously if there are windows on the screen that clearly doesn't make sense.
Does this just apply to applications that are technically running but don't have any windows open?
Kinda makes sense I guess.
Would be even cooler if the feature screenshotted minimized windows and also worked in that case, but no one's doing that already, so uptake would likely not be too high.
From one of John Siracusa's Mac OS X reviews at Ars Technica[0]:
"Sudden Termination, a feature that was introduced in Snow Leopard, allows applications to indicate to the system that it's safe to kill them "impolitely" (i.e., by sending them SIGKILL, causing them to terminate immediately, with no chance for potentially time-consuming clean-up operations to execute). Applications are expected to set this bit when they're sure they're not in the middle of doing something, have no open files, no unflushed buffers, and so on.
This feature enables Snow Leopard to log out, shut down, and restart more quickly than earlier versions of Mac OS X. When it can, the OS simply kills processes instead of politely asking them to exit. (When Snow Leopard was released, Apple made sure its own applications and daemon processes supported Sudden Termination, even if third-party applications didn't.)
Lion includes a new feature called Automatic Termination. Whereas Sudden Termination lets an application tell the system when it's okay to terminate it with extreme prejudice, Automatic Termination lets an application tell the system that it's okay to politely ask the program to exit.
But wait, isn't it always okay for the OS to politely ask an application to exit? Isn't that what's always happened in Mac OS X on logout, shutdown, or restart? Yes, but what makes Automatic Termination different is when and why this might happen. In Lion, the OS may terminate applications that are not in use in order to reclaim resources—primarily memory, but also things like file descriptors, CPU cycles, and processes.
You read that right. Lion will quit your running applications behind your back if it decides it needs the resources, and if you don't appear to be using them. The heuristic for determining whether an application is "in use" is very conservative: it must not be the active application, it must have no visible, non-minimized windows—and, of course, it must explicitly support Automatic Termination.
Automatic Termination works hand-in-hand with autosave. Any application that supports Automatic Termination should also support autosave and document restore. Since only applications with no visible windows are eligible for Automatic Termination, and since by default the Dock does not indicate whether or not an application is running, the user might not even notice when an application is automatically terminated by the system. No dialog boxes will ask about unsaved changes, and when the user clicks on the application in the Dock to reactivate it, it should relaunch and appear exactly as it did before it was terminated."
[0] https://arstechnica.com/gadgets/2011/07/mac-os-x-10-7/8/