
449 points lemper | 2 comments | source
benrutter ◴[] No.45036836[source]
> software quality doesn't appear because you have good developers. It's the end result of a process, and that process informs both your software development practices, but also your testing. Your management. Even your sales and servicing.

If you only take one thing away from this article, it should be this one! The Therac-25 incident is a horrifying and important part of software history. It's really easy to think type systems, unit testing and defensive coding can solve all software problems. They can definitely help a lot, but the real failure in the story of the Therac-25, as I understand it, is that it took far too long for incidents to be reported, investigated and fixed.

There was a great Cautionary Tales podcast about the device recently[0]. One thing it mentioned was that, even aside from the catastrophic accidents, users routinely saw Therac-25 machines show unexplained errors, but those issues never made it to the desk of someone who might fix them.

[0] https://timharford.com/2025/07/cautionary-tales-captain-kirk...

replies(13): >>45036898 #>>45037054 #>>45037090 #>>45037874 #>>45038109 #>>45038360 #>>45038467 #>>45038827 #>>45043421 #>>45044645 #>>45046867 #>>45046969 #>>45047517 #
WalterBright ◴[] No.45044645[source]
I'm going to disagree.

I have years of experience at Boeing designing aircraft parts. The guiding principle is that no single failure should cause an accident.

The way to accomplish this is not "write quality software", nor is it "test the software thoroughly". The idea is "assume the software does the worst possible thing. Then make sure that there's an independent system that will prevent that worst case."

For the Therac-25, that means a detector that measures the radiation actually being generated and cuts the beam off if it exceeds a safe value. I'd also add that the radiation generator should be physically incapable of generating excessive radiation.
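
A minimal sketch of that pattern, with hypothetical sensor and relay hooks (a real interlock would run on independent hardware, not inside the control software):

    # Sketch of an independent interlock. read_dose_rate and cut_beam_power
    # are hypothetical hooks to a physical sensor and a physical relay; the
    # point is that this loop shares no code or state with the beam controller.
    SAFE_DOSE_RATE = 2.0  # hypothetical limit, in whatever units the sensor reports

    def interlock_loop(read_dose_rate, cut_beam_power):
        while True:
            if read_dose_rate() > SAFE_DOSE_RATE:
                cut_beam_power()  # trips a relay; does not ask the controller nicely
                raise SystemExit("interlock tripped: dose rate over limit")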

replies(9): >>45045090 #>>45045473 #>>45046078 #>>45046192 #>>45047920 #>>45048437 #>>45048717 #>>45049878 #>>45049910 #
vjvjvjvjghv ◴[] No.45045090[source]
In general I agree, but there is a bit more complexity. I work in medical devices, and there are plenty of situations where a certain output is fine in one circumstance but deadly in another. That makes a stopgap a little trickier.
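
To make that concrete with a toy example (modes, flags and numbers invented for illustration): an output that's within limits in one machine configuration can be far over the limit in another, so the interlock has to check the physical configuration, not just a single fixed threshold:

    # Toy sketch: the safe limit depends on the machine's physical state.
    SAFE_LIMITS = {
        "xray":     {"target_in_place": True,  "max_beam_current": 100.0},
        "electron": {"target_in_place": False, "max_beam_current": 1.0},
    }

    def output_is_safe(mode, target_in_place, beam_current):
        limit = SAFE_LIMITS[mode]
        if target_in_place != limit["target_in_place"]:
            return False  # wrong physical configuration is itself a fault
        return beam_current <= limit["max_beam_current"]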

I agree with the previous poster that feedback from the field is badly lacking. A lot of doctors don't report problems because they are used to bad interfaces, and what feedback there is gets filtered through several layers of sales reps and product management. So a lot of information gets lost, and fixes that could be simple never get done.

In general, when you work in medical devices you are so overwhelmed by documentation and regulation that there isn't much time left for proper engineering. The FDA mostly checks that the documentation was done right, and less that the product was done right.

replies(2): >>45045452 #>>45045655 #
1. WalterBright ◴[] No.45045452[source]
At Boeing there's a required "failure analysis" document listing all the failure modes and why they won't cause a crash by themselves.
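
A software analogue (entries invented for illustration) is to keep that analysis as data and fail the build when a failure mode has no named independent mitigation:

    # Hypothetical, miniature version of a failure-analysis table as code.
    FAILURE_MODES = [
        {"mode": "controller commands excessive beam current",
         "mitigation": "hardware dose-rate interlock cuts power"},
        {"mode": "position sensor returns stale reading",
         "mitigation": "watchdog halts the beam on a missed heartbeat"},
    ]

    def audit(failure_modes):
        missing = [f["mode"] for f in failure_modes if not f.get("mitigation")]
        assert not missing, f"unmitigated failure modes: {missing}"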
replies(2): >>45048051 #>>45049126 #
2. Aloha ◴[] No.45049126[source]
Agreed - this is essentially the cornerstone of systems failure analysis, something I wish architects thought about more in the software space.

I'm a product manager for an old (and, if I'm being honest, somewhat crusty) system of software. The software is buggy - all of it is - but it's also self-healing and resilient. So while it fails with somewhat alarming regularity, with lots and lots of concerning-looking error messages in the logs, it never causes an outage because it heals itself.

Good systems design isn't making bug-free software or a bug-free system, but rather a system where a total outage requires N+1 (maybe even N+N) things to fail before the end user notices. Failures should be driven, at most, by edge cases - basically where the system is being operated outside of its design parameters - and those parameters need to reflect the real world and be known to most stakeholders in the system.
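
In code, self-healing at the component level often looks like a supervisor with a restart budget (a rough sketch; `start_worker` is a stand-in for launching whatever component keeps crashing):

    # Rough supervisor sketch: workers are allowed to die and be restarted;
    # the system only counts as failed when the restart budget is exhausted.
    import time

    def supervise(start_worker, max_restarts=5, window_s=60.0):
        restarts = []
        while True:
            worker = start_worker()   # e.g. a started threading.Thread
            worker.join()             # returns when the worker dies
            now = time.monotonic()
            restarts = [t for t in restarts if now - t < window_s] + [now]
            if len(restarts) > max_restarts:
                raise RuntimeError("restart budget exhausted - page a human")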

My gripe with software engineers is that they're sometimes too divorced from real users and real use cases, and too devoted to the written spec over what their users actually need to do with the software. I've seen some very elegant (and, on paper, well-designed) systems fall apart because of simple things like intermittent packet jitter or latency swings (say, between 10ms and 70ms). These are real-world conditions, often encountered by real-world systems, but these spec-driven systems fall apart once confronted with reality.
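
One cheap way to confront a design with those conditions early is to inject them in tests, e.g. wrapping the transport in a shim that adds a 10-70ms latency swing (a sketch; `inner_send` stands in for whatever actually writes to the network):

    # Sketch of a jitter-injecting wrapper for tests.
    import random
    import time

    def jittery_send(inner_send, min_delay_s=0.010, max_delay_s=0.070):
        def send(packet):
            time.sleep(random.uniform(min_delay_s, max_delay_s))  # 10-70 ms swing
            return inner_send(packet)
        return send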