The Therac-25 Incident (2021)

(thedailywtf.com)

449 points lemper | 1 comments | 27 Aug 25 06:57 UTC


benrutter ◴[27 Aug 25 08:18 UTC] No.45036836[source]▶

> software quality doesn't appear because you have good developers. It's the end result of a process, and that process informs both your software development practices, but also your testing. Your management. Even your sales and servicing.

If you only take one thing away from this article, it should be this one! The Therac-25 incident is a horrifying and important part of software history, and it's really easy to think type systems, unit testing and defensive coding can solve all software problems. They definitely can help a lot, but the real failure in the story of the Therac-25, from my understanding, is that it took far too long for incidents to be reported, investigated and fixed.

There was a great Cautionary Tales podcast about the device recently[0]. One thing mentioned was that, even aside from the catastrophic accidents, users routinely saw Therac-25 machines show unexplained errors, but these issues never made it to the desk of someone who might fix them.

[0] https://timharford.com/2025/07/cautionary-tales-captain-kirk...

replies(13): >>45036898 #>>45037054 #>>45037090 #>>45037874 #>>45038109 #>>45038360 #>>45038467 #>>45038827 #>>45043421 #>>45044645 #>>45046867 #>>45046969 #>>45047517 #

0xDEAFBEAD ◴[27 Aug 25 11:50 UTC] No.45038360[source]▶

>>45036836 #

Honestly I wish instead of the Therac-25, we were discussing a system which made use of unit testing and defensive coding, yet still failed. That would be more educational. It's too easy to look at the Therac-25 and think "I would never write a mess like that".

replies(5): >>45038635 #>>45038899 #>>45042566 #>>45044920 #>>45046431 #

1. jopsen ◴[27 Aug 25 20:39 UTC] No.45044920[source]▶

>>45038360 #

I'd agree, it's super easy to think such errors wouldn't have happened had they just used a fairly safe language and a sane architecture. Or unit tests, race detectors, etc.

I suspect that few organizations that do all that have a process/culture of ignoring bugs in the wild -- and those that do have such complicated domains that explaining the error is hard.

Software best practices today would probably also involve sending metrics, logs, error reports, etc.

That said, it's still extremely easy to embrace a culture where unexplainable errors are ignored. Especially in a cloud environment.
