The Therac-25 Incident (2021)


benrutter ◴[27 Aug 25 08:18 UTC] No.45036836[source]▶

> software quality doesn't appear because you have good developers. It's the end result of a process, and that process informs both your software development practices, but also your testing. Your management. Even your sales and servicing.

If you only take one thing away from this article, it should be this one! The Therac-25 incident is a horrifying and important part of software history. It's really easy to think type systems, unit testing and defensive coding can solve all software problems. They definitely can help a lot, but the real failure in the story of the Therac-25, from my understanding, is that it took far too long for incidents to be reported, investigated and fixed.

There was a great Cautionary Tales podcast about the device recently[0]. One thing mentioned was that, even aside from the catastrophic accidents, Therac-25 machines were routinely seen by users to show unexplained errors, but these issues never made it to the desk of someone who might fix them.

[0] https://timharford.com/2025/07/cautionary-tales-captain-kirk...

replies(13): >>45036898 #>>45037054 #>>45037090 #>>45037874 #>>45038109 #>>45038360 #>>45038467 #>>45038827 #>>45043421 #>>45044645 #>>45046867 #>>45046969 #>>45047517 #

WalterBright ◴[27 Aug 25 20:18 UTC] No.45044645[source]▶

>>45036836 #

I'm going to disagree.

I have years of experience at Boeing designing aircraft parts. The guiding principle is that no single failure should cause an accident.

The way to accomplish this is not "write quality software", nor is it "test the software thoroughly". The idea is "assume the software does the worst possible thing. Then make sure that there's an independent system that will prevent that worst case."

For the Therac-25, that means a detector of the amount of radiation being generated, which will cut it off if it exceeds a safe value. I'd also add that the radiation generator be physically incapable of generating excessive radiation.
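
To make the idea concrete, here is a minimal sketch of an independent cutoff, written as a toy simulation. Nothing in it comes from the real Therac-25 or from any actual device: `DoseInterlock`, `simulate`, the dose units and the limit are all hypothetical names and values chosen for illustration. The point is only that the interlock trusts its own measurement and never consults the control software.

```python
class DoseInterlock:
    """Hypothetical independent monitor: it only sees detector readings,
    never the controller's internal state."""

    def __init__(self, max_dose: float):
        self.max_dose = max_dose  # assumed safe limit, arbitrary units
        self.accumulated = 0.0
        self.tripped = False

    def observe(self, measured_dose: float) -> None:
        """Feed one reading from the independent detector."""
        self.accumulated += measured_dose
        if self.accumulated > self.max_dose:
            self.tripped = True  # latched: stays tripped once set


def simulate(requested_doses, max_dose):
    """Assume the control software may request anything, tick by tick.

    The interlock sits in series with the beam: once it trips, no
    further dose is delivered regardless of what the software asks for.
    """
    interlock = DoseInterlock(max_dose)
    delivered = 0.0
    for dose in requested_doses:
        if interlock.tripped:
            break  # beam is cut off; the software's request is ignored
        delivered += dose        # beam fires as the software asked
        interlock.observe(dose)  # detector measures what was emitted
    return delivered, interlock.tripped


if __name__ == "__main__":
    # Faulty software keeps requesting dose well past the limit of 5.0.
    print(simulate([2.0, 2.0, 2.0, 2.0, 2.0], max_dose=5.0))
```

In this sketch the interlock only reacts after a reading exceeds the limit, so the overshoot is bounded by whatever a single tick can deliver. That is one way to see why the second suggestion matters: if the generator physically cannot emit an excessive dose in one tick, the worst case stays small.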

replies(9): >>45045090 #>>45045473 #>>45046078 #>>45046192 #>>45047920 #>>45048437 #>>45048717 #>>45049878 #>>45049910 #

philjohn ◴[27 Aug 25 22:22 UTC] No.45046078[source]▶

>>45044645 #

This.

One of the biggest things I see in junior engineers that I mentor (working in backend high-throughput, low-latency, distributed systems) is not working out all of the various failure modes their system will likely encounter.

Network partitions, primary database outage, caching layer outage, increased latency ... all of these things can throw a spanner in the works, but until you've experienced them (or had a strong mentor guide you) it's all abstract and difficult to see when the happy path is right there.

I've recently entirely re-architected a critical component, and part of this was defense in depth. Stuff is going to go wrong, so having a second or even third line of defense is important.
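
As a rough illustration of a second and third line of defense (not the component described above, whose details are not given), a lookup might try several sources in order and fall back to a safe default. The names `first_available`, `primary_db`, `cache_layer` and the default value are hypothetical, and the sketch assumes outages surface as `ConnectionError` or `TimeoutError`.

```python
from typing import Callable, Sequence


def first_available(sources: Sequence[Callable[[], str]], default: str) -> str:
    """Return the result of the first source that answers.

    Each source is one line of defense. Only connection and timeout
    failures are treated as "try the next one"; any other exception is
    a bug and is allowed to propagate.
    """
    for source in sources:
        try:
            return source()
        except (ConnectionError, TimeoutError):
            continue  # this line of defense failed, try the next
    return default  # last resort: a safe, possibly stale, value


def primary_db() -> str:
    """Hypothetical primary database call, simulated as being down."""
    raise ConnectionError("primary database outage")


def cache_layer() -> str:
    """Hypothetical caching layer that still has a copy."""
    return "value-from-cache"


if __name__ == "__main__":
    print(first_available([primary_db, cache_layer], default="safe-default"))
```

Whether a stale or default answer is acceptable depends on the system; for some data, failing loudly is the safer fallback.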

replies(2): >>45047190 #>>45048063 #