
449 points by lemper | 1 comment
benrutter ◴[] No.45036836[source]
> software quality doesn't appear because you have good developers. It's the end result of a process, and that process informs both your software development practices, but also your testing. Your management. Even your sales and servicing.

If you take only one thing away from this article, it should be this one! The Therac-25 incident is a horrifying and important part of software history. It's really easy to think type systems, unit testing and defensive coding can solve all software problems. They can definitely help a lot, but the real failure in the Therac-25 story, as I understand it, is that it took far too long for incidents to be reported, investigated and fixed.

There was a great Cautionary Tales podcast about the device recently[0]. One thing it mentioned was that, even aside from the catastrophic accidents, users routinely saw Therac-25 machines show unexplained errors, but these issues never made it to the desk of someone who might fix them.

[0] https://timharford.com/2025/07/cautionary-tales-captain-kirk...

replies(13): >>45036898 #>>45037054 #>>45037090 #>>45037874 #>>45038109 #>>45038360 #>>45038467 #>>45038827 #>>45043421 #>>45044645 #>>45046867 #>>45046969 #>>45047517 #
WalterBright ◴[] No.45044645[source]
I'm going to disagree.

I have years of experience at Boeing designing aircraft parts. The guiding principle is that no single failure should cause an accident.

The way to accomplish this is not "write quality software", nor is it "test the software thoroughly". The idea is "assume the software does the worst possible thing. Then make sure that there's an independent system that will prevent that worst case."

For the Therac-25, that means a detector measuring the radiation actually being generated, which cuts the beam off if it exceeds a safe value. I'd also require that the radiation generator be physically incapable of producing an excessive dose.
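
Something like this is the shape I have in mind (a rough sketch only; the names read_dose_chamber, trip_beam_interlock and the limit value are invented for illustration, not from any real system):

    # Hypothetical illustration of an independent safety monitor. It shares
    # nothing with the treatment/control software: it reads its own sensor,
    # holds its own hard limit, and has its own path to dropping the beam.

    MAX_SAFE_DOSE = 200.0  # made-up hard limit, held in the monitor, not in the control software

    class DoseMonitor:
        """Independent watchdog for the beam.

        Assumes the control software does the worst possible thing and trips
        a physical interlock the moment the measured dose exceeds the hard
        limit -- it never asks the control software for permission.
        """

        def __init__(self, read_dose_chamber, trip_beam_interlock):
            self.read_dose_chamber = read_dose_chamber      # direct sensor readout
            self.trip_beam_interlock = trip_beam_interlock  # physically kills the beam

        def check(self) -> bool:
            dose = self.read_dose_chamber()
            if dose > MAX_SAFE_DOSE:
                self.trip_beam_interlock()
                return False
            return True

The property that matters is independence: the monitor has its own sensor, its own hard limit and its own way to kill the beam, so no single failure in the control software can defeat both.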

replies(9): >>45045090 #>>45045473 #>>45046078 #>>45046192 #>>45047920 #>>45048437 #>>45048717 #>>45049878 #>>45049910 #
1. benrutter ◴[] No.45049878[source]
It's not that I don't think that's important, but with any set of checks you always run into the problem of needing N+1 of them: who checks the check? (Please don't take this as an argument against checks, though.)

The Therac-25 was meant to have exactly that kind of detector, cutting the beam off if a safe value was exceeded, but it didn't work. It could obviously have been built better, but you always face the question: what if our check doesn't work?

In the case of the Therac-25, my understanding (I should make clear I'm not an expert here) is that if the initial failures had been reported and investigated, the issues would have become apparent and the machine could have been recalled before any of the fatal incidents happened.

In a Swiss cheese model of risk, you want as many layers as possible, and your point about a detector fits in there. But the final layer should always be: if an incident does happen and something gets past all our checks, how do we make it likely that it gets fully investigated by the right person?
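
To put a rough shape on that (purely illustrative; deliver, escalate and the other names are made up, not from any real system):

    # Toy sketch of the Swiss cheese idea: several independent layers each get
    # a chance to block a bad treatment, and the final "layer" isn't a check at
    # all -- any anomaly, before or after delivery, is escalated to a human
    # rather than dismissed at the console.

    def deliver(plan, layers, escalate, fire_beam, measure_dose):
        """Run one treatment through several independent safety layers.

        layers   -- callables returning (ok, reason); each may itself be wrong.
        escalate -- hypothetical hook that files an incident with a person who
                    can investigate, instead of silently clearing the error.
        """
        for layer in layers:
            ok, reason = layer(plan)
            if not ok:
                escalate(f"{layer.__name__} blocked plan: {reason}")
                return False

        fire_beam(plan)              # hypothetical hardware call
        measured = measure_dose()    # independent readback of what was delivered

        if abs(measured - plan.dose) > plan.tolerance:
            # Something got past every check: make sure it reaches the right
            # person rather than showing up as yet another unexplained error.
            escalate(f"dose discrepancy: planned {plan.dose}, measured {measured}")
            return False
        return True

The design point is the escalate() path: in the Therac-25 story the unexplained errors were there, they just never reached anyone in a position to act on them.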