    449 points lemper | 19 comments
    benrutter ◴[] No.45036836[source]
    > software quality doesn't appear because you have good developers. It's the end result of a process, and that process informs both your software development practices, but also your testing. Your management. Even your sales and servicing.

    If you only take one thing away from this article, it should be this one! The Therac-25 incident is a horrifying and important part of software history, and it's really easy to think type systems, unit testing and defensive coding can solve all software problems. They definitely can help a lot, but the real failure in the story of the Therac-25, from my understanding, is that it took far too long for incidents to be reported, investigated and fixed.

    There was a great Cautionary Tales podcast about the device recently[0]. One thing it mentioned was that, even aside from the catastrophic accidents, users routinely saw Therac-25 machines throw unexplained errors, but those issues never made it to the desk of someone who might have fixed them.

    [0] https://timharford.com/2025/07/cautionary-tales-captain-kirk...

    replies(13): >>45036898 #>>45037054 #>>45037090 #>>45037874 #>>45038109 #>>45038360 #>>45038467 #>>45038827 #>>45043421 #>>45044645 #>>45046867 #>>45046969 #>>45047517 #
    1. WalterBright ◴[] No.45044645[source]
    I'm going to disagree.

    I have years of experience at Boeing designing aircraft parts. The guiding principle is that no single failure should cause an accident.

    The way to accomplish this is not "write quality software", nor is it "test the software thoroughly". The idea is "assume the software does the worst possible thing. Then make sure that there's an independent system that will prevent that worst case."

    For the Therac-25, that means a detector of the amount of radiation being generated, which will cut it off if it exceeds a safe value. I'd also add that the radiation generator should be physically incapable of generating excessive radiation.
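
    As a rough sketch of what I mean (the sensor/relay interface and the limit here are invented, not how the Therac-25 or any real machine is built), the independent system shares no code or state with the software that plans and delivers the treatment:

        # Hypothetical sketch of an independent safety interlock. It knows nothing
        # about the treatment software; it only watches the measured dose rate.
        import time

        MAX_SAFE_DOSE_RATE = 100.0  # invented limit, arbitrary units

        def dose_watchdog(read_dose_rate, trip_beam_cutoff, poll_interval_s=0.01):
            """Poll an independent dose-rate sensor; trip a hardware cutoff on overdose.

            read_dose_rate and trip_beam_cutoff stand in for real sensor/relay
            drivers; the cutoff should latch the beam off until a human resets it.
            """
            while True:
                if read_dose_rate() > MAX_SAFE_DOSE_RATE:
                    trip_beam_cutoff()
                    return
                time.sleep(poll_interval_s)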

    replies(9): >>45045090 #>>45045473 #>>45046078 #>>45046192 #>>45047920 #>>45048437 #>>45048717 #>>45049878 #>>45049910 #
    2. vjvjvjvjghv ◴[] No.45045090[source]
    In general I agree, but there is a bit more complexity. I work in medical devices, and there are plenty of situations where a certain output is fine in one circumstance but deadly in another. That makes a stopgap a little trickier.

    I agree with the previous poster that feedback from the field is badly lacking. A lot of doctors don't report problems because they are used to bad interfaces, and then the feedback gets filtered through several layers of sales reps and product management. So a lot of information gets lost, and fixes that could be simple won't get done.

    In general, when you work in medical devices you are so overwhelmed by documentation and regulation that there isn't much time left for proper engineering. The FDA mostly looks at whether the documentation was done right, and less at whether the product was done right.

    replies(2): >>45045452 #>>45045655 #
    3. WalterBright ◴[] No.45045452[source]
    At Boeing there's a required "failure analysis" document listing all the failure modes and why they won't cause a crash by themselves.
    replies(2): >>45048051 #>>45049126 #
    4. graypegg ◴[] No.45045473[source]
    I think the range of radiation doses might vary too much to make the radiation source a totally isolated system, but to keep it a simple physical lockout, I could imagine the start-up process involving inserting a small module containing a fuse that breaks at a certain current, swappable for different therapies or something. You could even add a simple spring-and-electromagnet mechanism that kicks the module out when power is cut, so radiotechs have to at least acknowledge the fuse before each start-up.

    I will say that me, a web developer, pretending to know how best to design medical equipment is pretty full of myself, haha. I highly doubt anything I'm spouting is a new idea. The idea of working on these sorts of high-reliability, high-recoverability systems seems really interesting though!

    5. darepublic ◴[] No.45045655[source]
    16,000 to 25,000 rads, right? Not safe under any circumstance?
    replies(1): >>45046950 #
    6. philjohn ◴[] No.45046078[source]
    This.

    One of the biggest things I see in junior engineers that I mentor (working on high-throughput, low-latency, distributed backend systems) is not working out all of the various failure modes your system will likely encounter.

    Network partitions, primary database outage, caching layer outage, increased latency ... all of these things can throw a spanner in the works, but until you've experienced them (or had a strong mentor guide you) it's all abstract and difficult to see when the happy path is right there.

    I've recently entirely re-architected a critical component, and part of this was defense in depth. Stuff is going to go wrong, so having a second or even third line of defense is important.
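
    For a backend service those extra lines of defense often look something like the sketch below: a bounded timeout on the primary call, a circuit breaker so a dead dependency isn't hammered, and a stale-cache fallback. All the names and thresholds are invented; it's an illustration of the layering, not a production implementation.

        import time

        class CircuitBreaker:
            """First line of defense: stop hammering a dependency that is clearly down."""

            def __init__(self, failure_threshold=5, reset_after_s=30.0):
                self.failure_threshold = failure_threshold
                self.reset_after_s = reset_after_s
                self.failures = 0
                self.opened_at = None

            def allow(self):
                if self.opened_at is None:
                    return True
                if time.monotonic() - self.opened_at > self.reset_after_s:
                    # Half-open: let one request probe whether the dependency recovered.
                    self.opened_at = None
                    self.failures = 0
                    return True
                return False

            def record(self, ok):
                if ok:
                    self.failures = 0
                    return
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()

        def get_value(key, fetch_from_primary, fetch_from_cache, breaker):
            """Primary store, then a possibly stale cache, then an explicit degraded answer."""
            if breaker.allow():
                try:
                    value = fetch_from_primary(key, timeout=0.2)  # bounded wait, never infinite
                    breaker.record(ok=True)
                    return value
                except Exception:
                    breaker.record(ok=False)
            stale = fetch_from_cache(key)  # second line of defense: stale beats nothing
            if stale is not None:
                return stale
            return None  # third line: the caller must handle "no answer" explicitly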

    replies(2): >>45047190 #>>45048063 #
    7. layman51 ◴[] No.45046192[source]
    I might also add that, apparently, older versions of the machine had physical “hardware interlocks” that made accidents less likely no matter what the software was doing. So the older software was probably just assumed to be reliable, when in fact a physical mechanism was helping it not kill someone. On a less serious note, that's part of why car doors might still have keyholes even if they normally open in a fancier way with electronic fobs.
    8. sgerenser ◴[] No.45046950{3}[source]
    Completely safe as long as the block of metal was in place. So you couldn't just prevent the machine from putting out that much energy; you had to prevent it from putting it out without the block in place.
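
    In other words, the safety condition is on the combination of beam energy and target position, not on the energy alone. As a toy illustration (invented names, not the real radiotherapy logic):

        def beam_permitted(mode, target_in_beam_path):
            """Toy interlock: the check is on the combination, not the energy alone."""
            if mode == "xray":
                # X-ray mode uses the high-current beam and requires the tungsten target.
                return target_in_beam_path
            if mode == "electron":
                # Direct electron mode uses a much lower current, with the target out of the way.
                return not target_in_beam_path
            return False  # unknown mode: fail safe
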
    replies(1): >>45048706 #
    9. ◴[] No.45047190[source]
    10. technofiend ◴[] No.45048063[source]
    I recently had to argue a junior into leaving the health-check frequency alone on an ECS container: the regular log entries annoyed her, and since she didn't know how to filter logs, her solution was to drop the health checks to once every five minutes. Just one example of trying to get people to think about the unhappy path.
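
    The boring fix is to keep the check interval and silence the noise at the logging layer instead; roughly something like this (a generic sketch, not ECS-specific, names invented):

        import logging

        class DropHealthChecks(logging.Filter):
            """Drop access-log records for the (hypothetical) /health endpoint."""

            def filter(self, record):
                return "/health" not in record.getMessage()

        logging.getLogger("access").addFilter(DropHealthChecks())
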
    replies(1): >>45048985 #
    11. jonahx ◴[] No.45048437[source]
    That makes sense. But wouldn't the "write quality software" and "test the software thoroughly" still be relevant to the individual pieces? If the chance of a catastrophic failure is the product of the failure rates of the pieces, getting P(PartFail) low helps too -- even if having multiple backups is the main source of protection.
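
    To put made-up numbers on that (and assuming the failures are independent, which is the hard part in practice):

        # Invented probabilities, independence assumed.
        p_software_overdose = 1e-3   # software commands a dangerous dose
        p_interlock_fails = 1e-3     # independent hardware interlock fails to trip

        print(p_software_overdose * p_interlock_fails)        # ~1e-06 combined
        print(p_software_overdose / 10 * p_interlock_fails)   # ~1e-07: better parts still pay off
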
    12. Gud ◴[] No.45048706{4}[source]
    So there should have been an interlocking system.
    replies(1): >>45049829 #
    13. fulafel ◴[] No.45048717[source]
    The GP wasn't proposing processes on the software engineering side, though, but rather that "the real failure in the story of the Therac-25, from my understanding, is that it took far too long for incidents to be reported, investigated and fixed".
    14. dwedge ◴[] No.45048985{3}[source]
    That sounds more like a disaster waiting to happen than a junior. I find it difficult to believe that she didn't know the purpose of the health check, so it sounds like breaking things (making them someone else's problem) instead of addressing gaps in ability.
    replies(1): >>45049918 #
    15. Aloha ◴[] No.45049126{3}[source]
    Agreed - this is essentially the cornerstone of systems failure analysis, something I wish architects in the software space thought about more.

    I'm a product manager for an old (and, if I'm being honest, somewhat crusty) system of software. The software is buggy - all of it is - but it's also self-healing and resilient. So while yes, it fails with somewhat alarming regularity, and with lots and lots of concerning-looking error messages in the logs, it never causes an outage because it heals itself.

    Good systems design isn't making bug-free software or a bug-free system, but rather a system where a total outage requires N+1 (maybe even N+N) things to fail before the end user notices. Failures should be driven, at most, by edge cases - basically the system being operated outside its design parameters - and those parameters need to reflect the real world and be known by most stakeholders in the system.

    My gripe with software engineers is that they're sometimes too divorced from real users and real use cases, and too devoted to the written spec over what their users actually need to do with the software. I've seen some very elegant (and, on paper, well-designed) systems fall apart because of simple things like intermittent packet jitter or latency swings (say, between 10ms and 70ms) - these are real-world conditions, routinely encountered by real-world systems, but spec-driven designs fall apart once confronted with them.
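
    The cheap countermeasure is to confront the design with those conditions before production, for example by wrapping the dependency call in a jitter injector during tests. A rough sketch (names invented):

        import random
        import time

        def with_jitter(call, min_delay_s=0.010, max_delay_s=0.070):
            """Test-only wrapper: add 10-70ms of random latency in front of every call,
            roughly the latency swings a real network produces."""
            def jittery(*args, **kwargs):
                time.sleep(random.uniform(min_delay_s, max_delay_s))
                return call(*args, **kwargs)
            return jittery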

    16. tech2 ◴[] No.45049829{5}[source]
    The earlier model that the 25 replaced was fully mechanically interlocked; the belief was that software provided the same level of assurance. They performed manual testing, but what they weren't able to do was reach the level of speed and fluency with the system that triggered the failure modes behind the accidents. Lower hardware costs equals higher profit...
    17. benrutter ◴[] No.45049878[source]
    It's not that I don't think that's important, but with failures you always run into the issue of needing N+1 checks (please don't take this as an argument against checks, though).

    The Therac-25 was meant to have a detector of radiation levels that would cut things off if a safe value was exceeded, but it didn't work. It could obviously have been improved, but you always face the question: what if our check doesn't work?

    In the case of the Therac-25, if the initial failures had been reported and investigated, my understanding is (I should make clear I'm not an expert here) that the issues would have become apparent, and the machine could have been recalled before any of the fatal incidents happened.

    In a Swiss cheese model of risk you always want as many layers as possible, so your point about a detector fits in there. But the final layer should always be: if an incident does happen and something gets past all our checks, how do we make it likely that it gets fully investigated by the right person?

    18. Cthulhu_ ◴[] No.45049910[source]
    Great point. Earlier in my career - and I think many will recognise this - I was very diligent: thorough typing, unit tests, defensive programming, assertions at one point, the works.

    But this opens up a can of worms, because suddenly you have to deal with every edge case, test for every possible input, etc. This was before fuzz testing, too. Each line of defensive coding, every carefully crafted comment, added to the maintenance burden; I'd even go as far as to claim it increased uncertainty, because what if I forgot something?

    15 years later, and it feels like I'm doing far less advanced stuff (although in hindsight what I did then wasn't all that advanced either, I just made it so). One issue came up recently: a generic button component would render really tall if no label was given, which happened when a CMS editor left the label empty in an attempt to hide the button. The knee-jerk response would be to add a check that disallows empty labels, or to not render the button if no label is given, or to use a default label.

    But now I think I'll fix the rendering bug and just... leave the rest. A button with an empty label isn't catastrophic. Writing rules for every possible edge case (empty label, whitespace, UTF-8 characters escaping the bounds, text that's too long, text that's too short, non-text, the list goes on) just adds maintenance and complexity. And it's just a button.

    19. Cthulhu_ ◴[] No.45049918{4}[source]
    The junior part there is that this person still believes they can / should read and comprehend all logs themselves. This just isn't viable at scale.

    But the same goes for code itself: a junior will have code that is "theirs", while a medior/senior will (likely) work at scales where they can't keep it all in their heads. And that's when all the software development best practices come into play.