←back to thread

449 points lemper | 9 comments | | HN request time: 0.407s | source | bottom
Show context
benrutter ◴[] No.45036836[source]
> software quality doesn't appear because you have good developers. It's the end result of a process, and that process informs both your software development practices, but also your testing. Your management. Even your sales and servicing.

If you only take one thing away from this article, it should be this one! The Therac-25 incident is a horrifying and important part of software history, it's really easy to think type-systems, unit-testing and defensive-coding can solve all software problems. They definitely can help a lot, but the real failure in the story of the Therac-25 from my understanding, is that it took far too long for incidents to be reported, investigated and fixed.

There was a great Cautionary Tales podcast about the device recently[0], one thing mentioned was that, even aside from the catasrophic accidents, Therac-25 machines were routinely seen by users to show unexplained errors, but these issues never made it to the desk of someone who might fix it.

[0] https://timharford.com/2025/07/cautionary-tales-captain-kirk...

replies(13): >>45036898 #>>45037054 #>>45037090 #>>45037874 #>>45038109 #>>45038360 #>>45038467 #>>45038827 #>>45043421 #>>45044645 #>>45046867 #>>45046969 #>>45047517 #
1. vorgol ◴[] No.45037090[source]
I was going to recommend that exact podcast episode but you beat me to it. Totally worth listening, especially if you're interested in software bugs.

Another interesting fact mentioned in the podcast is that the earlier (manually operated) version of the machine did have the same fault. But it also had a failsafe fuse that blew so the fault never materialized. Excellent demonstration of the Swiss Cheese Model: https://en.wikipedia.org/wiki/Swiss_cheese_model

replies(2): >>45038446 #>>45042423 #
2. bell-cot ◴[] No.45038446[source]
>> the real failure in the story of the Therac-25 from my understanding, is that it took far too long for incidents to be reported, investigated and fixed.

> the earlier (manually operated) version of the machine did have the same fault. But it also had a failsafe fuse that blew so the fault never materialized.

#1 virtue of electromechanical failsafes is that their conception, design, implementation, and failure modes tend to be orthogonal to those of the software. One of the biggest shortcomings of Swiss Cheese safety thinking is that you too-often end up using "neighbor slices from the same wheel of cheese".

#2 virtue of electromechanical failsafes is that running into them (the fuse blew, or whatever) is usually more difficult for humans to ignore. Or at least it's easier to create processes and do training that actually gets the errors reported up the chain. (Compared to software - where the worker bees all know you gotta "ignore, click 'OK', retry, reboot" all the time, if you actually want to get anything done):

But, sadly, electromechanical failsafes are far more expensive then "we'll just add some code to check that" optimism. And PHB's all know that picking up nickles in front of the steamroller is how you get to the C-suite.

replies(2): >>45042868 #>>45044729 #
3. ipython ◴[] No.45042423[source]
Don’t worry we are poised to re learn all these lessons once again with our fancy new agentic generative ai systems.

The mechanical interlock essentially functioned as a limit outside of the control system. So you should build an ai system the same way- enforcing restrictions on the security agency from outside the control of the ai itself. Of course that doesn’t happen and devs naively trust that the ai can make its own security decisions.

Another lesson from that era we are re learning- in-band signaling. Our 2025 version of the “blue box” is in full swing. Prompt injection is just a side effect of the fact that there is no out of band instruction mechanism for llms.

Good news is - it’s not hard to learn the new technology when it’s just a matter of rediscovering the same security issues with a new name!

4. snerbles ◴[] No.45042868[source]
When I worked at an industrial integrator, we had a hard requirement for hard-wired e-stop circuits run by safety relays separate from the PLC. Sometimes we had to deal with dangerous OEM equipment that had software interlocks, and the solution was usually just to power the entire offending device down when someone hit an e-stop or opened a guarding panel.

About a decade ago a rep from Videojet straight up lied to us about their 30W CO2 marking laser having a hardware interlock. We found out when - in true Therac-25 fashion - the laser kept triggering despite the external e-stop being active due to a bug in their HMI touch panel. No one noticed until it eventually burned through the lens cap. In reality the interlock was a separate kit, and they left it out to reduce the cost for their bid to the customer. That whole incident really soured my opinion of them and reminded me of just how bad software "safety" can get.

replies(2): >>45044753 #>>45120124 #
5. WalterBright ◴[] No.45044729[source]
> And PHB's all know that picking up nickles in front of the steamroller is how you get to the C-suite.

Blaming it on PHB's is a mistake. There were no engineering classes in my degree program about failsafe design. I've known too many engineers who were insulted by my insinuations that their design had unacceptable failure modes. They thought they could write software that couldn't possibly fail. They'd also tell me that they could safely recover and continue executing a crashed program.

This is why I never, ever trust software switches to disable a microphone, software switches that disable disk writes, etc. The world is full of software bugs that enable overriding of their soft protections.

BTW, this is why airliners, despite their advanced computerized cockpit, still have an old fashioned turn-and-bank indicator that is independent of all that software.

replies(1): >>45047734 #
6. bombcar ◴[] No.45047734{3}[source]
Failsafe design is actually really fun when you start looking at all the scenarios and such.

But one key component is that IF a failsafe is triggered, it needs to be investigated as if it killed someone; because it should NEVER have triggered.

Without that part of the cycle, eventually the failsafe is removed or bypassed or otherwise ineffective, and the next incident will get you.

replies(1): >>45050906 #
7. WalterBright ◴[] No.45050906{4}[source]
Most airplane crashes are due to multiple failures. The accidents are investigated, and each failure is addressed and fixed.

The result is incredible safety.

replies(1): >>45053876 #
8. bombcar ◴[] No.45053876{5}[source]
People know about that; what they forget about is that any failure is noted and repaired (or deemed serviceable until repair).

Airplane reliability is from lots of failure analysis and work but also comprehensive maintenance plans and procedures.

9. I_dream_of_Geni ◴[] No.45120124{3}[source]
To be fair, reps don't really know anything deep about their product. They just parrot what they are told (or they wing it, which, i guess, can be lying). They are pushed to sell, and they will say anything to sell.