
449 points by lemper | 106 comments
1. benrutter ◴[] No.45036836[source]
> software quality doesn't appear because you have good developers. It's the end result of a process, and that process informs both your software development practices, but also your testing. Your management. Even your sales and servicing.

If you only take one thing away from this article, it should be this one! The Therac-25 incident is a horrifying and important part of software history; it's really easy to think type systems, unit testing and defensive coding can solve all software problems. They definitely can help a lot, but the real failure in the story of the Therac-25, from my understanding, is that it took far too long for incidents to be reported, investigated and fixed.

There was a great Cautionary Tales podcast about the device recently[0]. One thing mentioned was that, even aside from the catastrophic accidents, Therac-25 machines were routinely seen by users to show unexplained errors, but these issues never made it to the desk of someone who might fix them.

[0] https://timharford.com/2025/07/cautionary-tales-captain-kirk...

replies(13): >>45036898 #>>45037054 #>>45037090 #>>45037874 #>>45038109 #>>45038360 #>>45038467 #>>45038827 #>>45043421 #>>45044645 #>>45046867 #>>45046969 #>>45047517 #
2. AdamN ◴[] No.45036898[source]
This is true, but there also need to be good developers. It can't just be great process and low-quality developer practices. There needs to be: 1/ high-quality individual processes (development being one of them), 2/ high-quality delivery mechanisms, 3/ feedback loops to improve that quality, 4/ out-of-band mechanisms to inspect and improve the quality.
replies(1): >>45037053 #
3. Fr3dd1 ◴[] No.45037053[source]
I would argue that a good process always has a good self-correction mechanism built in. This way, the work done by a "low quality" software developer (which includes almost all of us at some point in time) is always taken into account by the process.
replies(6): >>45037082 #>>45037902 #>>45037927 #>>45038864 #>>45045154 #>>45050022 #
4. ◴[] No.45037054[source]
5. quietbritishjim ◴[] No.45037082{3}[source]
Right, but if everyone is low quality then there's no one to do that correction.

That may seem a bit hypothetical but it can easily happen if you have a company that systematically underpays, which I'm sure many of us don't need to think hard to imagine, in which case they will systematically hire poor developers (because those are the only ones that ever applied).

replies(3): >>45037428 #>>45037524 #>>45038431 #
6. vorgol ◴[] No.45037090[source]
I was going to recommend that exact podcast episode but you beat me to it. Totally worth listening, especially if you're interested in software bugs.

Another interesting fact mentioned in the podcast is that the earlier (manually operated) version of the machine did have the same fault. But it also had a failsafe fuse that blew so the fault never materialized. Excellent demonstration of the Swiss Cheese Model: https://en.wikipedia.org/wiki/Swiss_cheese_model

replies(2): >>45038446 #>>45042423 #
7. ZaoLahma ◴[] No.45037428{4}[source]
Replace the "hire poor developers" with "use LLM driven development", and you have the rough outline for a perfect Software Engineering horror movie.

It used to be that the poor performers (dangerous hip-shootin', code-committin' cowpokes) were limited in the amount of code they could produce per unit of time, leaving enough time for others to correct course. Now the cowpokes are producing ridiculous amounts of code that you just can't keep up with.

replies(2): >>45045739 #>>45050049 #
8. anal_reactor ◴[] No.45037524{4}[source]
The sad truth is that the average dev is average, but it's not polite to say this out loud. This is particularly important at scale - when you are big tech, at some point you hit a wall and no matter how much you pay you can't attract any more good devs, simply because all the good devs are already hired. This means that corporate processes must be tailored for the average dev, and exceptional devs can only exist in start-ups (or hermetically closed departments). The side effect is that the whole job market promotes the skill of fitting into a corporate environment over the skill of programming. So as a junior dev, it makes much more sense for me to learn how to promote my visibility during useless meetings than to learn a new technology. And that's how the bar keeps getting lower.
replies(2): >>45038943 #>>45040412 #
9. sonicggg ◴[] No.45037874[source]
Not sure why the article is focusing so much on software development. That was just a piece of the problem. The entire product had design flaws. When the FDA got involved, the company wasn't just told to make software updates.
replies(1): >>45038272 #
10. varjag ◴[] No.45037902{3}[source]
My takeaway from observing different teams over the years is that talent is, by a huge margin, the most important component. Throw a team of A performers together and it really doesn't matter what process you make them jump through. This is how a waterfall team got mankind to the Moon with hand-woven core memory, while an agile team 10x the size can't fix the software for a family car.
replies(1): >>45038557 #
11. rcxdude ◴[] No.45037927{3}[source]
This only works with enough good developers involved in the process. I've seen how the sausage is made, and code quality is often shockingly low in these applications, just in ways that don't set off the metrics (or they do, but they can bend the process to wave them away). Also, the process often makes it very hard to fix latent problems in the software, so it rarely gets better over time, either.
12. speed_spread ◴[] No.45038272[source]
Yet it doesn't take much to swamp a team of good developers. A poorly defined project, mismatched requirements, sent to production too early and then put in support mode with no time planned to plug the holes... There's only so much smart technicians can do when the organization is broken.
13. 0xDEAFBEAD ◴[] No.45038360[source]
Honestly I wish instead of the Therac-25, we were discussing a system which made use of unit testing and defensive coding, yet still failed. That would be more educational. It's too easy to look at the Therac-25 and think "I would never write a mess like that".
replies(5): >>45038635 #>>45038899 #>>45042566 #>>45044920 #>>45046431 #
14. pjmlp ◴[] No.45038431{4}[source]
The correction is done by the "lucky" souls doing the onsite, customer facing roles, for the offshoring delivery. Experience from a friend....
15. bell-cot ◴[] No.45038446[source]
>> the real failure in the story of the Therac-25 from my understanding, is that it took far too long for incidents to be reported, investigated and fixed.

> the earlier (manually operated) version of the machine did have the same fault. But it also had a failsafe fuse that blew so the fault never materialized.

#1 virtue of electromechanical failsafes is that their conception, design, implementation, and failure modes tend to be orthogonal to those of the software. One of the biggest shortcomings of Swiss Cheese safety thinking is that you too often end up using "neighbor slices from the same wheel of cheese".

#2 virtue of electromechanical failsafes is that running into them (the fuse blew, or whatever) is usually more difficult for humans to ignore. Or at least it's easier to create processes and do training that actually gets the errors reported up the chain. (Compared to software - where the worker bees all know you gotta "ignore, click 'OK', retry, reboot" all the time, if you actually want to get anything done.)

But, sadly, electromechanical failsafes are far more expensive than "we'll just add some code to check that" optimism. And PHBs all know that picking up nickels in front of the steamroller is how you get to the C-suite.

replies(2): >>45042868 #>>45044729 #
16. pjmlp ◴[] No.45038467[source]
The worst part is that many developers think that, by not working with high-integrity systems, such quality levels don't apply to them.

Wrong: any software failure can have huge consequences for someone's life, or for a company, by preventing some critical flow from taking place, corrupting data related to someone's life or professional or medical record, preventing a payment for some specific goods that had to be acquired at that moment or never,....

replies(1): >>45038893 #
17. scott_w ◴[] No.45038557{4}[source]
You conflated, misrepresented and simply ignored so many things in your statement that I really don’t know where to start rebutting it. I’d say at least compare SpaceX to NASA with space exploration but, even then, I doubt you have anywhere near enough knowledge of both programmes to be able to properly analyse, compare and contrast to back up your claim. Hell, do you even know if SpaceX or Tesla are even using an agile methodology for their system development? I know I don’t.

That’s not to say talent is unimportant, however, I’d need to see some real examples of high talent, no process, teams compared to low talent, high process, teams, then some mixture of the groups to make a fair statement. Even then, how do you measure talent? I think I’m talented but I wouldn’t be surprised to learn others think I’m an imbecile who only knows Python!

replies(1): >>45038922 #
18. wat10000 ◴[] No.45038635[source]
The lesson is not to write a mess like that. It might seem obvious, but it has to be learned.
replies(1): >>45044742 #
19. ChrisMarshallNY ◴[] No.45038827[source]
I worked for a company that manufactured some of the highest-Quality photographic and scientific equipment that you can buy. It was expensive as hell, but our customers seemed to think it was worth it.

> It's the end result of a process

In my experience, it's even more than that. It's a culture.

replies(2): >>45038936 #>>45042622 #
20. franktankbank ◴[] No.45038864{3}[source]
The process that makes this work would be onerous to create. Do you think you could do this to make a low-quality machinist able to build a high-quality technical part? What would that look like? Quite a lot like machine code, which doesn't really reduce the requirements, does it? It just shifts the onerous requirement somewhere else.
21. ozim ◴[] No.45038893[source]
Hey don’t blame developers.

It is the business that requests features ASAP to cut costs, and then there are customers who don’t want to pay for „ideal software” but rather want every piece of software for free.

Most devs and QA workers I know want to deliver best quality software and usually are gold plating stuff anyway.

replies(2): >>45039133 #>>45050007 #
22. roeles ◴[] No.45038899[source]
One instance that crosses my mind often is the Airbus A320 incident at Hamburg in 2008. Everything was done right there, but the requirements were wrong.

Despite all the procedures and tests, the software still managed to endanger the lives of the passengers.

replies(3): >>45039042 #>>45045268 #>>45052435 #
23. varjag ◴[] No.45038922{5}[source]
> Hell, do you even know if SpaceX or Tesla are even using an agile methodology for their system development?

What I've been saying is methodology is mostly irrelevant, not that waterfall is specifically better than agile. Talent wins over the process but I can see how this idea is controversial.

> I’d need to see some real examples of high talent, no process, teams compared to low talent, high process, teams, then some mixture of the groups to make a fair statement. Even then, how do you measure talent?

Yep, even if I made it my life's mission to run a formal study on programmer productivity (which I clearly won't) that wouldn't save the argument from nitpicking.

replies(1): >>45040483 #
24. franktankbank ◴[] No.45038936[source]
A culture of high-quality engineering, no doubt. Made up of: high quality engineers!
replies(4): >>45039171 #>>45039214 #>>45040858 #>>45044538 #
25. ozim ◴[] No.45038943{5}[source]
Huh, sad truth?

But the average construction worker is also average, and so is the average doctor.

The world cannot run on „best of the best” - just wrap your head around the fact that the whole economy and all human activities are run by average people doing average stuff.

26. 0xDEAFBEAD ◴[] No.45039042{3}[source]
Interesting, do you happen to have a case study?
replies(2): >>45039341 #>>45039438 #
27. pjmlp ◴[] No.45039133{3}[source]
Being a real Software Engineer - one of those who actually hold the proper title, possibly with the final examination - means being able to deliver the best product within the set of given constraints.

Also, speaking out when the train is visibly heading into a wall.

replies(2): >>45040598 #>>45041357 #
28. ChrisMarshallNY ◴[] No.45039171{3}[source]
Yes, but some of them were the most stubborn bastards I've ever worked with.
replies(2): >>45039449 #>>45042921 #
29. anonymars ◴[] No.45039214{3}[source]
Isn't that exactly the opposite of the point being made?

> software quality doesn't appear because you have good developers

replies(1): >>45039274 #
30. ChrisMarshallNY ◴[] No.45039274{4}[source]
Not really.

Good developers are a necessary ingredient of a much larger recipe.

People think that a good process means you can toss in crap developers, or that great developers mean that you can have a bad process.

In my experience, I worked for a 100-year-old Japanese engineering company that had a decades-long culture of Quality. People stayed at that company for their entire career, and most of them were top-shelf people. They had entire business units, dedicated to process improvement and QA.

It was a combination of good talent, good process, and good culture. If any one of them sucks, so does the product.

replies(1): >>45040960 #
31. ◴[] No.45039341{4}[source]
32. shmeeed ◴[] No.45039438{4}[source]
https://skybrary.aero/sites/default/files/bookshelf/1258.pdf
33. franktankbank ◴[] No.45039449{4}[source]
That's high praise I'm sure.
replies(1): >>45039581 #
34. ChrisMarshallNY ◴[] No.45039581{5}[source]
The results spoke for themselves, but it could be maddening.
35. orochimaaru ◴[] No.45040412{5}[source]
Learning new technologies wasn’t the issue with the Therac. In fact, as someone who has been coding and leading SW engineering teams for the past 28 years, I don’t like “new technologies”. When someone builds this awesome complicated async state machine using a large set of brittle components, alarm bells go off and I make it my life’s mission to make it as simple as it needs to be.

A lot of times that is boring meetings to discuss the simplification.

I can extend the same analogy to all the gen ai bs that’s floating around right now as well.

36. scott_w ◴[] No.45040483{6}[source]
Except your only example was nonsensical on the face of it.

> Yep, even if I made it my life's mission to run a formal study on programmer productivity (which I clearly won't) that wouldn't save the argument from nitpicking.

I didn't ask for this, I just asked for sensible examples, either from your experience or from publicly available information.

37. bobmcnamara ◴[] No.45040598{4}[source]
In my country those are a myth. We had it as a professional engineering classification for a while but I'm not sure if anyone ever completed it. They cancelled it several years ago.
replies(1): >>45040802 #
38. pjmlp ◴[] No.45040802{5}[source]
And in the US anyone can call themselves whatever they feel like.

If you want professional quality, we're the first line of actually making it happen; blaming others won't change anything.

replies(1): >>45046787 #
39. herval ◴[] No.45040858{3}[source]
You don't need "high quality engineers" to have high-quality outputs. And vice versa - lots of places with very high-quality engineers produce terribly low-quality software.
replies(3): >>45041010 #>>45046913 #>>45049974 #
40. ChrisMarshallNY ◴[] No.45040960{5}[source]
It really is funny how any discussion of improving software quality becomes unpopular, here.
replies(1): >>45045346 #
41. ChrisMarshallNY ◴[] No.45041010{4}[source]
I guess we see things differently.

They don't need to be especially talented engineers, but, in my experience (and I actually have quite a bit of it, in this area), they need to be dedicated to a culture of Quality.

And it is entirely possible for very talented engineers to produce shite. I've seen exactly that.

replies(1): >>45047204 #
42. ozim ◴[] No.45041357{4}[source]
You mean like those in the article:

https://edition.cnn.com/2025/08/27/us/alaska-f-35-crash-acci...

replies(1): >>45049012 #
43. ipython ◴[] No.45042423[source]
Don’t worry, we are poised to relearn all these lessons once again with our fancy new agentic generative AI systems.

The mechanical interlock essentially functioned as a limit outside of the control system. You should build an AI system the same way - enforcing security restrictions on the agent from outside the control of the AI itself. Of course that doesn’t happen, and devs naively trust that the AI can make its own security decisions.

Another lesson from that era we are relearning: in-band signaling. Our 2025 version of the “blue box” is in full swing. Prompt injection is just a side effect of the fact that there is no out-of-band instruction mechanism for LLMs.

The good news is: it’s not hard to learn the new technology when it’s just a matter of rediscovering the same security issues with a new name!
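
To make the out-of-band point concrete, here's a minimal sketch (hypothetical tool names and limits, not any real framework's API): the allow-list and argument checks run outside the model, so a prompt-injected request is just data to be rejected, not an instruction to be obeyed.

    # Hypothetical sketch: the policy lives outside the model's control.
    # The model can be injected into *asking* for anything; it cannot
    # change what this wrapper will actually execute.
    ALLOWED_TOOLS = {"search_docs", "read_ticket"}   # no delete_*, no shell

    def execute_tool_call(tool_name, args, run_tool):
        if tool_name not in ALLOWED_TOOLS:
            return {"error": f"tool '{tool_name}' denied by policy"}
        if any(len(str(v)) > 4096 for v in args.values()):
            return {"error": "argument too large, denied by policy"}
        return run_tool(tool_name, args)             # only reached if policy passes

    # A (possibly injected) model request is checked like any other input:
    print(execute_tool_call("delete_all_records", {}, lambda n, a: "ran"))
    print(execute_tool_call("search_docs", {"query": "therac"}, lambda n, a: "ran"))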

44. hinkley ◴[] No.45042566[source]
I bring up Knight Capital every time people start acting like feature toggles will solve every problem we have with feature rollout.

KC lost over $400 million in less than an hour due to an old feature toggle and a problem with their deployment process.
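
Not Knight's actual code, obviously, but a toy sketch of the general failure mode: a retired flag name gets repurposed, so after a partial deployment the same toggle drives new code on most servers and years-old dead code on the one server the deploy missed.

    # Illustrative only: behaviour of one flag now depends on which build a
    # given server runs after an incomplete rollout.
    def run_legacy_power_peg(order):
        return f"LEGACY test-harness path for {order}"   # dead code, still shipped

    def route_via_rlp(order):
        return f"new RLP routing for {order}"

    def handle_order(order, flags, has_new_build):
        if flags.get("power_peg"):                       # old flag name, reused
            if has_new_build:
                return route_via_rlp(order)
            return run_legacy_power_peg(order)           # the server the deploy missed
        return f"normal routing for {order}"

    # Turning the flag on fleet-wide does very different things per server:
    print(handle_order("ORD-1", {"power_peg": True}, has_new_build=True))
    print(handle_order("ORD-1", {"power_peg": True}, has_new_build=False))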

45. f1shy ◴[] No.45042622[source]
It is a culture, no doubt, and certainly not only processes. I work in a company where there are literally processes for everything. For every engineer doing actual work (requirements engineering, architecture, coding, testing) there are at least 3 doing processes. The SW we make is the ultimate piece of shit: late, expensive, and full to the brim with bugs. So process is important, but good engineering and culture are too.
46. snerbles ◴[] No.45042868{3}[source]
When I worked at an industrial integrator, we had a hard requirement for hard-wired e-stop circuits run by safety relays separate from the PLC. Sometimes we had to deal with dangerous OEM equipment that had software interlocks, and the solution was usually just to power the entire offending device down when someone hit an e-stop or opened a guarding panel.

About a decade ago a rep from Videojet straight up lied to us about their 30W CO2 marking laser having a hardware interlock. We found out when - in true Therac-25 fashion - the laser kept triggering despite the external e-stop being active due to a bug in their HMI touch panel. No one noticed until it eventually burned through the lens cap. In reality the interlock was a separate kit, and they left it out to reduce the cost for their bid to the customer. That whole incident really soured my opinion of them and reminded me of just how bad software "safety" can get.

replies(2): >>45044753 #>>45120124 #
47. snerbles ◴[] No.45042921{4}[source]
They have to be.
48. chairmansteve ◴[] No.45043421[source]
I would say the real lesson is that the Therac machine should have had hardware interlocks (mentioned but not emphasised in the article).
49. kulahan ◴[] No.45044538{3}[source]
Unfortunately, software developers are the absolute most offensive use of the word "engineer", because 99.9% of the stuff this field makes is a competition to take the most unique approach to a solution and then bandage it together with gum and paperclips.

If this industry wants to be respected, it should start trying to be actual engineers. There should be tons and tons of standards which are enforced legally, but this is not often the case. Imagine if there were no real legal guardrails in, say, bridge building!

edit: and imagine if any time you brought up this issue, bridge builders cockily responded with "well stuff seems to work fine so..."

replies(5): >>45044851 #>>45045578 #>>45048657 #>>45049963 #>>45120141 #
50. WalterBright ◴[] No.45044645[source]
I'm going to disagree.

I have years of experience at Boeing designing aircraft parts. The guiding principle is that no single failure should cause an accident.

The way to accomplish this is not "write quality software", nor is it "test the software thoroughly". The idea is "assume the software does the worst possible thing. Then make sure that there's an independent system that will prevent that worst case."

For the Therac-25, that means a detector of the amount of radiation being generated, which will cut it off if it exceeds a safe value. I'd also add that the radiation generator be physically incapable of generating excessive radiation.
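
A minimal sketch of that separation of concerns, with made-up names rather than any real device interface: the interlock acts only on its own measurement and a hard cutoff, never on what the treatment software believes it commanded.

    # Hypothetical sketch, not a real device API: the watchdog trusts only
    # its own sensor and a physical beam cutoff.
    MAX_SAFE_DOSE_RATE = 1.0          # in whatever unit the chamber reports

    class DoseInterlock:
        def __init__(self, read_dose, open_beam_relay):
            self.read_dose = read_dose              # independent ionization chamber
            self.open_beam_relay = open_beam_relay  # hard cutoff, not a software flag
            self.tripped = False

        def poll(self):
            if self.tripped:
                return
            if self.read_dose() > MAX_SAFE_DOSE_RATE:
                self.open_beam_relay()            # cut the beam regardless of what
                self.tripped = True               # the control software commanded,
                                                  # and latch until a human resets it

    # The control software "asked" for a safe dose; the chamber disagrees:
    interlock = DoseInterlock(read_dose=lambda: 40.0,
                              open_beam_relay=lambda: print("beam relay opened"))
    interlock.poll()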

replies(9): >>45045090 #>>45045473 #>>45046078 #>>45046192 #>>45047920 #>>45048437 #>>45048717 #>>45049878 #>>45049910 #
51. WalterBright ◴[] No.45044729{3}[source]
> And PHBs all know that picking up nickels in front of the steamroller is how you get to the C-suite.

Blaming it on PHBs is a mistake. There were no engineering classes in my degree program about failsafe design. I've known too many engineers who were insulted by my insinuations that their design had unacceptable failure modes. They thought they could write software that couldn't possibly fail. They'd also tell me that they could safely recover and continue executing a crashed program.

This is why I never, ever trust software switches to disable a microphone, software switches that disable disk writes, etc. The world is full of software bugs that enable overriding of their soft protections.

BTW, this is why airliners, despite their advanced computerized cockpit, still have an old fashioned turn-and-bank indicator that is independent of all that software.

replies(1): >>45047734 #
52. kccqzy ◴[] No.45044742{3}[source]
Software engineering has advanced enough in the past few decades that the kind of code considered a "mess" has expanded.
replies(1): >>45047468 #
53. jopsen ◴[] No.45044920[source]
I'd agree; it's super easy to think such errors wouldn't happen had they just used a fairly safe language and a sane architecture. Or unit tests, race detectors, etc.

I suspect that few organizations that do all that have a process/culture of ignoring bugs in the wild - and those that do have such complicated domains that explaining the error is hard.

Software best practices today would probably also involve sending metrics, logs, error reports, etc.

That said, it's still extremely easy to embrace a culture where unexplainable errors are ignored. Especially in a cloud environment.

54. vjvjvjvjghv ◴[] No.45045090[source]
In general I agree but there is bit more complexity. I work in medical devices and there are plenty of situations where a certain output is ok in some circumstance but deadly in another. That makes a stopgap a little more tricky.

I agree with the previous poster that the feedback from the field is lacking a lot. A lot of doctors don’t report problems back because they are used to bad interfaces. And then the feedback gets filtered through several layers of sales reps and product management. So a lot of info gets lost and fixes that could be simple won’t get done.

In general when you work in medical you are so overwhelmed by documentation and regulation that there isn’t much time left to do proper engineering. The FDA mostly looks at documentation done right and less at product done right.

replies(2): >>45045452 #>>45045655 #
55. vjvjvjvjghv ◴[] No.45045154{3}[source]
“ This way, the work done by a "low quality" software developer (this includes almost all of us at some point in time), is always taken into account by the process”

That’s a horrible take. There is no amount of reviews, guidelines and documentation that can compensate for low quality devs. You can’t throw garbage into the pipeline and then somehow process it to gold.

56. I_dream_of_Geni ◴[] No.45045268{3}[source]
Speaking of Airbus, they 'lost' 3-4 different aircraft (from 1988 to 2015) which crashed during development or, spectacularly, during their first airshow. It never slowed down their customers at ALL, and to this day Boeing has never lost a new commercial airliner in those same circumstances. Yet Boeing gets all the hate. smh
57. ponector ◴[] No.45045346{6}[source]
Not only here. Everyone wants to use quality products, but almost no one is committed to delivering quality products/services.

No one is working on quality; everyone works on new features. There is usually no incentive to increase quality, improve speed, performance, etc.

58. WalterBright ◴[] No.45045452{3}[source]
At Boeing there's a required "failure analysis" document listing all the failure modes and why they won't cause a crash by themselves.
replies(2): >>45048051 #>>45049126 #
59. graypegg ◴[] No.45045473[source]
I think the range of radiation dose might vary too much to make a radiation source a totally isolated system, but trying to keep it as a simple physical lockout, I could imagine part of the start up process involving inserting a small module containing a fuse that breaks at a certain current that could be swapped out for different therapies or something. Could even add a simple spring+electromagnet mechanism that kicks that module out when power gets cut so radiotechs have to at least acknowledge the fuse before start up each time.

I will say that me pretending to know how to best design medical equipment as a web developer is pretty full of myself haha. Highly doubt whatever I'm spouting is a new idea. The idea of working on this sort of high-reliability + high-recoverability systems seems really interesting though!

60. ChrisMarshallNY ◴[] No.45045578{4}[source]
Yeah, this is a fairly classic challenge.

When it comes to safety stuff (like bridge building), there are (and should be) strict licensing requirements. I would have no problem requiring such for work on things like medical equipment. We already require security clearance for things like defense information (unless you're a DOGE bro, I guess). That's a bit different from engineering creds, but it's an example of imposed structure.

But I think that it would be ridiculous to require it for someone that writes a fart app (unless it's a weaponized fart app).

What is in those requirements then becomes a hot potato. There are folks that would insist that any "engineer" be required to know how to use a slide rule, and would ignore modern constructs like LLMs and ML.

I'm not kidding. I know people exactly like that. If they get authority, watch out. They'll only "approve" stuff that they are good at.

On the other hand, if the requirements are LeetCode, then it's useless. A lot of very unqualified people would easily pass, and wreak havoc.

From what I can see, the IEEE seems to have a fairly good grasp on mixing classic structure and current tech. There's some very good people, there, and they are used to working in a structured manner.

But software has developed a YOLO culture. People are used to having almost no structure, and they flit between organizations so rapidly, that it's almost impossible to keep track of who is working on what.

The entire engineering culture needs to be changed. I don't see that being something that will come easily.

I'm big on Structure and Discipline. A lot of it has to do with almost 27 years at a corporation with so much structure that a lot of folks here, would be whimpering under their standing desks.

That structure was required, in order to develop equipment of the Quality they are famous for, but would be total overkill for a lot of stuff.

I do think that we need to impose structure on software supply chains, though. That's not something that will be a popular stance.

Structure is also not cheap. Someone needs to pay for it, and that's when you become a real skunk at the picnic.

61. darepublic ◴[] No.45045655{3}[source]
16,000-25,000 rads, right? Not safe under any circumstance?
replies(1): >>45046950 #
62. ponector ◴[] No.45045739{5}[source]
Isn't that the future we are moving toward? Hiring poor (cheap) developers to do LLM-driven development.
63. philjohn ◴[] No.45046078[source]
This.

One of the biggest things I see in junior engineers that I mentor (working in backend high throughput, low latency, distributed systems) is not working out all of the various failure modes your system will likely encounter.

Network partitions, primary database outage, caching layer outage, increased latency ... all of these things can throw a spanner in the works, but until you've experienced them (or had a strong mentor guide you) it's all abstract and difficult to see when the happy path is right there.

I've recently entirely re-architected a critical component, and part of this was defense in depth. Stuff is going to go wrong, so having a second or even third line of defense is important.
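
As a toy illustration of that "second or even third line of defense" on a read path (made-up components, not a recommendation of specific tools): a bounded call to the primary, a fallback to a possibly-stale cache, then a degraded default, so one dependency failure doesn't become a user-visible outage.

    # Toy sketch: each layer assumes the one before it can fail.
    class FlakyPrimary:
        def __init__(self, healthy=True):
            self.healthy = healthy
        def query(self, key):
            if not self.healthy:
                raise TimeoutError("primary unavailable")
            return {"sku-1": 9.99}[key]

    def get_price(product_id, primary, cache):
        try:
            price = primary.query(product_id)    # first line: a bounded call
            cache[product_id] = price            # keep the fallback warm
            return price, "fresh"
        except Exception:
            if product_id in cache:              # second line: serve stale data
                return cache[product_id], "stale"
            return None, "degraded"              # third line: degrade, don't crash

    cache = {}
    print(get_price("sku-1", FlakyPrimary(healthy=True), cache))    # (9.99, 'fresh')
    print(get_price("sku-1", FlakyPrimary(healthy=False), cache))   # (9.99, 'stale')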

replies(2): >>45047190 #>>45048063 #
64. layman51 ◴[] No.45046192[source]
I might also add that apparently, older versions of the machine had physical “hardware interlocks” that would make accidents less likely no matter what the software was doing. So the older software was probably just thought to be reliable, but it had a physical mechanism that was helping it to not kill someone. On a less serious note that’s part of why car doors might still have keyholes even if normally they open in a fancy way with electronic fobs.
65. jldugger ◴[] No.45046431[source]
Perhaps this is why my software correctness book in undergrad used a series of stills from the Ariane 5 disaster[1] for the cover.

[1]: https://en.wikipedia.org/wiki/Ariane_5#Notable_launches

66. bobmcnamara ◴[] No.45046787{6}[source]
Turns out there could be up to 81 US professional software engineers!

https://ncees.org/ncees-discontinuing-pe-software-engineerin...

67. credit_guy ◴[] No.45046867[source]
I'm not sure. Most software (by orders of magnitude) is not critical software like the software running that X-ray machine. In general, if your software fails, a page loads too slow, or a report comes with lots of NaN's, or some batch job does not run at the right time, and someone needs to start it manually, etc. The cases where someone dies because of a software quality issue are very rare, and the developers working on that type of software know who they are and what their duties are (I hope).
68. fesenjoonior ◴[] No.45046913{4}[source]
> you don't need "high quality engineers" to have high-quality outputs.

[citation needed]

replies(1): >>45047189 #
69. sgerenser ◴[] No.45046950{4}[source]
Completely safe as long as the block of metal was in place. So you couldn’t just prevent the machine from putting out that much energy, you had to prevent it from doing that without the block in place.
replies(1): >>45048706 #
70. rowanG077 ◴[] No.45046969[source]
I think the opposite. The only reason software quality emerges is because of good developers. It's a prerequisite. Process helps good developers deliver quality, but there is no process that allows a team of bad developers to deliver quality. You can't squeeze blood from a stone.
replies(1): >>45050179 #
71. herval ◴[] No.45047189{5}[source]
37signals and craigslist are two examples (and they’re open about their engineering being sub-par). If you consider FB products “high quality”, it’s another example (the average FB developer is anything but “high quality”, by most definitions). Palantir is another example, with a horde of junior engineers and famous for bad practices (yet here they are commanding the US military). And so on and so on. The inverse is also true - plenty of stellar teams producing irrelevant or low impact products.
replies(1): >>45048619 #
72. ◴[] No.45047190{3}[source]
73. herval ◴[] No.45047204{5}[source]
A culture of quality doesn’t require particularly skilled individuals to function.

That’s in fact the thesis for the entire Deming management philosophy, and in line with what I’m saying (you can produce high quality with a good process or a good culture, you don’t necessarily need high caliber individuals)

replies(1): >>45048112 #
74. wat10000 ◴[] No.45047468{4}[source]
We’ve invented entirely new ways to write bad code.
replies(1): >>45050181 #
75. msy ◴[] No.45047517[source]
I couldn't disagree more. Outside of exotic scenarios where things like formal proofing are possible and economically viable I've never seen a process that prevents bugs, only culture. Good engineering cultures are also often ones with well defined and tested processes and good testing practices but it's the culture and people giving a shit that makes the difference, not the other way around.

Good product cultures are ones where natural communication between the field and engineering would mean issues get reported back up and make their way to the right people. No process will compensate for people not giving a shit.

replies(1): >>45050134 #
76. bombcar ◴[] No.45047734{4}[source]
Failsafe design is actually really fun when you start looking at all the scenarios and such.

But one key component is that IF a failsafe is triggered, it needs to be investigated as if it killed someone; because it should NEVER have triggered.

Without that part of the cycle, eventually the failsafe is removed or bypassed or otherwise ineffective, and the next incident will get you.

replies(1): >>45050906 #
77. technofiend ◴[] No.45048063{3}[source]
I recently had to argue a junior into leaving the health check frequency alone on an ECS container: the regular log entries annoyed her, and since she didn't know how to filter logs, her solution was to take health checks down to once every five minutes. Just one example of trying to talk to people about the unhappy path.
replies(1): >>45048985 #
78. ChrisMarshallNY ◴[] No.45048112{6}[source]
Japan used a lot of Deming’s theories, to significant success. I worked for a Japanese company.

In my case, the company produced absolutely top-shelf stuff, but even relatively mediocre companies did well, using Deming’s techniques. It required that everyone be on board, wrt the culture, though.

But I have found that a “good” engineer is one that takes their vocation seriously. They may not be that accomplished or skilled, but they have self-discipline, humility, and structure.

I’ve met quite a few highly-skilled “not-good” engineers, in my day. I’m embarrassed to say that I’ve hired some of them.

79. jonahx ◴[] No.45048437[source]
That makes sense. But wouldn't the "write quality software" and "test the software thoroughly" still be relevant to the individual pieces? If the chance of a catastrophic failure is the product of the failure rates of the pieces, getting P(PartFail) low helps too -- even if having multiple backups is the main source of protection.
80. jeltz ◴[] No.45048619{6}[source]
FB products are low quality as is 37 signals. I don't think you necessarily need geniuses to build good products but your examples are bad.
81. waste_monk ◴[] No.45048657{4}[source]
My university offered (not sure if it still exists) a software engineer degree that was part of the school of engineering, had a mix of courses from the school of engineering and school of math and IT, and at the end of it you would be eligible to take the local equivalent of the PE exam and become a Real Engineer (I think technically you would be in the field of mechatronics).

IMO this should be the standard - software engineer should be a protected title, and everyone else would be titled some flavour of software developer or similar.

replies(1): >>45071172 #
82. Gud ◴[] No.45048706{5}[source]
So there should have been an interlocking system.
replies(1): >>45049829 #
83. fulafel ◴[] No.45048717[source]
The GP didn't propose processes in the sw engineering part though but "the real failure in the story of the Therac-25 from my understanding, is that it took far too long for incidents to be reported, investigated and fixed"
84. dwedge ◴[] No.45048985{4}[source]
That sounds more like a disaster waiting to happen than a junior. I find it difficult to believe that she didn't know the purpose of the healthcheck, so it sounds like breaking (someone else's problem) instead of addressing gaps in ability
replies(1): >>45049918 #
85. pjmlp ◴[] No.45049012{5}[source]
Yeah, and I bet responsibilities will be taken care of, unlike in most software projects.
86. Aloha ◴[] No.45049126{4}[source]
Agreed - this is essentially the cornerstone of systems failure analysis, something I wish architects thought about more in the software space.

I'm a product manager for an old (and, if I'm being honest, somewhat crusty) system of software. The software is buggy - all of it is - but it's also self-healing and resilient. So while, yes, it fails with somewhat alarming regularity, with lots and lots of concerning-looking error messages in the logs, it never causes an outage because it self-heals.

Good systems design isn't making bug-free software or a bug-free system, but rather a system where a total outage requires N+1 (maybe even N+N) things to fail before the end user notices. Failures should be driven by, at most, edge cases - basically where the system is being operated outside of its design parameters - and those parameters need to reflect the real world and be known by most stakeholders in the system.

My gripe with software engineers sometimes is that they're often too divorced from real users and real use cases, and too devoted to the written spec over what their users actually need to do with the software - I've seen some very elegant (and, on paper, well-designed) systems fall apart because of simple things like intermittent packet jitter, or latency swings (say between 10ms and 70ms) - these are real-world conditions, often encountered by real-world systems, but these spec-driven systems fall apart once confronted with reality.

87. tech2 ◴[] No.45049829{6}[source]
The earlier model that the 25 replaced was all mechanically interlocked. The belief was that software provided that same level of assurance. They performed manual testing but what they weren't able to do was reach a level of speed and fluency with the system to result in the failure modes which caused the issues. Lower hardware costs equals higher profit...
88. benrutter ◴[] No.45049878[source]
It's not that I don't think that's important, but I think with failure you always have an issue around needing N+1 checks (please don't take this as an argument against checks though).

The Therac-25 was meant to have a detector of radiation levels to cut things off if a safe value was exceeded, but it didn't work. It could obviously have been improved, but you always have the possibility that "what if our check doesn't work?".

In the case of the Therac-25, if the first initial failures had been reported and investigated, my understanding is (I should make clear I'm not an expert here) it would have made the issues apparent, and it could have been recalled before any of the fatal incidents happened.

In a Swiss cheese model of risk, you always want as many layers as possible, so your point about a detector fits in there, but the final layer should always be: if an incident does happen and something gets past all our checks, how can we make it likely that it gets investigated fully by the right person?

89. Cthulhu_ ◴[] No.45049910[source]
Great point. Earlier in my career - and I think many will recognize this - I was very diligent; thorough types, unit tests, defensive programming, assertions at one point, the works.

But this opens up a can of worms, as suddenly you have to deal with every edge case, test for every possible input, etc. This was before fuzz testing, too. Each line of defensive coding, every carefully crafted comment, all added to the maintenance burden; I'd even go as far as to claim it increased uncertainty, because what if I forgot something?

15 years later and it feels like I'm doing far less advanced stuff (although in hindsight what I did then wasn't all that, but I made it advanced). One issue came up recently; a generic button component would render really tall if no label was given, which happened when a CMS editor did not fill in a label in an attempt to hide it. The knee-jerk response would be to add a check that disallows empty labels, or to not render the button if no label is given, or to use a default button label.

But now I think I'll look at the rendering bug and just... leave the rest. A button with an empty label isn't catastrophic. Writing rules for every possible edge case (empty label, whitespaces, UTF-8 characters escaping the bounds, too long text, too short text, non-text, the list goes on) just adds maintenance and complexity. And it's just a button.

90. Cthulhu_ ◴[] No.45049918{5}[source]
The junior part there is that this person still believes they can / should read and comprehend all logs themselves. This just isn't viable at scale.

But same with code itself, a junior will have code that is "theirs", a medior/senior will (likely) work at scales where they can't keep it all in their heads. And that's when all the software development best practices come into play.

91. Cthulhu_ ◴[] No.45049963{4}[source]
There are tons and tons of standards, some of which are enforced legally - you can't just supply software to governments, the military, banks, companies, etc. without certain certifications like ISO 9001, ISO/IEC 27001, etc.

Now, I'm not an engineer, nor at all aware of what these standards actually mean; I'm sure they're pretty common sense and nowhere near as detailed as bridge-building standards.

92. Cthulhu_ ◴[] No.45049974{4}[source]
The article and GP mention this as well in a roundabout fashion; a high-quality engineer is a waste if the organization around them fails. It's better to have mediocre developers in a mature organization than a hero developer working in the shadows. I've seen a few.
93. Cthulhu_ ◴[] No.45050007{3}[source]
Business can request it, but it's your job as a software engineer to build quality software; don't shift the blame.

Does a construction engineer blame an architect's wacky designs if a building collapses? No, they either engineer it so it doesn't collapse, convince the architect that it will collapse because physics, or they refuse.

People want to be able to use a bridge for free too, doesn't mean there's no money in it.

As for gold plating, is that really improving software quality, or is that yak shaving / bike shedding?

94. Cthulhu_ ◴[] No.45050022{3}[source]
> (this includes almost all of us at some point in time)

I'd say this includes all of us all the time; a good developer never trusts their own work blindly, and spends more time gathering requirements and verifying their and others' work than writing code.

95. Cthulhu_ ◴[] No.45050049{5}[source]
This is why at every software project I've done in the past 15 odd years, steps were taken to prevent this in an automated and standardized fashion; code reviews of course, but they're more for functionality. Unit test requirements, integration / end-to-end tests based on acceptance criteria, visual regression tests, linting, type systems, OTAP, CI/CD, audit log via Git and standardized commit messages, etc etc etc.

My job hasn't significantly changed with AI, as AI generated code still has to pass all the hurdles I've set up while setting up this project.

96. benrutter ◴[] No.45050134[source]
> It's the end result of a process, and that process informs both your software development practices, but also your testing. Your management. Even your sales and servicing.

I think the bit I quoted, especially if you read it in the context of the article, is talking about culture. I.e. it's talking about a process that informs software development, management and sales. Things like formal proofing and type systems are exactly the kind of processes it isn't talking about.

I kind of agree with you though about the process/culture distinction - ultimately, if you don't have a culture where people actively care about improving reliability, any process is just gonna become a tick-box exercise to appease management.

97. benrutter ◴[] No.45050179[source]
That's true, but even so - great developers still make mistakes, and if they don't hear about production errors because of a breakdown in the customer communications from sales etc, then those mistakes will never be fixed.

It's not that great developers aren't necessary for software quality, more that they aren't sufficient.

replies(1): >>45061480 #
98. 0xDEAFBEAD ◴[] No.45050181{5}[source]
Examples?
replies(1): >>45052610 #
99. WalterBright ◴[] No.45050906{5}[source]
Most airplane crashes are due to multiple failures. The accidents are investigated, and each failure is addressed and fixed.

The result is incredible safety.

replies(1): >>45053876 #
100. Izkata ◴[] No.45052435{3}[source]
The Boeing 737 MAX had an additional safety feature that was causing crashes due to bad input from the sensors, that pilots didn't know about so they couldn't override. This was 2018 and 2019. After the first crash, the manuals and training were updated to explain what was going on and how to override it.

https://en.wikipedia.org/wiki/Maneuvering_Characteristics_Au...

101. wat10000 ◴[] No.45052610{6}[source]
Dependency managers, Agile, Electron, AI, Java.
102. bombcar ◴[] No.45053876{6}[source]
People know about that; what they forget about is that any failure is noted and repaired (or deemed serviceable until repair).

Airplane reliability is from lots of failure analysis and work but also comprehensive maintenance plans and procedures.

103. rowanG077 ◴[] No.45061480{3}[source]
I think great developers are necessary for software quality. But indeed often not sufficient.
104. kulahan ◴[] No.45071172{5}[source]
This is absolutely my belief as well. I think it currently detracts from the title of other, Real Engineers.
105. I_dream_of_Geni ◴[] No.45120124{4}[source]
To be fair, reps don't really know anything deep about their product. They just parrot what they are told (or they wing it, which, I guess, can be lying). They are pushed to sell, and they will say anything to sell.
106. I_dream_of_Geni ◴[] No.45120141{4}[source]
"Software developers are the absolute most offensive use of the word "engineer".

This, exactly this! I am a retired Aerospace engineer. 8 years of engineering studies at college and 2 years of work study before I was hired.

My son considers himself a "software engineer" and I have told him many times that he is NOT an engineer. He was homeschooled (so, pat my back there), never went to college, never studied programming at all. Yet he makes between $200K-$250K per year, 5 TIMES what I made as a Senior Engineer at Boeing. smh