462 points jakevoytko | 1 comment
1. taneq No.43491766
I had one that took literally years to reproduce. It was in PLC code, on a touchscreen controller running a soft PLC with Busybox under the hood. These devices were used 24/7 and were usually absolutely bulletproof. Every now and then I’d hear that one had crashed on startup, but a power cycle usually fixed it. I finally managed to get it to happen in the workshop, and dropped everything to try and figure it out.

The ultimate cause was in the network initialisation, which used a network library that was a tissue-paper-thin wrapper around Linux sockets. Downloading a new software version to the device would halt the PLC, but this didn’t cleanly shut down open sockets. They would stay open, preventing a network service from starting until the unit was restarted. So I did the obvious thing and wrote the socket handle to a file. On startup I’d check for the file and, if it existed, close that socket handle. This worked great during development.

Of course this file was still there after a power cycle, when the saved handle no longer pointed at the old socket. 99% of the time nothing would happen, but very occasionally, closing this random socket handle on startup would segfault the soft PLC runtime. So dumb, but so hard to actually catch in the wild.
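
For anyone who wants to see the failure mode in miniature, here is a rough Python sketch of the same pattern. It is not the original PLC code, which isn't shown here. The function names (`save_handle`, `close_saved_handle`, `is_open`, `demo`) and the file name are made up for illustration, and closing the socket stands in for the power cycle. It relies on POSIX handing out the lowest free descriptor number, so the stale number ends up naming an unrelated descriptor.

```python
import os
import socket
import tempfile


def save_handle(path, fd):
    # Persist the raw descriptor number, as the workaround did.
    with open(path, "w") as f:
        f.write(str(fd))


def close_saved_handle(path):
    # On startup: close whatever descriptor number the file names.
    # Nothing checks that the number still refers to the old socket.
    with open(path) as f:
        fd = int(f.read())
    os.close(fd)
    return fd


def is_open(fd):
    # fstat fails with EBADF once a descriptor has been closed.
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


def demo():
    path = os.path.join(tempfile.mkdtemp(), "socket_handle")

    # "Previous run": open a socket and record its descriptor number.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    save_handle(path, sock.fileno())
    sock.close()  # stand-in for the power cycle: the old socket is gone

    # "Next boot": something unrelated gets opened first. The lowest free
    # descriptor number is handed out, so it reuses the recorded number.
    victim = os.open(os.devnull, os.O_RDONLY)

    # The startup workaround now closes the unrelated descriptor.
    closed = close_saved_handle(path)
    return closed, victim, is_open(victim)
```

Calling `demo()` returns the number that was closed, the number of the unrelated descriptor, and whether that descriptor is still open afterwards. Here the victim is just `/dev/null`, so nothing bad happens. In a runtime where that number belongs to something the process still needs, the result depends entirely on what happened to be opened first on that boot, which would fit the "99% of the time nothing" behaviour.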