The ultimate cause was in the network initialisation, which used a network library that was a tissue-paper-thin wrapper around Linux sockets. Downloading a new software version to the device would halt the PLC, but the halt didn’t cleanly shut down open sockets. They would stay open, preventing a network service from starting until the unit was restarted. So I did the obvious thing and wrote the socket handle to a file. On startup I’d check for the file and, if it existed, close that socket handle. This worked great during development.
Of course, this file was still there after a power cycle, by which point the handle stored in it no longer referred to my socket. 99% of the time nothing would happen, but very occasionally, closing this now-random handle on startup would segfault the soft PLC runtime. So dumb, but so hard to actually catch in the wild.
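For anyone tempted by the same trick, here is a rough sketch of a more defensive version. It is not the code from the device, which used a different library entirely; it is Python for brevity, it assumes Linux, and the names `remember_socket` and `close_remembered_socket` are made up for illustration. The idea is to delete the record as soon as it is read and to close the stored descriptor only if it still refers to a socket.

```python
import os
import socket
import stat


def remember_socket(path, sock):
    """Record the socket's descriptor number so a later startup can find it."""
    with open(path, "w") as f:
        f.write(str(sock.fileno()))


def close_remembered_socket(path):
    """Close the recorded descriptor only if it is still a socket.

    Returns True if a socket was closed, False otherwise.
    """
    try:
        with open(path) as f:
            fd = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False  # no record, or the record is unreadable
    os.remove(path)  # never act on the same record twice
    try:
        if not stat.S_ISSOCK(os.fstat(fd).st_mode):
            return False  # the number now names something other than a socket
    except (OSError, OverflowError):
        return False  # the descriptor is not open in this process
    os.close(fd)
    return True
```

This only narrows the window. After a reboot the same number could belong to a different, perfectly legitimate socket, and this check would still close it. Keeping the record somewhere that does not survive a power cycle, such as a tmpfs, would address that case more directly.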