And those rules still get run on every box on Cloudflare's edge network, against HTTP requests from strangers on the internet, right?
So how come this didn't get triggered by a customer first?
Perhaps it did get triggered by a customer first, but that customer didn't get much traffic to the URL that triggers the issue, and that box just had one thread stuck executing the regex for a few minutes until a health check killed it...? Does this imply that Cloudflare runs with health checks failing at random across the fleet, and that there isn't someone looking at core dumps from such failures?
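For context, here's a minimal sketch (in Python, with a purely illustrative pattern, not the actual WAF rule, which reportedly involved something like `.*.*=.*`) of how a single pathological request can pin one worker thread in a backtracking regex engine until something external kills it:

```python
import re
import time

# Nested quantifiers force the backtracking engine to try exponentially
# many ways to partition the input before it can report "no match".
# Illustrative pattern only; not the rule from the incident.
pattern = re.compile(r'^(a+)+$')

# An input that almost matches but fails on the very last character.
payload = 'a' * 28 + '!'

start = time.time()
pattern.match(payload)  # single-threaded, CPU-bound for the whole attempt
print(f'{time.time() - start:.1f}s for {len(payload)} bytes')

# Each extra 'a' roughly doubles the runtime, so a slightly longer payload
# keeps the thread spinning long enough for a watchdog or health check
# to kill the process, exactly the sporadic-failure scenario above.
```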
That would align with my experience of seeing occasional "502 Bad Gateway" errors from Cloudflare over the past few years. It also seems plausible given the incident where Cloudflare servers leaked sensitive memory contents into HTTP responses, which happened so frequently that the responses got cached by Google search. It's hard to leak arbitrary memory contents without the occasional SIGSEGV...
If the above conjecture is true, it reflects very badly on the engineering culture at Cloudflare: the core issue had been showing up sporadically across the fleet for a long time but was ignored, and even during the postmortem process, which should be a very thorough investigation, the telltale warning signs were still missed.