1. An engineer wrote a regular expression that could easily backtrack enormously.
2. A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.
3. The regular expression engine being used didn’t have complexity guarantees.
4. The test suite didn’t have a way of identifying excessive CPU consumption.
5. The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.
6. The rollback plan required running the complete WAF build twice, which took too long.
7. The first alert for the global traffic drop took too long to fire.
8. We didn’t update our status page quickly enough.
9. We had difficulty accessing our own systems because of the outage and the bypass procedure wasn’t well trained on.
10. SREs had lost access to some systems because their credentials had been timed out for security reasons.
11. Our customers were unable to access the Cloudflare Dashboard or API because they pass through the Cloudflare edge.
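
Items 1, 3 and 4 of that list are easy to reproduce at small scale. Here is a rough sketch of what a "does this rule burn CPU" test could look like. It assumes Python's backtracking `re` module as a stand-in for the WAF's engine, and `(a+)+$` is a textbook pathological pattern, not the rule from the incident. The helper name `regex_within_budget` and the time budgets are made up for illustration.

```python
import subprocess
import sys


# Hypothetical helper, not from the post-mortem: runs one regex search in a
# child Python process and reports whether it finished within the time budget.
def regex_within_budget(pattern: str, text: str, budget_seconds: float = 2.0) -> bool:
    # The child receives the pattern and the input as plain arguments.
    child_code = "import re, sys; re.search(sys.argv[1], sys.argv[2])"
    try:
        subprocess.run(
            [sys.executable, "-c", child_code, pattern, text],
            timeout=budget_seconds,  # the child is killed when this expires
            check=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return True


if __name__ == "__main__":
    hostile_input = "a" * 40 + "b"  # almost matches, which forces backtracking
    print(regex_within_budget(r"a+$", hostile_input))     # simple pattern
    print(regex_within_budget(r"(a+)+$", hostile_input))  # nested quantifier
```

Running the match in a separate process is the point of the design: a runaway match gets killed instead of hanging the test suite, which is roughly the property the missing CPU protection would have provided in production.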

Here's my version of what went wrong:
1. The process for composing complex regular expressions is "engineer tries to shove a lot of symbols into a line" rather than "compile/compose regex programmatically from individual matches"
2. Production services had no service health watchdog (the kind of thing that makes systemd stop re-running services that repeatedly hang/die)
3. Performance testing/quality assurance not done before releasing changes (this is not CI/CD)
4. No gradual rollout
5. No testing of rollbacks
6. Lack of emergency response plans / training
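
To make items 4 and 5 concrete, here is a minimal sketch of a gradual rollout that exercises the rollback path whenever a stage looks unhealthy. Everything in it is hypothetical: the stage names, the `staged_rollout` function, and the `apply`/`healthy`/`rollback` callbacks stand in for whatever a real deployment system provides.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[str],
    apply: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Apply a change stage by stage; undo everything at the first unhealthy stage.

    Returns the stages that kept the change (empty if a rollback happened).
    """
    applied: List[str] = []
    for stage in stages:
        apply(stage)
        applied.append(stage)
        if not healthy(stage):
            # Undo in reverse order so the widest stage is reverted first.
            for done in reversed(applied):
                rollback(done)
            return []
    return applied


if __name__ == "__main__":
    log = []
    kept = staged_rollout(
        ["canary", "one-region", "global"],          # assumed stage names
        apply=lambda s: log.append(("apply", s)),
        healthy=lambda s: s != "one-region",         # pretend this stage breaks
        rollback=lambda s: log.append(("rollback", s)),
    )
    print(kept, log)
```

The useful property is that "global" is never touched once an earlier stage fails, and the rollback code runs in ordinary testing rather than for the first time during an outage.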
All of these things are completely common, by the way, so they're in no way surprising. Budget has to actually be set aside to continuously improve the reliability of a service, or it doesn't get done. These incidents are a good way to get that budget.

(Wrt the regexes: I know they're implementing a new system that avoids a lot of this, but in the new system they can still write regexes, which (I think) should be constructed programmatically.)
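
By "constructed programmatically" I mean something like the following rough sketch. The helper names (`literal`, `any_of`, `seq`, `repeat`) and the key/value rule are invented for illustration and not taken from any real WAF. The idea is that each piece is small enough to review on its own, and that policy such as "repetition must be bounded" lives in one helper instead of in every hand-written pattern.

```python
import re


# Hypothetical building blocks; the names are illustrative only.
def literal(text: str) -> str:
    """Match the text exactly, escaping any regex metacharacters."""
    return re.escape(text)


def any_of(*parts: str) -> str:
    """Match any one of the given sub-patterns."""
    return "(?:" + "|".join(parts) + ")"


def seq(*parts: str) -> str:
    """Match the sub-patterns one after another."""
    return "".join(parts)


def repeat(part: str, lo: int, hi: int) -> str:
    """Match the sub-pattern between lo and hi times (always bounded)."""
    return f"(?:{part}){{{lo},{hi}}}"


# Compose a rule from named pieces instead of one long line of symbols.
KEY = repeat("[a-z_]", 1, 32)
VALUE = repeat("[0-9A-Za-z]", 1, 64)
ASSIGNMENT = seq(KEY, any_of(literal("="), literal(":=")), VALUE)
ASSIGNMENT_RE = re.compile(ASSIGNMENT)

if __name__ == "__main__":
    for candidate in ["user_id=42", "user_id:=42", "=42"]:
        print(candidate, bool(ASSIGNMENT_RE.fullmatch(candidate)))
```

This doesn't make a backtracking engine safe by itself, but it gives you a place to reject risky constructs (unbounded wildcards, nested quantifiers) before a rule ever ships.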