698 points jgrahamc | 1 comment
_wmd No.20422740
So in response to a catastrophic failure due to testing in prod, they're going to push out a brand new regex engine with an ETA of 2 weeks. Can anyone say testing in prod?

The constant use of 'I' and 'me' (19 occurrences in total) deeply tarnishes this report, and repeatedly singling out a responsible engineer, nameless or not, is a failure in its own right. This was a collective failure; any individual identity is totally irrelevant. We're not looking for an account of your superman-like heroism, sprinting from meeting rooms or otherwise. We want to know whether anything has been learned in the 2 years since Cloudflare leaked heap all across the Internet without noticing, and the answer to that seems fantastically clear.

replies(6): >>20422871 #>>20422873 #>>20422891 #>>20422903 #>>20422924 #>>20424743 #
1. gfodor No.20422873
Wow, I'm amazed two people could read that writeup (yourself and myself) and come to two totally different conclusions.

Pushing out a brand new regex engine surely will go through the usual process. This doesn't seem like it will take a lot of time unless there are surprises. Cloudflare clearly already has the infrastructure in place to run proper integration tests for correctness, plus the rampup infrastructure to ensure a change doesn't cause a global outage. This outage was global because the rampup infrastructure was explicitly not used, as per the protocol.

I have no idea what you read where a single engineer was singled out. At several points in this post mortem the author identifies that the regex being written by the individual involved was far from the only cause of the outage. This is a very textbook blameless post mortem doc afaict.

The narrative about the actions taken and the meetings people were in is also par for the course for a good post mortem, since these variables are real and should be addressed by remediation items if they contributed to the outage. (For example, is it sane that the entire engineering team was synchronously in a meeting? Probably not.)