
462 points jakevoytko | 1 comments
aetimmes ◴[] No.43493994[source]
(disclaimer: I know OP IRL.)

I'm seeing a lot of comments saying "only 2 days? must not have been that bad of a bug". Some thoughts here:

At my current day job, our postmortem template asks "Where did we get lucky?" In this instance, the author definitely got lucky to be working at Google, where 1) there were enough users to generate this Heisenbug consistently and 2) they had direct access to Chrome devs.

Additionally - the author (and his team) triaged, root-caused, and remediated a JS compiler bug in 2 days. The sheer amount of complexity involved in trying to narrow down where in the browser code this could all be going wrong is staggering. Consider that the reason it took him "only" two days is that he is very, _very_ good at what he does.

seeingnature ◴[] No.43495048[source]
I'd love to see the rest of your postmortem template! I never thought about adding a "Where did we get lucky?" question.

I recently realized that one question for me should be, "Did you panic? What was the result of that panic? What caused the panic?"

I had taken down a network, and the device led me down a pathway that required multiple apps and multiple logins I didn't have to regain access. I panicked and, because the network was small, roamed and moved all devices to my backup network.

The following day, under no stress, I realized that my mistake was that I was scanning a QR code 90 degrees off from its proper orientation. I didn't realize that QR codes had a proper orientation and figured that their corner identifiers handled any orientation. Then it was simple to gain access to that device. I couldn't even replicate the other odd path.

1. nathan_douglas ◴[] No.43496261[source]
A good section to have is one on concept/process issues you encountered, which I think is a generalization of your question about panic.

For instance, you might be mistaken about the operation of a system in some way that prolongs an outage or complicates recovery. Or perhaps there are complicated commands that someone pasted in a comment in a Slack channel once upon a time and you have to engage in gymnastics with Sloogle™ to find them, while the PM and PO are requesting updates. Or you end up saving the day because of a random confluence of rabbit holes you'd traversed that week, but you couldn't expect anyone else on the team to have had the same flash of insight that you did.

That might be information that is valuable to document or add to training materials before it is forgotten. A lot of postmortems focus on the root cause, which is great and necessary, but don't look closely at the process of trying to stop the bleeding.