210 points dakshgupta | 1 comments
jedberg ◴[] No.41841847[source]
> this is also a very specific and usually ephemeral situation - a small team running a disproportionately fast growing product in a hyper-competitive and fast-evolving space.

This is basically how we ran things for the reliability team at Netflix. One person was on call for a week at a time. They had to deal with tickets and issues. Everyone else was on backup and only called for a big issue.

The week after you were on call was spent following up on incidents and remediation. But the remaining weeks were for deep work, building new reliability tools.

The tools that allowed us to be resilient enough that being on call for one week straight didn't kill you. :)

replies(1): >>41842151 #
dakshgupta ◴[] No.41842151[source]
I am surprised and impressed that a company at that scale functions like this. We often discuss internally whether we can still do this when we’re 7-8 engineers.
replies(1): >>41842458 #
jedberg ◴[] No.41842458[source]
I think you're looking at it backwards. We were only able to do it because we had so many engineers that we had time to write tools to make the system reliable enough.

On call for a week at a time only really works if you only get paged at night once a week max. If you get paged every night, you will die from sleep deprivation.

replies(1): >>41845266 #
1. dmoy ◴[] No.41845266[source]
Moving from 24/7 on-call to 12-hour shifts, trading off with another continent, is really nice.