←back to thread

226 points beedeebeedee | 1 comments | | HN request time: 0.24s | source
Show context
peterkos ◴[] No.41900587[source]
I'm reminded of a time that an intern took down us-east1 on AWS, by modifying a configuration file they shouldn't have had access to. Amazon (somehow) did the correct thing and didn't fire them -- instead, they used the experience to fix the security hole. It was a file they shouldn't have had access to in the first place.

If the intern "had no experience with the AI lab", is it the right thing to do to fire them, instead of admitting that there is a security/access fault internally? Can other employees (intentionally, or unintentionally) cause that same amount of "damage"?

replies(12): >>41900622 #>>41900627 #>>41900641 #>>41900805 #>>41900919 #>>41901069 #>>41901814 #>>41903916 #>>41909887 #>>41910021 #>>41910134 #>>41910235 #
grogenaut ◴[] No.41900641[source]
From what I've seen in Amazon it's pretty consistent that they do not blame the messenger which is what they consider the person who messed up. Usually that person is the last in a long series of decisions that could have prevented the issue, and thus why blame them. That is unless the person is a) acting with malice, b) is repeatedly shown a pattern of willful ignorance. IIRC, when one person took down S3 with a manual command overriding the safeguards the action was not to fire them but to figure out why it was still a manual process without sign off. Say what you will about Amazon culture, the ability to make mistakes or call them out is pretty consistently protected.
replies(5): >>41900811 #>>41901212 #>>41911207 #>>41914419 #>>41915916 #
1. Twirrim ◴[] No.41911207[source]
> when one person took down S3 with a manual command overriding the safeguards

It didn't override safeguards, but they sure wanted you to think that something unusual was done as part of the incident. What they executed was a standard operational command. The problem was, the components that that command interacted with had been creaking at the edges for years by that point. It was literally a case of "when", and not "if". All that happened was the command tipped it over the edge in combination with everything else happening as part of normal operational state.

Engineering leadership had repeatedly raised the risk with further up the chain and no one was willing to put headcount to actually mitigating the problem. If blame was to be applied anywhere, it wasn't on the engineer following the run book that gave them a standard operational command to execute with standard values. They did exactly what they were supposed to.

Some credit where it's due, my understanding from folks I knew still in that space, is that S3 leadership started turning things around after that incident and started taking these risks and operational state seriously.