How when AWS was down, we were not

(authress.io)

203 points mooreds | 1 comments | 17 Nov 25 17:07 UTC


rdoherty ◴[17 Nov 25 21:02 UTC] No.45958246[source]▶

This is probably one of the best summarizations of the past 10 years of my career in SRE. Once your systems get complex enough, something is always broken and you have to prepare for that. Detection & response become just as critical as pre-deploy testing.

I do worry about all the automation being another failure point, along with the IaC stuff. That is all software too! How do you update that safely? It's turtles all the way down!

replies(2): >>45958417 #>>45958775 #

evanmoran ◴[17 Nov 25 21:16 UTC] No.45958417[source]▶

>>45958246 #

IaC is definitely a failure point, but the manual alternative is much worse! I’ve had a lot of benefit from using Pulumi, simply because the code can be more compact than the Terraform HCL was.

For example, for the failover regions (from the article) you could write a Pulumi function that parameterizes only the n things that differ per failover environment and guarantees / verifies that the scripts are nearly identical. Of course, many people use modules / Terragrunt for similar reasons, but it ends up being quite powerful.
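
A rough sketch of that idea in plain Python, with no Pulumi calls at all: every name here (`RegionStack`, `PER_REGION_FIELDS`, `make_failover`, `drift`) and every value is hypothetical and only illustrates the shape. In a real program the dataclass would feed actual resource definitions.

```python
from dataclasses import dataclass, replace, asdict


@dataclass(frozen=True)
class RegionStack:
    """Hypothetical description of one regional deployment."""
    region: str
    cidr_block: str
    instance_type: str = "t3.medium"
    min_replicas: int = 3
    enable_health_checks: bool = True


# Assumption: these are the only fields allowed to differ per region.
PER_REGION_FIELDS = {"region", "cidr_block"}


def make_failover(primary: RegionStack, **overrides) -> RegionStack:
    """Copy the primary stack, changing only whitelisted per-region fields."""
    unexpected = set(overrides) - PER_REGION_FIELDS
    if unexpected:
        raise ValueError(f"not a per-region field: {sorted(unexpected)}")
    return replace(primary, **overrides)


def drift(a: RegionStack, b: RegionStack) -> set:
    """Return the names of the fields whose values differ between two stacks."""
    da, db = asdict(a), asdict(b)
    return {name for name in da if da[name] != db[name]}


primary = RegionStack(region="us-east-1", cidr_block="10.0.0.0/16")
failover = make_failover(primary, region="us-west-2", cidr_block="10.1.0.0/16")

# Verify that the two environments differ only where they are allowed to.
assert drift(primary, failover) <= PER_REGION_FIELDS
```

The point of the whitelist is that an accidental difference (say, a smaller instance type in the failover region) fails loudly at definition time instead of surfacing during an outage.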

replies(3): >>45958669 #>>45958804 #>>45958816 #

1. spyspy ◴[17 Nov 25 21:40 UTC] No.45958669[source]▶

>>45958417 #

If you do use Terraform, for the love of god do NOT use Terraform Cloud. Up there with GitHub in the list of least reliable cloud vendors. I always have a "break glass" method of deploying from my work machine for that very reason.
