I do worry about all the automation being another failure point, along with the IaC stuff. That is all software too! How do you update that safely? It's turtles all the way down!
For example, for the failover regions (from the article) you could write a Pulumi function that parameterizes only the n things that differ per failover environment, and then guarantee / verify that the environments are otherwise nearly identical. Of course, many people use modules / Terragrunt for similar reasons, but it ends up being quite powerful.
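A rough sketch of that idea in plain TypeScript, not actual Pulumi code. The `FailoverParams` and `EnvSpec` shapes, the field names, and the concrete values are all made up for illustration; in a real Pulumi program the body of `makeFailoverEnv` is where the resources would be declared.

```typescript
// Hypothetical: the only values allowed to differ between failover environments.
interface FailoverParams {
  region: string;
  cidrBlock: string;
  role: "primary" | "secondary";
}

// Hypothetical description of one environment.
interface EnvSpec {
  name: string;
  region: string;
  cidrBlock: string;
  role: string;
  instanceType: string;
  healthCheckPath: string;
}

// Everything not passed in as a parameter is fixed here,
// so it cannot drift between environments.
function makeFailoverEnv(p: FailoverParams): EnvSpec {
  return {
    name: `app-${p.region}`,
    region: p.region,
    cidrBlock: p.cidrBlock,
    role: p.role,
    instanceType: "t3.medium",
    healthCheckPath: "/healthz",
  };
}

// Lists the fields whose values differ between two environments.
function differingKeys(a: EnvSpec, b: EnvSpec): string[] {
  return (Object.keys(a) as (keyof EnvSpec)[]).filter((k) => a[k] !== b[k]);
}

const primary = makeFailoverEnv({
  region: "us-east-1",
  cidrBlock: "10.0.0.0/16",
  role: "primary",
});
const secondary = makeFailoverEnv({
  region: "us-west-2",
  cidrBlock: "10.1.0.0/16",
  role: "secondary",
});

// Fields that are allowed to differ; anything else showing up is drift.
const allowedToDiffer = new Set(["name", "region", "cidrBlock", "role"]);
const unexpectedDrift = differingKeys(primary, secondary).filter(
  (k) => !allowedToDiffer.has(k),
);
```

If `unexpectedDrift` is ever non-empty, the two environments have diverged in something that was supposed to be shared, and that check can run in CI before anything is deployed.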
One of the questions I frequently get is "do you automatically roll back?" And I have to hide in the corner and say "not really". Often, if you knew a rollback would work, you probably could also have known not to roll out in the first place. I've seen a lot of failures that only got worse when automation attempted to turn the thing off and on again.
Luckily from an automation roll-out standpoint, it's not that much harder to test in isolation. The harder parts to validate are things like "Does a Route 53 Failover Record really work in practice at the moment we actually need it to work?"
Usually the answer is yes, but then there's always the "but it too could be broken", and as you said, it's turtles all the way down.
The nice part is that, realistically, the automation for dealing with rollout and IaC is small and simple. We've split up our infrastructure along individual services, so each piece of infra is also straightforward.
In practice, our infra is less DRY and more repeated, which has the benefit of avoiding the complexity that often comes from attempting to reduce code duplication. The ancillary benefit is that simple stuff changes less frequently, and less frequent changes mean less opportunity for issues.
Not surprisingly, most incidents come from changes humans make. The second most common source of incidents is assumptions humans make about how a system operates in edge conditions. If you accept these two things as true, you spend more time designing simple systems and avoiding changes as much as possible, unless they are absolutely required.
Pulumi or CDK are for sure more powerful (and great tools) but when I need to reach for them I also worry that the infra might be getting too complex.
We don't use the CDK because it introduces complexity into the system.
However, to make CloudFormation usable, we write it in TypeScript and generate the templates on the fly. I know that sounds like the CDK, but given the size of our stacks, adding an additional technology doesn't make things simpler, and there is a lot of waste that can be removed by using a programming language rather than JSON/YAML.
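To give a feel for what "generate the templates on the fly" can look like, here is a minimal sketch, not our actual code. The helper names (`serviceBucket`, `buildTemplate`), the `billing` service, and the bucket naming scheme are assumptions for illustration; the only real parts are the CloudFormation template keys and the `AWS::S3::Bucket` resource type.

```typescript
// Minimal typing for the parts of a CloudFormation template used here.
interface CfnResource {
  Type: string;
  Properties?: Record<string, unknown>;
}

interface CfnTemplate {
  AWSTemplateFormatVersion: "2010-09-09";
  Description?: string;
  Resources: Record<string, CfnResource>;
}

// Hypothetical helper: one versioned bucket per service.
function serviceBucket(serviceName: string): CfnResource {
  return {
    Type: "AWS::S3::Bucket",
    Properties: {
      BucketName: `${serviceName}-artifacts`,
      VersioningConfiguration: { Status: "Enabled" },
    },
  };
}

// Hypothetical helper: assembles the whole template for one service.
function buildTemplate(serviceName: string): CfnTemplate {
  return {
    AWSTemplateFormatVersion: "2010-09-09",
    Description: `Infrastructure for ${serviceName}`,
    Resources: {
      ArtifactsBucket: serviceBucket(serviceName),
    },
  };
}

// The JSON string is what would be handed to CloudFormation.
const templateJson = JSON.stringify(buildTemplate("billing"), null, 2);
```

The output is an ordinary template, so it can be diffed and reviewed like any hand-written one; the TypeScript only removes the repetition.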
There are cases where we use some OpenTofu, but for infrastructure resources that are customer-specific, we have deployments that are run in TypeScript using the AWS SDK for JavaScript.
It would be nice if we could make a single change and have it roll out everywhere. But the reality is that there are many more states in play than what is represented by a single state file, especially when it comes to interactions between our infra, our customers' configuration, the history of requests to change that configuration, and resources with mutable state.
One example of that is AWS certificates. They expire. We need them to expire. But expiring certs don't magically update state files or stacks. It's really bad to make assumptions about a customer's environment based on what we thought we knew the last time a change was rolled out.
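A sketch of what "don't trust what we recorded last time" can mean for certificates, under some assumptions: the `CertStatus` shape, the injected `fetchLive` lookup, and the 30-day threshold are all hypothetical. In practice `fetchLive` would wrap whatever call returns the certificate's current expiry from the live environment.

```typescript
// Hypothetical shape of what a live lookup returns for one certificate.
interface CertStatus {
  domain: string;
  notAfter: Date;
}

// Hypothetical: injected so the decision logic never reads a stored state file.
type FetchLiveCert = (domain: string) => Promise<CertStatus>;

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when the certificate expires within thresholdDays of `now`
// (or has already expired). The 30-day default is an assumption.
function needsRenewal(
  cert: CertStatus,
  now: Date,
  thresholdDays: number = 30,
): boolean {
  const daysLeft = (cert.notAfter.getTime() - now.getTime()) / MS_PER_DAY;
  return daysLeft <= thresholdDays;
}

// Asks the live environment about every domain at decision time,
// then returns the domains whose certificates need attention.
async function certsToRenew(
  domains: string[],
  fetchLive: FetchLiveCert,
  now: Date,
): Promise<string[]> {
  const live = await Promise.all(domains.map((d) => fetchLive(d)));
  return live.filter((c) => needsRenewal(c, now)).map((c) => c.domain);
}
```

The point is only that the decision is made from what the environment reports right now, not from what a stack or state file said at the last rollout.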
But if you need to do something in a particular way, the tools should never be an obstacle.
You still end up having IaC. You can still have a declarative infrastructure.
Many people don't program with a configuration language like HCL. We use it as what it is - a DSL - that covers its main use case in an elegant manner. Maybe I never touched complex enough infra that twists a DSL into a general-use language, but in my experience there are simply no real benefits when using something like CDK (I never tried Pulumi to be fair).