> But it's pretty common for a major event to happen in a single region.
It's actually pretty rare these days because all major clouds use zone-redundancy and hence their core services are robust to the loss of any single building. Even during the recent Iberian power outages the local cloud sites mostly (entirely?) stayed up.
The outages I've experienced over the last decade(!) were: Global certificate expiry (Azure), CrowdStrike (Windows everywhere), IAM services down globally (AWS), core inter-region router misconfiguration (customer-wide).
None would have been avoided by having more replicas in more places. All of our production systems are already zone-redundant, which is either the default or "just a checkbox" in most clouds.
This article adds no value to the discussion because it states a problem that's not that big a deal, and then doesn't provide any useful solutions for the few people for whom it is a big deal.
The problem is either easy to solve -- tick the checkbox for zone-redundancy -- or very difficult to solve -- make your app's data globally replicated -- and the article just says "you should do it" without further elaboration.
That's of no value to anyone.
> Not everyone needs regional redundancy, and it does add costs, but I don't think it should be dismissed easily.
IMHO, it should be dismissed easily for almost everyone. I have far too many customers who think they need regional redundancy and end up paying 2-3x as much for something that they'll never use and that wouldn't work anyway if they did need it.
> If you're all in on cloudiness, you could have as little as an alternate region replica of your data and your vm images, and be ready to go manually in another region if you need to.
This won't work for 90% of the customers that can afford it (big enterprise). Everyone, and I mean everyone, forgets about internal DNS, Active Directory, PKI, and other core services. Some web servers won't start if they're missing half their dependencies, but that's "another team"... and that other team didn't have regional redundancy as one of their requirements. "Oops".
Not to mention that most clouds would immediately "run out" of capacity during such a DR. You'd be fighting against every other customer trying to do the same thing at the same time. I've been there, done that, and I've gotten "Resource unavailable, try again" errors.
The only way to guarantee that failover actually works is to pre-reserve 100% of the required VM capacity. This requires about 2x the spend at all times, whether that capacity is used or not.
> Run some tests once or twice a year to confirm your plan works, and to make an estimate for how long it takes to restore service in the event of a regional outage.
This ends up being a completely faked paperwork exercise. Over the last few years, I've seen this little game played out in various hilarious ways, including:
1) The tests were marked as "successful" but the 1 TB of existing data wasn't being replicated to the DR site. The tests only ever submitted new data, which did work. "Oops".
2) The tests involved failing over the "workload" while the file shares, domain controllers, DNS, etc. remained at the original primary location and had no replicas. "Oops".
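Both failures come down to testing the wrong thing. A minimal sketch of the two checks that would have caught them, in Python; the `Dependency` type, the function names, and the checksum maps are all hypothetical, and how you actually collect checksums or a dependency inventory depends entirely on your environment:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    """Hypothetical inventory entry: a service and the regions where it has a working replica."""
    name: str
    regions: frozenset


def missing_in_region(deps, dr_region):
    """Names of dependencies with no replica in the DR region."""
    return sorted(d.name for d in deps if dr_region not in d.regions)


def unreplicated_keys(primary_checksums, dr_checksums):
    """Keys of pre-existing records that are absent or different at the DR site."""
    return sorted(k for k, v in primary_checksums.items() if dr_checksums.get(k) != v)


# Illustrative data only.
deps = [
    Dependency("app", frozenset({"primary", "dr"})),
    Dependency("dns", frozenset({"primary"})),
    Dependency("domain-controller", frozenset({"primary"})),
]
primary = {"order-1": "a1", "order-2": "b2", "order-3": "c3"}
dr = {"order-3": "c3"}  # only the record written during the test made it across

print(missing_in_region(deps, "dr"))
print(unreplicated_keys(primary, dr))
```

The point is that the test compares data that existed before the test started, and walks the whole dependency list rather than just the "workload".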
> A few minutes to put up an outage page and an hour or three to restore service is probably fine... Automatic regional failover gets tricky with data consistency and split brain as you mentioned; and hopefully you don't need to do it often.
Failover is the easy part. Now fail back without losing the data changes that occurred during the DR!
This is decidedly non-trivial unless you have bidirectional replication set up or a globally available database like Cosmos DB.
Inevitably the original site will come up and start accepting writes while the DR site is still up, and now you've got writes or transactions going to two places.
Reconciling that after the fact is awesome fun.
PS: No public cloud provides a convenient "global mutex" primitive on top of which such things can be easily built. You have to engineer this on a per-application basis, yourself. Good luck!
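For what it's worth, the usual per-application workaround is a fencing token: whoever promotes a site hands it a monotonically increasing epoch, and anything that accepts writes rejects stale epochs. A minimal in-memory sketch in Python, assuming a single arbiter and a shared write path that both sites go through (which is exactly the part no cloud hands you); `Arbiter` and `FencedStore` are hypothetical names, not any provider's API:

```python
class Arbiter:
    """Hypothetical single source of truth for which site is active."""

    def __init__(self):
        self.epoch = 0
        self.active_site = None

    def promote(self, site):
        """Make `site` active and return its new fencing token."""
        self.epoch += 1
        self.active_site = site
        return self.epoch


class FencedStore:
    """Accepts a write only if its token is at least the highest one seen so far."""

    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def write(self, epoch, key, value):
        if epoch < self.highest_epoch:
            return False  # stale writer, e.g. the old primary waking up
        self.highest_epoch = epoch
        self.data[key] = value
        return True


arbiter = Arbiter()
store = FencedStore()

primary_token = arbiter.promote("primary")
accepted_before = store.write(primary_token, "balance", 100)

dr_token = arbiter.promote("dr")  # failover
accepted_dr = store.write(dr_token, "balance", 80)

# The original site comes back and still holds its old token.
accepted_stale = store.write(primary_token, "balance", 100)
```

Here the last write is rejected because the old primary's token is older than the one issued at failover. The hard part is making the arbiter and the token check themselves survive a regional outage, which this sketch conveniently ignores.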