
203 points mooreds | 10 comments
1. thisnullptr ◴[] No.45960639[source]
It’s fascinating to me that people think their services are so important they can’t survive any downtime. Can we all admit that, while annoying, nothing really bad happened even when us-east-1 was down for almost half a working day?
replies(5): >>45960777 #>>45960806 #>>45961671 #>>45961769 #>>45966649 #
2. JSR_FDED ◴[] No.45960777[source]
If you’re providing auth services to many companies, then a failure increases the likelihood of something bad happening to an unacceptable degree.
3. shoo ◴[] No.45960806[source]
In many contexts you are correct. Further, as someone mentioned in the earlier thread about the AWS us-east-1 outage, customers can be more forgiving of outages if you as the vendor can point to a widespread us-east-1 outage and note that it is down for everyone.

But, as JSR_FDED's sibling comment notes & as is spelled out in the article, Authress' business model of offering an auth service means that their outage may entirely brick their clients' customer-facing auth and machine-to-machine auth.

I've worked in megacorp environments where an outage of certain internal services responsible for auth or issuing JWTs would break tens or hundreds of internal services and various customer-facing flows. In many business contexts a big, messy customer-facing outage for a day or so doesn't actually matter, but in some contexts it really can. In terms of blast radius, unavailability of a key auth service depended on by hundreds of things is up there with, I dunno, breaking the network.

replies(1): >>45962940 #
4. bostik ◴[] No.45961671[source]
As other posters have commented, an external auth service is a very special thing indeed. In modern and/or zero-trust systems if auth doesn't work, then effectively nothing works.

My rule of thumb from past experience is that if you demand 99.9% uptime for your own systems and you have in-house auth, then that auth system must have 99.99% reliability. If you are serving auth for OTHERS, then you have a system that can absolutely never be down, and at that point five nines becomes a baseline requirement.
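The downtime budgets behind those targets are worth spelling out; a quick back-of-the-envelope sketch (assuming a 365-day year):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600

def downtime_budget_minutes(availability):
    """Minutes of downtime per year allowed at a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability)

for label, target in [("three nines", 0.999),
                      ("four nines", 0.9999),
                      ("five nines", 0.99999)]:
    # three nines: ~525.6 min/year; four nines: ~52.6; five nines: ~5.3
    print(f"{label}: ~{downtime_budget_minutes(target):.1f} min/year")
```

Five nines leaves roughly five minutes of total downtime per year, which is why it effectively means "never down" in practice.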

Auth is a critical path component. If your service is in the critical path in both reliability and latency[ß] for third parties, then every one of your failures is magnified by the number of customers getting hit by it.

ß: The current top-voted comment thread includes a mention that latency and response time should also be an SLA concern. I agree. For any hot-path system you must always be tracking the latency distribution, both from the service's own viewpoint AND from the point of view of the outside world. The typically useful metrics for that are p95, p99, p999 and max. Yes, max is essential to include: you always want to know the worst experience someone/something had during any given time window.
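A minimal sketch of that window summary, assuming the samples are already collected in milliseconds (a production system would use a streaming histogram such as HdrHistogram rather than sorting each window):

```python
def latency_summary(samples_ms):
    """Summarize one time window of latency samples (milliseconds)."""
    ordered = sorted(samples_ms)
    n = len(ordered)

    def pct(q):
        # Simple rank-based percentile; clamp to the last sample.
        return ordered[min(n - 1, int(q * n))]

    return {
        "p95": pct(0.95),
        "p99": pct(0.99),
        "p999": pct(0.999),
        "max": ordered[-1],  # the worst single experience in the window
    }
```

Note that max falls straight out of the sorted window; it costs nothing extra to report, and it is the one number the percentiles will hide.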

replies(1): >>45962931 #
5. catlifeonmars ◴[] No.45961769[source]
It’s not that things can’t survive downtime technically; it’s that in _many_ cases (although, as you rightly point out, not _most_) downtime is costly to businesses.

I agree that the set of business-critical functions in most shops is going to be vastly overestimated by engineers on the ground.

6. wparad ◴[] No.45962931[source]
The sad truth of the world is that in many cases latency isn't the most critical aspect for tracking. We absolutely do track it, because we have the expectation that authentication requests complete. But there are many moving parts that make reliable tracking not entirely feasible:

* end location of user
* end location of customer service
* third party login components (login with google, et al)
* corporate identity providers
* webauthn
* customer specific login mechanism workflows
* custom integrations for those login mechanisms
* user's user agent
* internet connectivity

All of those significantly influence the response capability in a way which makes tracking latency next to useless. Maybe there is something we could be doing, though. In more than a couple of scenarios we do have tracking, metrics, and alerting in place; it just doesn't end up in our SLA.

replies(2): >>45968351 #>>45970154 #
7. wparad ◴[] No.45962940[source]
Absolutely. Part of the problem is that a whole region being down is often less of a problem than just one critical service being down. And, as you point out, the blast radius of a critical dependency is huge.
8. filearts ◴[] No.45966649[source]
That's a bit of a naive perspective. There are plenty of situations and industries where access being down has an impact far beyond inconvenience. For example, access to medical files for treatment, allergies and surgery. Or access to financial services.
9. bostik ◴[] No.45968351{3}[source]
While I agree with parts of the above, there are bits I disagree with. It's true that you cannot control the network conditions for third parties, and therefore can never be in a position to guarantee an SLA for the round-trip experience. But I object to the notion that tracking end-to-end latency is useless. After all, the three Nielsen usability thresholds are all about latency(!)

Funnily enough, looking through your itemisation I spot two groups that would each benefit from their own kinds of latency monitoring. End location and internet connectivity of the client go into the first. Third-party providers go into the second.

For the first, you'd need to have your own probes reporting from the most actively used networks and locations around the world - that would give you a view into the round-trip latency per major network path. For the second, you'd want to track the time spent between the steps that you control - which in turn would give you a good view into the latency-inducing behaviour of the different third-party providers. Neither are SLA material but they certainly would be useful during enterprise contract negotiations. (Shooting impossible demands down by showing hard data tends to fend off even the most obstinate objections.)
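The second kind of monitoring can be sketched very simply: wrap only the hop you hand off to the third party and record the elapsed time, tagged by provider. Names here are hypothetical, and a real system would ship the measurement to a metrics backend rather than return it:

```python
import time

def timed_provider_step(provider, call):
    """Time a single third-party hop we initiate, e.g. a token exchange."""
    t0 = time.perf_counter()
    result = call()  # the provider-facing request itself
    elapsed_ms = (time.perf_counter() - t0) * 1000
    # Emit {provider, latency_ms} to your metrics pipeline here.
    return result, {"provider": provider, "latency_ms": elapsed_ms}
```

Subtracting these per-provider timings from the end-to-end probe numbers is what separates your latency from theirs in those contract conversations.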

User-agent and bespoke integrations/workflows are entirely out of your hands, and I agree it's useless to try to measure latency for them specifically.

Disclaimer: I have worked with systems where the internal authX roundtrip has to complete within 1ms, and the corresponding client-facing side has to complete its response within 3ms.

10. scottlamb ◴[] No.45970154{3}[source]
I imagine you exclude failures of customer systems from your reliability measurements—for example, if you send a backend request to or redirect the user's browser to the customer's corporate identity provider and that persistently fails, you don't call it your own outage.

The same can apply to latency. What is the latency of requests to your system—including dependencies you choose, excluding dependencies the customer chooses. The network leg from the customer or user to your system is a bit of a gray area. The simplest thing to do is measure each request's latency from the point of view of your backend rather than the initiator. This is probably good enough, although in theory it lets you off the hook a bit too easily—to some extent you can choose whether you run near the initiator or not and how many round trips are required, and servers can underestimate their own latency or entirely miss requests during failures. But it's not fair to fail your SLA because of end-user bufferbloat or bad wifi or a crappy ancient Chromebook with too many open tabs or customer webapp server's GC spiral or whatever. Basically impossible to make any 99.999% promises when those things are in play.

My preferred form of SLO is: x% of requests given y ms succeed within y ms, measured by my server. ("given" meaning "does not have an upfront timeout shorter than" and "isn't aborted by the client before".) I might offer a few such guarantees for a particular request type, e.g.:

* 50% of lookups given 1 ms succeed within 1 ms.

* 99% of lookups given 10 ms succeed within 10 ms.

* 99.999% of lookups given 500 ms succeed within 500 ms.

I like to also have client-side and whole-flow measurements but I'm much more cautious about promising anything about them.
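Computed from server-side request records, that SLO form might look like the following sketch. The field names are illustrative; "given y ms" filters out requests whose client deadline was shorter than the threshold or that were aborted before completing:

```python
def slo_fraction(requests, threshold_ms):
    """Fraction of requests 'given' threshold_ms that succeeded within it."""
    eligible = [r for r in requests
                if r["deadline_ms"] >= threshold_ms and not r["aborted"]]
    if not eligible:
        return None  # no requests were given this much time
    ok = sum(1 for r in eligible
             if r["success"] and r["latency_ms"] <= threshold_ms)
    return ok / len(eligible)
```

Each promised threshold (1 ms, 10 ms, 500 ms) is then just a separate call with its own target fraction.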