
203 points mooreds | 8 comments
1. scottlamb ◴[] No.45959519[source]
I'm surprised the section about retries doesn't mention correlations. They say:

> P_{total}(Success) = 1 - P_{3rdParty}(Failure)^{RetryCount}

By treating P_{3rdParty}(Failure) as fixed, they're assuming a model in which each try is completely independent: all the failures are due to background noise. But that's totally wrong, as shown by the existence of big outages like the one they're describing, and it's not consistent with the way they describe outages in terms of the time they are down (rather than purely a fraction of requests).
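
To spell the assumed model out, here's a quick Python sketch of that formula, using the article's own illustrative numbers (a 10% per-try failure rate, 5 tries):

  # The formula taken at face value: every try fails independently
  # with the same fixed probability.
  def p_total_success(p_fail, retry_count):
      return 1 - p_fail ** retry_count

  print(p_total_success(0.10, 5))  # ~0.99999, i.e. "five nines"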

In reality, additional retries don't improve reliability as much as that formula says. Given that request 1 failed, request 2 (sent immediately afterward with the same body) probably will too. And there's another important effect: overload. During a major outage, retries often decrease reliability in aggregate—maybe retrying one request makes it more likely to go through, but retrying all the requests causes significant overload, often decreasing the total number of successes.

I think this correlation is a much bigger factor than "the reliability of that retry handler" that they go into instead. Not sure what they mean there anyway—if the retry handler is just a loop within the calling code, calling out its reliability separately from the rest of the calling code seems strange to me. Maybe they're talking about an external queue (SQS and the like) for deferred retries, but that brings in a whole different assumption that they're talking about something that can be processed asynchronously. I don't see that mentioned, and it seems inconsistent with the description of these requests as on the critical path for their customers. Or maybe they're talking about hitting a "circuit breaker" that prevents excessive retries—which is a good practice due to the correlation I mentioned above, but if so it seems strange to describe it so obliquely, and again strange to describe its reliability as an inherent/independent thing, rather than a property of the service being called.
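
(If a circuit breaker is what they mean, the idea is roughly the following; a minimal Python sketch of my own, with arbitrary thresholds, not anything taken from the article.)

  import time

  class CircuitBreaker:
      """After several consecutive failures, stop sending attempts for a while,
      since during a real outage further retries mostly add load."""

      def __init__(self, max_failures=5, cooldown_s=30.0):
          self.max_failures = max_failures
          self.cooldown_s = cooldown_s
          self.failures = 0
          self.opened_at = None

      def allow_request(self):
          if self.opened_at is None:
              return True
          if time.monotonic() - self.opened_at >= self.cooldown_s:
              # "half-open": let one probe through to see if the dependency recovered
              self.opened_at = None
              self.failures = 0
              return True
          return False

      def record(self, success):
          if success:
              self.failures = 0
              return
          self.failures += 1
          if self.failures >= self.max_failures:
              self.opened_at = time.monotonic()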

Additionally, a big pet peeve of mine is talking about reliability without involving latency. In practice, there's only so long your client is willing to wait for the request to succeed. If, say, that's 1 second, and you're waiting 500 ms for an outbound request before timing out and retrying, you can't even quite make it to 2 full (sequential) tries. You can hedge (wait a bit, then send a second request in parallel) for many types of requests, but that also worsens the math on overload and correlated failures.
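
As a concrete illustration of that budget problem, here's a rough sketch of a deadline-bounded retry loop (send_request is a hypothetical stand-in for whatever actually makes the call):

  import time

  def call_with_budget(send_request, budget_s=1.0, attempt_timeout_s=0.5):
      """Sequential retries inside a fixed client budget. With a 1 s budget and
      a 500 ms per-attempt timeout, the second try never gets its full timeout."""
      start = time.monotonic()
      last_error = None
      while True:
          remaining = budget_s - (time.monotonic() - start)
          if remaining <= 0:
              raise TimeoutError("client budget exhausted") from last_error
          try:
              # each attempt is capped by whatever budget is left
              return send_request(timeout=min(attempt_timeout_s, remaining))
          except Exception as e:  # real code would only retry retryable errors
              last_error = e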

The rest of the article might be much clearer, but I have a fever and didn't make it through.

replies(2): >>45959617 #>>45960718 #
2. lorrin ◴[] No.45959617[source]
Agreed, I think the introduction is wrong and detracts from the rest of the article.
replies(2): >>45960691 #>>45962915 #
3. ◴[] No.45960691[source]
4. shoo ◴[] No.45960718[source]
> the section about retries doesn't mention correlations. [...] By treating P_{3rdParty}(Failure) as fixed, they're assuming a model in which each each try is completely independent: all the failures are due to background noise. But that's totally wrong, as shown by the existence of big outages like the one they're describing

Yes, that jumped out at me as well. A slightly more sophisticated model could be to assume there are two possible causes of a failed 3rd party call: (a) a transient issue - failure can be masked by retrying, and (b) a serious outage - where retrying is likely to find that the 3rd party dependency is still unavailable.

Our probabilistic model of this 3rd party dependency could then look something like

  P(first call failure) = 0.10
  P(transient issue | first call failure) = 0.90
  P(serious outage | first call failure) = 0.10
  P(call failure | transient issue, prior call failure) = 0.10
  P(call failure | serious outage, prior call failure) = 0.95
I.e. a failed call is 9x more likely to be caused by a transient issue than a serious outage. If the cause was a transient issue we assume independence between sequential attempts like in the article, but if the failure was caused by a serious outage there's only a 5% chance that each sequential retry attempt will succeed.

In contrast with the math sketched in the article, where retrying a 3rd party call with a 10% failure rate 5 times could suffice for a 99.999% success rate, the above model (where the serious-outage failure mode produces a string of failures) would require 135 retries after a first failed call to reach the same 99.999% success rate.
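
For anyone who wants to check that figure, a quick Python sketch with the numbers above (variable names are mine):

  # Mixture model: given a first failure, the cause is a transient issue
  # (each retry then fails with prob 0.10) or a serious outage (prob 0.95),
  # and retries are independent within each mode.
  def p_overall_failure(retries):
      p_all_retries_fail = 0.90 * 0.10 ** retries + 0.10 * 0.95 ** retries
      return 0.10 * p_all_retries_fail  # 0.10 = P(first call failure)

  # Smallest retry count that reaches 99.999% overall success:
  print(next(n for n in range(1, 1000)
             if 1 - p_overall_failure(n) >= 0.99999))  # 135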

Your points about the overall latency the client is willing to tolerate and about retries causing additional load are good; in many systems "135 retry attempts" is impractical and would just mean "our overall system has failed and is unavailable".

Anyhow, it's still an interesting article. The meat of the argument and logic about 3rd party deps needing to meet some minimum bar of availability to be included still makes sense, but if our failure model considers failure modes like lengthy outages that can cause correlated failure patterns, that raises the bar for how reliable any given 3rd party dep needs to be even further.

replies(1): >>45962912 #
5. wparad ◴[] No.45962912[source]
This is absolutely true, but the end result is the same. The assumption is "We can fix a third party component behaving temporarily incorrectly, and therefore we can do something about it". If the third party component never behaves correctly, then there's nothing we can do to fix it.

Correlations don't have to be talked about, because they don't increase the likelihood of success, but rather the likelihood of failure, meaning that we would need orders-of-magnitude more reliable technology to solve that problem.

In reality, those sorts of failures aren't usually temporary but systemic, such as "we've made an incorrect assumption about how that technology works" - a feature, not a bug.

In that case, it doesn't really fit into this model. There are certainly things that would better indicate to us whether we could or couldn't use a component, but for the sake of the article, I think that would probably have been going much too far.

TL;DR Yes for sure, individual attempts are correlated, but in most cases it doesn't make sense to track that, because those situations end up in other buckets: "always down = unreliable" or "actually up = a more complex story which may not need to be modelled".

replies(1): >>45969083 #
6. wparad ◴[] No.45962915[source]
Hmmm, which part of the intro did you find an issue with? I want to see if I can fix it.
7. scottlamb ◴[] No.45969083{3}[source]
I think the reasoning matters as much as the answer, and you had to make at least a couple strange turns to get the "right answer" that retries don't solve the problem:

* the 3rd-party component offering only 90% success—I've never actually seen a system that bad. 99.9% success SLA is kind of the minimum, and in practice any system that has acceptable mean and/or 99%/99.9% latency for a critical auth path also has >=99.99% success in good conditions (even if they don't promise refunds based on that).

* the whole "really reliable retry handler" thing—as mentioned in my first comment, I don't understand what you were getting at here.

I would go a whole other way with this section—more realistic, much shorter. Let's say you want to offer 99.999% success within 1 second, and the third-party component offers 99.9% success per try. Then two tries give you 99.9999% success if the failures are all uncorrelated, but retries do not help at all when the third-party system is down for minutes or hours at a time. [1] Thus, you need to involve an alternative that is believed to be independent of the faulty system—and the primary tool AWS gives you for that is regional independence. This sets up the talk about regional failover much more quickly and with less head-scratching. I probably would have made it through the whole article yesterday even in my feverish state.

[1] unless this request can be done asynchronously, arbitrarily later, in which case the whole chain of thought afterward goes a different way.
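
In numbers (a rough sketch; "another region" is just my shorthand for the believed-independent alternative):

  p_fail_per_try = 0.001  # the dependency's 99.9% success per try

  # Uncorrelated failures: one retry already looks like six nines.
  print(1 - p_fail_per_try ** 2)  # 0.999999

  # During an outage the per-try failure probability is ~1 for the whole
  # window, so retrying buys nothing no matter how many times you repeat it.
  p_fail_during_outage = 1.0
  print(1 - p_fail_during_outage ** 2)  # 0.0

  # The multiplication only comes back when the second attempt goes somewhere
  # believed to fail independently, e.g. the same dependency in another region.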

replies(1): >>45973152 #
8. wparad ◴[] No.45973152{4}[source]
Hmm, I never considered using an SLA on latency as a potential way to justify the argument. If I pull this content into a future article or talk, I will definitely consider reframing it for easier understanding.