Details of the Cloudflare outage on July 2, 2019

(blog.cloudflare.com)

698 points jgrahamc | 1 comments | 12 Jul 19 15:46 UTC

buildzr ◴[12 Jul 19 18:09 UTC] No.20422952[source]▶

>>20421538 (OP) #

> Then we moved on to restoring the WAF functionality. Because of the sensitivity of the situation we performed both negative tests (asking ourselves “was it really that particular change that caused the problem?”) and positive tests (verifying the rollback worked) in a single city using a subset of traffic after removing our paying customers’ traffic from that location.

Haha, so the free customers are crash test dummies for providing test traffic. Nice.

I actually don't mind that much, considering it's basically bulletproof DDoS protection for free. I'd much rather "be the product" in this way than in the way ad companies make you the product, at least.

replies(5): >>20423170 #>>20423194 #>>20423767 #>>20424021 #>>20424880 #

duxup ◴[12 Jul 19 19:42 UTC] No.20423767[source]▶

>>20422952 #

I used to work as a network engineer for a while, and now do web development. I worked with a number of cloud providers, and you always have to roll out any fix carefully, even if you're 100% sure (you're never 100% sure) that you've got the fix.

I honestly just assumed that when customers chose where they would try things outside their lab, it was lower-level customers, a less busy part of the network, anywhere the impact isn't as serious. That's where the lowest risk is.

Some customers would discuss their own customers by name, as in "Should we try this change on Customer Y?" And the discussion would work along those lines.

When I started deploying my own software, I just assumed anything that I was deploying to for free was a sort of "lab light" for them. I also don't mind, it seems fair.

ANY change outside a lab... is its own experiment.

replies(1): >>20424487 #

1. fragmede ◴[12 Jul 19 21:04 UTC] No.20424487[source]▶

>>20423767 #

Lowest risk, yes, but not bulletproof.

Smaller customers don't have the same web traffic, which may not be enough to trip any given failure scenario. One could imagine that the backtracking in an onerous regexp is only triggered with a sufficiently large customer that has a path that is especially difficult to match.
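To make the backtracking point concrete, here is a minimal sketch. It does not use the regex from the actual incident; it assumes the textbook pattern `(a+)+b`, matched from the start of the input by a naive backtracking matcher. The function name `count_steps` is made up for illustration. It counts how many group boundaries the matcher tries before it succeeds or gives up.

```python
def count_steps(text: str) -> int:
    """Count attempts a naive backtracking matcher makes for (a+)+b,
    anchored at the start of `text`. Illustrative model only."""
    steps = 0

    def match_groups(pos: int) -> bool:
        nonlocal steps
        # Find how far the next `a+` group could extend.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Try every possible end for this group, longest first (greedy).
        for stop in range(end, pos, -1):
            steps += 1
            # After this group: either the closing 'b' is here...
            if stop < len(text) and text[stop] == "b":
                return True
            # ...or another `a+` group has to start here.
            if match_groups(stop):
                return True
        return False

    match_groups(0)
    return steps


if __name__ == "__main__":
    print(count_steps("aaab"))      # input that matches right away
    print(count_steps("a" * 10))    # no 'b': every split is tried
    print(count_steps("a" * 16))
```

In this model an input that matches costs a single step, while a run of n `a` characters with no `b` costs 2^n - 1 steps, so the expensive case only shows up when that particular shape of input arrives. Traffic that never contains it would pass a canary without any sign of trouble.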

With staged rollout and without a "fast" deploy procedure, by the time it hits the larger customers, it's already been deployed to some percentage of the fleet - and then you still have a problem, with a significant proportion of your fleet.
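As a rough sketch of that arithmetic: assume a rollout proceeds in fixed stages, and each stage is flagged by whether its traffic is capable of tripping the failure. The stage sizes, the flags, and the function name `exposure_at_first_failure` are all hypothetical, not anything from the post-mortem.

```python
from typing import Optional, Sequence


def exposure_at_first_failure(
    stage_sizes: Sequence[int], can_trigger: Sequence[bool]
) -> Optional[float]:
    """Return the fraction of the fleet already running the change when
    the first stage whose traffic can trip the bug is reached.
    Returns None if no stage can trip it. Hypothetical model."""
    total = sum(stage_sizes)
    deployed = 0
    for size, triggers in zip(stage_sizes, can_trigger):
        deployed += size
        if triggers:
            return deployed / total
    return None


if __name__ == "__main__":
    # Made-up fleet: two small early stages that never see the bad input,
    # then larger stages that do.
    stages = [10, 40, 150, 300]
    flags = [False, False, True, True]
    print(exposure_at_first_failure(stages, flags))
```

With these made-up numbers, the first stage that can show the problem is only reached once 200 of 500 hosts have the change, so 40% of the fleet is affected before anyone sees a failure. How bad that is then depends on how quickly the change can be rolled back.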

Staged rollouts are an entirely reasonable risk mitigation idea, mind you, and not one I'm even arguing against.

My point is that unfortunately it's no panacea, especially at scale. Which is what makes this all an experiment.
