
797 points by burnerbob | 3 comments
throwawaaarrgh ◴[] No.36813314[source]
There's a lot of bullshit in this HN thread, but here's the important takeaway:

- it seems their staff were working on the issue before customers noticed it.

- once paid support was emailed, it took many hours for them to respond.

- it took about 20 hours for an update from them on the downed host.

- they weren't updating affected users about the downed host or about ways to recover.

- the status page was bullshit - it just showed everything green even though their own dashboard told customers there was emergency maintenance going on.

I get that due to the nature of their plans and architecture, downtime like this is guaranteed and normal. But communication this poor is going to lose you customers. Be like other providers, who spam me with emails whenever a host I'm on even feels ticklish. Then at least I can go do something for my own apps immediately.

replies(6): >>36814300 #>>36814376 #>>36814608 #>>36814689 #>>36816612 #>>36817532 #
seti0Cha ◴[] No.36814300[source]
Not a great summary from my perspective. Here's what I got out of it:

- Their free tier support depended on them noticing message board activity, and they didn't.

- Those experiencing outages were seeing the result of deploying in a non-HA configuration. Opinions differ as to whether they were properly aware that they were in that state.

- They had an unusually long outage for one particular server.

- Those points combined resulted in many people experiencing an unexplained prolonged outage.

- Their dashboard shows only regional and service outages, not individual servers being down. People did not realize this and so assumed it was a lie.

- Some silliness with Discourse tags caused people to think they were trying to hide the problems.

In short: bad luck, some bad procedures on the customer-management side, and possibly some bad documentation produced a lot of smoke but not a lot of fire.

replies(2): >>36814431 #>>36814471 #
tptacek ◴[] No.36814431[source]
Apologies for repeating myself, but:

You get to a certain number of servers and the probability on any one day that some server somewhere is going to hiccup and bounce gets pretty high. That's what happened here: a single host in Sydney, one of many, had a problem.
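(For a sense of scale, with purely illustrative numbers rather than anything from Fly: if each host independently has a 0.1% chance of a bad day, then across 1,000 hosts the chance that at least one of them has a bad day is 1 - 0.999^1000, which works out to roughly 63%. At fleet scale, "rare per host" stops meaning "rare".)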

When we have an incident with a single host, we update a notification channel for people with instances on that host. They are a tiny sliver of all our users, but of course that's cold comfort for them; they're experiencing an outage! That's what happened here: we did the single-host notification thing for users with apps on that Sydney host.

Normally, when we have a single-host incident, the host is back online pretty quickly. Minutes, maybe double-digit minutes if something gnarly happened. About once every 18 months or so, something worse than gnarly happens to a server (they're computers, we're not magic, all the bad things that happen to computers happen to us too). That's what happened here: we had an extended single-host outage, one that lasted over 12 hours.

(Specifically, if you're interested: somehow a containerd boltdb on that host got corrupted, so when the machine bounced, containerd refused to come back online. We use containerd as a cache for OCI container images backing flyd; if containerd goes down, no new machines can start on the host. It took a member of our team, also a containerd maintainer, several hours to do battlefield surgery on that boltdb to bring the host back up.)
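(If you're curious what poking at a suspect boltdb even looks like, here's a rough sketch using the go.etcd.io/bbolt library: open the file read-only and walk every bucket, and corruption usually surfaces as an error or panic during the open or the traversal. The containerd metadata path below is an assumed default and this is an illustration, not our actual recovery tooling; the bbolt CLI's check command does a similar page-level consistency pass.)

    // Minimal sketch: open a (possibly corrupted) boltdb read-only and walk it.
    // The containerd metadata path is an assumed default, not Fly.io's tooling.
    package main

    import (
        "fmt"
        "log"
        "time"

        bolt "go.etcd.io/bbolt"
    )

    func main() {
        path := "/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db" // assumed location

        // Open read-only, with a timeout so a held file lock doesn't hang us.
        db, err := bolt.Open(path, 0o600, &bolt.Options{ReadOnly: true, Timeout: 5 * time.Second})
        if err != nil {
            log.Fatalf("open failed (often the first symptom of corruption): %v", err)
        }
        defer db.Close()

        // Walk every top-level bucket; a corrupt page tends to show up here
        // as an error or panic rather than a clean traversal.
        err = db.View(func(tx *bolt.Tx) error {
            return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
                fmt.Printf("bucket %q: %d keys\n", name, b.Stats().KeyN)
                return nil
            })
        })
        if err != nil {
            log.Fatalf("traversal failed: %v", err)
        }
    }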

Now, as you can see from the fact that we were at the top of HN all night, there is a difference between a 5-minute single-host incident and a 12-hour single-host outage. Our runbook for single-host problems is tuned for the former. 12-hour single-host outages are pretty rare, and we probably want to put them on the global status page (I'm choosing my words carefully because we have an infra team and infra management and I'm not on it, and I don't want to speak for them or, worse, make commitments for them; all I can say is I get where people are coming from on this one).

replies(4): >>36814531 #>>36814688 #>>36815167 #>>36816132 #
kunley ◴[] No.36815167[source]
> somehow a containerd boltdb on that host got corrupted, so when the machine bounced, containerd refused to come back online. We use containerd as a cache

Hey, even if I can feel sympathetic about the course of unfortunate events, it's hard not to comment:

if you're using a cache, you should invalidate it on failure!

replies(1): >>36815351 #
1. tptacek ◴[] No.36815351[source]
It's a read-through cache. This wasn't a cache invalidation issue. It's a systems-level state corruption problem that just happened to break a system used primarily as a cache.
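(To make the distinction concrete, here's a generic read-through cache sketch in Go; made-up names, nothing to do with our code. The point is that losing the cached data only costs you refetches from the origin, so "invalidate on failure" isn't the interesting question; the problem was the cache's own on-disk state refusing to load at all.)

    // Generic read-through cache sketch (illustrative only).
    package main

    import (
        "fmt"
        "sync"
    )

    type readThrough struct {
        mu    sync.Mutex
        data  map[string][]byte
        fetch func(key string) ([]byte, error) // pulls from the origin, e.g. a registry
    }

    func newReadThrough(fetch func(string) ([]byte, error)) *readThrough {
        return &readThrough{data: make(map[string][]byte), fetch: fetch}
    }

    // Get returns the cached value, or fetches and stores it on a miss.
    // Wiping the map entirely is recoverable: later Gets simply refetch.
    func (c *readThrough) Get(key string) ([]byte, error) {
        c.mu.Lock()
        if v, ok := c.data[key]; ok {
            c.mu.Unlock()
            return v, nil
        }
        c.mu.Unlock()

        v, err := c.fetch(key)
        if err != nil {
            return nil, err
        }

        c.mu.Lock()
        c.data[key] = v
        c.mu.Unlock()
        return v, nil
    }

    func main() {
        cache := newReadThrough(func(key string) ([]byte, error) {
            // Stand-in for pulling an OCI image blob from a registry.
            return []byte("blob-for-" + key), nil
        })
        v, _ := cache.Get("sha256:abc123")
        fmt.Println(string(v))
    }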
replies(1): >>36821279 #
2. kunley ◴[] No.36821279[source]
What I meant is that if the compromised host was unable to use the broken boltdb cache, the cache should have been zeroed and repopulated. Would rebuilding the cache really have taken hours, versus the hours spent trying to fix the boltdb?

Btw I am happy I have only small amounts of data in any of my bolt databases...

replies(1): >>36821769 #
3. tptacek ◴[] No.36821769[source]
This isn't a boltdb we designed. It's just containerd. I am probably not doing the outage justice, because "blitz and repopulate" is a time-honored strategy here.
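(For a flavor of what "blitz and repopulate" looks like, here's a rough sketch: if the db won't open, move it aside and let the owning service rebuild it from the origin on restart. The path and the rename-instead-of-delete policy are assumptions for illustration, not our runbook.)

    // Sketch of "blitz and repopulate": if a cache-like boltdb won't open,
    // move it aside so the owning service rebuilds it from the origin on
    // restart. Path and policy are illustrative assumptions, not a runbook.
    package main

    import (
        "fmt"
        "log"
        "os"
        "time"

        bolt "go.etcd.io/bbolt"
    )

    func blitzIfCorrupt(path string) error {
        db, err := bolt.Open(path, 0o600, &bolt.Options{Timeout: 5 * time.Second})
        if err == nil {
            // Opens cleanly: nothing to do.
            return db.Close()
        }

        // Won't open: rename rather than delete so the broken file is kept
        // for a post-mortem, then let the service repopulate on restart.
        backup := fmt.Sprintf("%s.corrupt-%d", path, time.Now().Unix())
        if renameErr := os.Rename(path, backup); renameErr != nil {
            return renameErr
        }
        log.Printf("moved corrupt db to %s; restart the service to rebuild", backup)
        return nil
    }

    func main() {
        const path = "/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db" // assumed location
        if err := blitzIfCorrupt(path); err != nil {
            log.Fatal(err)
        }
    }

A real check would also walk the db contents (as in the earlier sketch), since corruption doesn't always fail at open time.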