
797 points by burnerbob | 8 comments
throwawaaarrgh No.36813314
There's a lot of bullshit in this HN thread, but here's the important takeaway:

- it seems their staff were working on the issue before customers noticed it.

- once paid support was emailed, it took many hours for them to respond.

- it took about 20 hours for an update from them on the downed host.

- they weren't updating the affected users about the downed host or how to recover.

- the status page was bullshit - just said everything was green even though they told customers in their own dashboard they had emergency maintenance going on.

I get that due to the nature of their plans and architecture, downtime like this is guaranteed and normal. But communication this poor is going to lose you customers. Be like other providers, who spam me with emails whenever a host I'm on even feels ticklish. Then at least I can go do something for my own apps immediately.

replies(6): >>36814300 #>>36814376 #>>36814608 #>>36814689 #>>36816612 #>>36817532 #
seti0Cha No.36814300
Not a great summary from my perspective. Here's what I got out of it:

- Their free tier support depended on noticing message board activity and they didn't.

- Those experiencing outages were seeing the result of deploying in a non-HA configuration. Opinions differ as to whether they were properly aware that they were in that state.

- They had an unusually long outage for one particular server.

- Those points combined resulted in many people experiencing an unexplained prolonged outage.

- Their dashboard shows only regional and service outages, not individual servers being down. People did not realize this and so assumed it was a lie.

- Some silliness with Discourse tags caused people to think they were trying to hide the problems.

In short: bad luck, some bad procedures from a customer-management POV, and possibly some bad documentation resulted in a lot of smoke but not a lot of fire.

replies(2): >>36814431 #>>36814471 #
tptacek No.36814431
Apologies for repeating myself, but:

You get to a certain number of servers and the probability on any one day that some server somewhere is going to hiccup and bounce gets pretty high. That's what happened here: a single host in Sydney, one of many, had a problem.
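A quick back-of-the-envelope illustration of that fleet math, with made-up numbers (the per-host failure rate and fleet size below are assumptions for the example, not Fly.io figures):

  package main

  import (
  	"fmt"
  	"math"
  )

  func main() {
  	// Illustrative assumptions only: each host independently has a
  	// 0.1% chance of hiccuping on any given day, and the fleet has
  	// 1,000 hosts.
  	perHostDaily := 0.001
  	hosts := 1000.0

  	// P(at least one host has a bad day) = 1 - P(no host has a bad day)
  	pAnyDown := 1 - math.Pow(1-perHostDaily, hosts)
  	fmt.Printf("P(some host hiccups today) ~ %.0f%%\n", pAnyDown*100) // ~63%
  }

Even with very reliable individual machines, "some machine, somewhere, today" quickly becomes close to a certainty as the fleet grows.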

When we have an incident with a single host, we update a notification channel for people with instances on that host. They are a tiny sliver of all our users, but of course that's cold comfort for them; they're experiencing an outage! That's what happened here: we did the single-host notification thing for users with apps on that Sydney host.

Normally, when we have a single-host incident, the host is back online pretty quickly. Minutes, maybe double-digit minutes if something gnarly happened. About once every 18 months or so, something worse than gnarly happens to a server (they're computers, we're not magic, all the bad things that happen to computers happen to us too). That's what happened here: we had an extended single-host outage, one that lasted over 12 hours.

(Specifically, if you're interested: somehow a containerd boltdb on that host got corrupted, so when the machine bounced, containerd refused to come back online. We use containerd as a cache for OCI container images backing flyd; if containerd goes down, no new machines can start on the host. It took a member of our team, also a containerd maintainer, several hours to do battlefield surgery on that boltdb to bring the host back up.)
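For the curious, here is a rough sketch of what that kind of salvage can look like, using the go.etcd.io/bbolt library that containerd's metadata store is built on. This is not Fly.io's actual repair procedure (on a badly corrupted file these reads may fail or panic outright); it only illustrates copying whatever buckets are still readable into a fresh file:

  package main

  import (
  	"log"

  	bolt "go.etcd.io/bbolt"
  )

  func main() {
  	// Open the damaged database read-only; depending on where the
  	// corruption sits, this alone may fail and deeper surgery is needed.
  	src, err := bolt.Open("meta.db.corrupt", 0600, &bolt.Options{ReadOnly: true})
  	if err != nil {
  		log.Fatalf("open corrupt db: %v", err)
  	}
  	defer src.Close()

  	dst, err := bolt.Open("meta.db.rebuilt", 0600, nil)
  	if err != nil {
  		log.Fatalf("open new db: %v", err)
  	}
  	defer dst.Close()

  	// Walk every top-level bucket we can still read and copy its keys
  	// into the fresh file.
  	err = src.View(func(stx *bolt.Tx) error {
  		return stx.ForEach(func(name []byte, b *bolt.Bucket) error {
  			return dst.Update(func(dtx *bolt.Tx) error {
  				nb, err := dtx.CreateBucketIfNotExists(name)
  				if err != nil {
  					return err
  				}
  				return b.ForEach(func(k, v []byte) error {
  					if v == nil {
  						return nil // nested bucket; a real tool would recurse
  					}
  					return nb.Put(k, v)
  				})
  			})
  		})
  	})
  	if err != nil {
  		log.Printf("partial copy only: %v", err)
  	}
  }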

Now, as you can see from the fact that we were at the top of HN all night, there is a difference between a 5-minute single-host incident and a 12-hour single-host outage. Our runbook for single-host problems is tuned for the former. 12-hour single-host outages are pretty rare, and we probably want to put them on the global status page (I'm choosing my words carefully because we have an infra team and infra management and I'm not on it, and I don't want to speak for them or, worse, make commitments for them; all I can say is I get where people are coming from with this one).

replies(4): >>36814531 #>>36814688 #>>36815167 #>>36816132 #
1. seti0Cha No.36814688
It seems to me like there's room for improving your customers' awareness around what is required for HA and how to tell when they are affected by a hardware issue. On the other hand, it may just be that the confusion is mostly amongst the casual onlookers, in which case you have my sympathies!
replies(1): >>36814971 #
2. CoolCold No.36814971
I'm not sure about this, though it may make some sense: customers who DON'T WANT to be aware of what HA requires (say, solo devs) are the ones choosing this type of hosting. Even if you put up educational articles, I'm unsure they will be read. Putting a BANNER IN RED LETTERS into the CLI output, plus a link to an article, may work, though.

What do you think?

replies(2): >>36815207 #>>36815219 #
3. wgjordan No.36815207
This is exactly how it currently works:

  $ fly volumes create mydata
  Warning! Individual volumes are pinned to individual hosts.
  You should create two or more volumes per application.
  You will have downtime if you only create one.
  Learn more at https://fly.io/docs/reference/volumes/
  ? Do you still want to use the volumes feature? (y/N)
(and yes, the warning is already even in red letters too)
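For illustration, a warning-plus-confirmation prompt like that takes only a few lines; the sketch below is a generic stand-in in Go (the wording mirrors the output above, but this is not flyctl's actual code):

  package main

  import (
  	"bufio"
  	"fmt"
  	"os"
  	"strings"
  )

  func main() {
  	// ANSI red to make the warning hard to miss.
  	const red, reset = "\033[31m", "\033[0m"
  	fmt.Println(red + "Warning! Individual volumes are pinned to individual hosts." + reset)
  	fmt.Println(red + "You should create two or more volumes per application." + reset)

  	fmt.Print("? Do you still want to use the volumes feature? (y/N) ")
  	answer, _ := bufio.NewReader(os.Stdin).ReadString('\n')
  	if strings.ToLower(strings.TrimSpace(answer)) != "y" {
  		fmt.Println("Aborted.")
  		os.Exit(1)
  	}
  	fmt.Println("Proceeding...")
  }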
replies(1): >>36816705 #
4. seti0Cha No.36815219
I agree, articles tend not to get read by those who need them most. A warning from the CLI and a banner on the app management page with a link to a detailed explanation would seem like a good approach.

edit: sibling post shows there is such a message on the CLI. The only other thing I can think of is an "Are you sure you want to do this?" prompt, but in the end you can't reach everybody.

replies(2): >>36816844 #>>36816892 #
5. CoolCold No.36816705
Sounds like it hasn't helped, then; no need to even guess. One of those moments when you have mixed feelings about being right.
6. CoolCold No.36816844
Indeed
7. tptacek No.36816892
There is an "Are you sure you want to do this?" prompt!
replies(1): >>36817856 #
8. seti0Cha No.36817856
Make them type the phrase "I'm OK with downtimes of arbitrary length"!

I kid, seems like you guys did what you could.