797 points by burnerbob | 22 comments
throwawaaarrgh ◴[] No.36813314[source]
There's a lot of bullshit in this HN thread, but here's the important takeaway:

- it seems their staff were working on the issue before customers noticed it.

- once paid support was emailed, it took many hours for them to respond.

- it took about 20 hours for an update from them on the downed host.

- they weren't updating affected users about the downed host or about ways to recover.

- the status page was bullshit - just said everything was green even though they told customers in their own dashboard they had emergency maintenance going on.

I get that due to the nature of their plans and architecture, downtime like this is guaranteed and normal. But communication this poor is going to lose you customers. Be like other providers, who spam me with emails whenever a host I'm on even feels ticklish. Then at least I can go do something for my own apps immediately.

replies(6): >>36814300 #>>36814376 #>>36814608 #>>36814689 #>>36816612 #>>36817532 #
seti0Cha ◴[] No.36814300[source]
Not a great summary from my perspective. Here's what I got out of it:

- Their free tier support depended on noticing message board activity and they didn't.

- Those experiencing outages were seeing the result of deploying in a non-HA configuration. Opinions differ as to whether they were properly aware that they were in that state.

- They had an unusually long outage for one particular server.

- Those points combined resulted in many people experiencing an unexplained prolonged outage.

- Their dashboard shows only regional and service outages, not individual servers being down. People did not realize this and so assumed it was a lie.

- Some silliness with Discourse tags caused people to think they were trying to hide the problems.

In short: bad luck, some bad procedures from a customer-management POV, and possibly some bad documentation resulted in a lot of smoke but not a lot of fire.

replies(2): >>36814431 #>>36814471 #
1. tptacek ◴[] No.36814431[source]
Apologies for repeating myself, but:

You get to a certain number of servers and the probability on any one day that some server somewhere is going to hiccup and bounce gets pretty high. That's what happened here: a single host in Sydney, one of many, had a problem.
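The fleet-scale arithmetic behind this point is easy to sketch. The per-host failure probability and fleet sizes below are made-up illustrative numbers, not Fly.io's actual figures:

```python
# Probability that at least one of N independent hosts has an
# incident on a given day: 1 - (1 - p)^N.
# p and N here are illustrative assumptions, not real Fly.io data.

def p_any_host_fails(p_single: float, n_hosts: int) -> float:
    """Chance that at least one host hiccups today."""
    return 1 - (1 - p_single) ** n_hosts

# Even with very reliable hosts (roughly one incident per host
# every three years), a big enough fleet fails somewhere daily.
p = 1 / 1000  # per-host, per-day incident probability (assumed)
for n in (10, 100, 1000):
    print(n, round(p_any_host_fails(p, n), 3))
```

At 1,000 hosts the chance of a quiet day is only about 37%, which is why "some server somewhere is bouncing" becomes routine at scale.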

When we have an incident with a single host, we update a notification channel for people with instances on that host. They are a tiny sliver of all our users, but of course that's cold comfort for them; they're experiencing an outage! That's what happened here: we did the single-host notification thing for users with apps on that Sydney host.

Normally, when we have a single-host incident, the host is back online pretty quickly. Minutes, maybe double-digit minutes if something gnarly happened. About once every 18 months or so, something worse than gnarly happens to a server (they're computers, we're not magic, all the bad things that happen to computers happen to us too). That's what happened here: we had an extended single-host outage, one that lasted over 12 hours.

(Specifically, if you're interested: somehow a containerd boltdb on that host got corrupted, so when the machine bounced, containerd refused to come back online. We use containerd as a cache for OCI container images backing flyd; if containerd goes down, no new machines can start on the host. It took a member of our team, also a containerd maintainer, several hours to do battlefield surgery on that boltdb to bring the host back up.)

Now, as you can see from the fact that we were at the top of HN all night, there is a difference between a 5 minute single-host incident and a 12-hour single-host outage. Our runbook for single-host problems is tuned for the former. 12-hour single-host outages are pretty rare, and we probably want to put them on the global status page (I'm choosing my words carefully because we have an infra team and infra management and I'm not on it, and I don't want to speak for them or, worse, make commitments for them; all I can say is I get where people are coming from with this one).

replies(4): >>36814531 #>>36814688 #>>36815167 #>>36816132 #
2. CSSer ◴[] No.36814531[source]
Why are your customers exposed to this? This sounds like a tough problem that I'm sympathetic to for you personally, but it sounds like there's no failover or appropriate redundancy in place to rollover to while you work to fix the problem.

edit: I hope this comment doesn't sound accusatory. At the end of the day I want everyone to succeed. I hope there's a silver lining to this in the post-mortem.

replies(1): >>36814624 #
3. tptacek ◴[] No.36814624[source]
The way to not be exposed to this is to run an HA configuration with more than one instance.

If you're running an app on Fly.io without local durable storage, then it's easy to fail over to another server. But durable storage on Fly.io is attached NVMe storage.

By far the most common way people use durable storage on Fly.io is with Postgres databases. If you're doing that on Fly.io, we automatically manage failover at the application layer: you run multiple instances, they configure themselves in a single-writer multi-reader cluster, and if the leader fails, a replica takes over.

We will let you run a single-instance Postgres "cluster", and people definitely do that. The downside to that configuration is, if the host you're on blows up, your availability can take a hit. That's just how the platform works.
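The failover behavior described above can be illustrated with a toy model. This is a hand-written sketch of the single-writer/multi-reader idea only, not Fly.io's actual Postgres management code:

```python
# Toy model of a single-writer, multi-reader cluster: one leader
# accepts writes; if its host dies, a surviving replica is
# promoted. With only one node, there is nothing to promote.

class Cluster:
    def __init__(self, nodes):
        self.nodes = list(nodes)      # node names, all healthy
        self.leader = self.nodes[0]   # the single writer

    def readers(self):
        return [n for n in self.nodes if n != self.leader]

    def node_failed(self, node):
        self.nodes.remove(node)
        if node == self.leader:
            if not self.nodes:
                raise RuntimeError("no replicas left: outage")
            self.leader = self.nodes[0]   # promote a replica

ha = Cluster(["pg-1", "pg-2", "pg-3"])
ha.node_failed("pg-1")     # the leader's host blows up
print(ha.leader)           # a replica took over: pg-2

solo = Cluster(["pg-only"])
# solo.node_failed("pg-only")  # would raise: nothing to promote
```

The single-instance "cluster" is the `solo` case: the same failure that is a brief blip for the HA cluster becomes a full outage.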

replies(2): >>36814820 #>>36815463 #
4. seti0Cha ◴[] No.36814688[source]
It seems to me like there's room for improving your customers' awareness around what is required for HA and how to tell when they are affected by a hardware issue. On the other hand, it may just be that the confusion is mostly amongst the casual onlookers, in which case you have my sympathies!
replies(1): >>36814971 #
5. CSSer ◴[] No.36814820{3}[source]
I see. Have you considered eliminating this configuration from your offering? It sounds like the terminology could confuse people; they may be assuming a host isn't really what it is (a single host). This kind of thing is difficult for those building managed services, because people expect you to provide offerings that can't harm them when the cause is related to the service they're paying for, and it's difficult to figure out which sharp objects they understand and which ones they don't. People should know better, but if they did, would they need you?

If this sounds ludicrous, then I think I probably don't understand who Fly.io wants to be and that's okay. If I don't understand, however, you may want to take a look at your image and messaging to potentially recalibrate what kind of customers you're attracting.

replies(1): >>36815012 #
6. CoolCold ◴[] No.36814971[source]
I'm not sure this will make sense - customers who DON'T WANT to be aware of what's required for HA (say, solo devs) are the ones choosing this type of hosting. Even if you publish educational articles, I'm not sure they'll be read. Putting a BANNER IN RED LETTERS into the CLI output, plus a link to an article, might work, though.

What do you think?

replies(2): >>36815207 #>>36815219 #
7. TheDong ◴[] No.36815012{4}[source]
Plenty of people would rather take downtime than pay for redundancy, for example for a test database.

AWS RDS lets you spin up a RDS instance that costs 3x less and regularly has downtime (the 'single-az' one), quite similar to this.

Anyone who's used servers before knows that "a single instance" means "sometimes you might have downtime".

Computers aren't magic. Everyone from heroku (you must have multiple dynos to be highly available) to ec2 (multiple instances across AZs) agrees that "a single machine is not redundant". I don't see how fly's messaging is out of line with that. They don't tell you anywhere "Our apps and machines are literally magic and will never fail".

replies(2): >>36816557 #>>36821067 #
8. kunley ◴[] No.36815167[source]
> somehow a containerd boltdb on that host got corrupted, so when the machine bounced, containerd refused to come back online. We use containerd as a cache

Hey, even though I sympathize with this course of unfortunate events, it's hard not to comment:

if you're using a cache, you should invalidate it on failure!

replies(1): >>36815351 #
9. wgjordan ◴[] No.36815207{3}[source]
This is exactly how it currently works:

  $ fly volumes create mydata
  Warning! Individual volumes are pinned to individual hosts.
  You should create two or more volumes per application.
  You will have downtime if you only create one.
  Learn more at https://fly.io/docs/reference/volumes/
  ? Do you still want to use the volumes feature? (y/N)
(and yes, the warning is already even in red letters too)
replies(1): >>36816705 #
10. seti0Cha ◴[] No.36815219{3}[source]
I agree, articles tend not to get read by those who need them most. A warning from the CLI and a banner on the app management page with a link to a detailed explanation would seem like a good approach.

edit: sibling post shows there is such a message on the CLI. The only other thing I can think of is an "Are you sure you want to do this?" prompt, but in the end you can't reach everybody.

replies(2): >>36816844 #>>36816892 #
11. tptacek ◴[] No.36815351[source]
It's a read-through cache. This wasn't a cache invalidation issue. It's a systems-level state corruption problem that just happened to break a system used primarily as a cache.
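For readers unfamiliar with the term: in a read-through cache, a miss simply falls through to the backing source and repopulates the entry, so stale entries aren't the failure mode here. A minimal sketch of the pattern (unrelated to containerd's actual implementation):

```python
# Minimal read-through cache: on a miss, fetch from the backing
# store and populate the cache. No explicit invalidation is
# needed; the failure in the incident was the cache's own on-disk
# state (a boltdb) becoming unreadable, not stale entries.

class ReadThroughCache:
    def __init__(self, fetch):
        self.fetch = fetch   # key -> value (e.g. a remote registry)
        self.store = {}      # stands in for the local database

    def get(self, key):
        if key not in self.store:        # miss: read through
            self.store[key] = self.fetch(key)
        return self.store[key]

registry_pulls = []
def pull_image(ref):                     # hypothetical fetcher
    registry_pulls.append(ref)
    return f"layers-for-{ref}"

cache = ReadThroughCache(pull_image)
cache.get("alpine:3.18")
cache.get("alpine:3.18")    # second read served from the cache
print(registry_pulls)       # ["alpine:3.18"] - one remote pull
```

When the `store` itself is corrupted, the fix is repairing (or wiping and repopulating) the store, which is a different problem from invalidating entries.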
replies(1): >>36821279 #
12. kevin_nisbet ◴[] No.36815463{3}[source]
Unless something has changed and I'm out of date, I think a piece of context here is that Fly Postgres isn't really a managed service offering. From what I've seen, Fly does try to communicate this, but it's still easy for some subset of customers to miss that they're deploying an OSS component, perhaps in a non-HA setup they've forgotten about, and that it's not the same as buying a database-as-a-service.

So hopefully, as fly.io gets more popular, there will be some compelling managed offerings. I saw comments at one point from the Neon CEO about a fly.io offering, but I'm not sure if that went anywhere. I'm sure customers can also use Crunchy or other offerings.

13. laweijfmvo ◴[] No.36816132[source]
> over 12 hours

How much is over 12 hours? 12 hours and 10 minutes? 13 hours? 67 days?

14. remram ◴[] No.36816557{5}[source]
Single-AZ is not single-host though, and while a single AZ can go down during major events, it doesn't break because a single piece of hardware failed.
replies(1): >>36816666 #
15. makoz ◴[] No.36816666{6}[source]
Sure, but isn't this more about risk tolerance at this point, and about how much your customers care? The responsibility should be on the customer's end. Running on EBS/RDS doesn't guarantee you won't lose data; if you care about it, you enable backups and test recovery.

Just because some customers are less tolerant of downtime than others doesn't mean those options shouldn't be offered to people who don't have the same requirements or are willing to work around them.

16. CoolCold ◴[] No.36816705{4}[source]
Sounds like it hasn't helped already - no need even to guess. One of those moments when you have mixed feelings about being right.
17. CoolCold ◴[] No.36816844{4}[source]
Indeed
18. tptacek ◴[] No.36816892{4}[source]
There is an "Are you sure you want to do this?" prompt!
replies(1): >>36817856 #
19. seti0Cha ◴[] No.36817856{5}[source]
Make them type the phrase "I'm OK with downtimes of arbitrary length"!

I kid, seems like you guys did what you could.

20. CSSer ◴[] No.36821067{5}[source]
I don't disagree. I was latching onto the idea that people are running single-node "clusters". Whatever it is, it isn't a cluster.
21. kunley ◴[] No.36821279{3}[source]
What I meant is that if the compromised host was unable to use the broken boltdb cache, the cache should have been zeroed and repopulated. Would rebuilding the cache really have taken hours, versus the hours spent trying to fix the boltdb?

Btw, I'm happy I keep only small amounts of data in any of my bolt databases...

replies(1): >>36821769 #
22. tptacek ◴[] No.36821769{4}[source]
This isn't a boltdb we designed. It's just containerd. I am probably not doing the outage justice, because "blitz and repopulate" is a time-honored strategy here.