
797 points by burnerbob | 2 comments
spiderice ◴[] No.36809650[source]
There is now a response to the support thread from Fly[1]:

> Hi Folks,

> Just wanted to provide some more details on what happened here, both with the thread and the host issue.

> The radio silence in this thread wasn’t intentional, and I’m sorry if it seemed that way. While we check the forum regularly, sometimes topics get missed. Unfortunately this thread slipped by us until today, when someone saw it and flagged it internally. If we’d seen it earlier, we’d have offered more details then.

> More on what happened: We had a single host in the syd region go down, hard, with multiple issues. In short, the host required a restart, then refused to come back online cleanly. Once back online, it refused to connect with our service discovery system. Ultimately it required a significant amount of manual work to recover.

> Apps running multiple instances would have seen the instance on this host go unreachable, but other instances would have remained up and new instances could be added. Single instance apps on this host were unreachable for the duration of the outage. We strongly recommend running multiple instances to mitigate the impact of single-host failures like this.

> The main status page (status.fly.io) is used for global and regional outages. For single host issues like this one we post alerts on the status tab in the dashboard (the emergency maintenance message @south-paw posted). This was an abnormally long single-host failure and we’re reassessing how these longer-lasting single-host outages are communicated.

> It sucks to feel ignored when you’re having issues, even when it’s not intentional. Sorry we didn’t catch this thread sooner.

[1] https://community.fly.io/t/service-interruption-cant-destroy...
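
Since the takeaway is "run multiple instances," here's roughly what that looks like in practice. A minimal sketch with flyctl, assuming a hypothetical app name (my-app); flag names are per current flyctl docs, so verify against your installed version:

    # Run two machines so one host failing leaves the other serving
    fly scale count 2 --app my-app

    # Or spread instances across regions (syd plus a second region)
    fly scale count 2 --region syd,sin --app my-app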

replies(10): >>36809693 #>>36809725 #>>36809824 #>>36809928 #>>36810269 #>>36810740 #>>36811025 #>>36812597 #>>36812956 #>>36813681 #
benjaminwootton ◴[] No.36811025[source]
Should losing a single host machine be a big deal nowadays? Instance failure is a fact of life.

Even if customers are only running one instance, I would expect the whole thing to rebalance in an automated way, especially with fly.io being so container-centric.

It also sounds like this is some managed Postgres service rather than users running only one instance of their container, so it’s even more reasonable to expect resilience to host failure?

replies(3): >>36811755 #>>36811788 #>>36813069 #
smallerfish ◴[] No.36811788[source]
If you lose a single instance on RDS and you don't have replication set up, you'll also have downtime. (Maybe not with Aurora?)

And +1 to the sibling comment; Fly makes it very clear that single-instance Postgres isn't HA, and talks about what you need to do architecturally to maintain uptime.
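
For context, that HA/non-HA choice shows up when the cluster is created. A hedged sketch with flyctl (the --initial-cluster-size flag is per flyctl's postgres docs; double-check your version):

    # Single node: cheapest, but exactly the failure mode in this thread
    fly postgres create --initial-cluster-size 1

    # Three nodes: a primary plus replicas that can take over on failure
    fly postgres create --initial-cluster-size 3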

replies(3): >>36812045 #>>36812724 #>>36813726 #
marcinzm ◴[] No.36812045[source]
Downtime, but limited downtime, since the data is stored redundantly across multiple machines in the same AZ. So unless the AZ goes down (which is a different failure from what happened here), you can restart the DB on a different instance pretty quickly, and I'm guessing AWS will do it automatically for you.

edit: removed "triple" as I'm not certain about the level of redundancy
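
If you'd rather not rely on that restart-in-place behavior, Multi-AZ is the standard RDS answer: a synchronous standby in another AZ with automatic failover. A sketch with the AWS CLI, instance identifier hypothetical:

    # Convert an existing single-AZ instance to Multi-AZ
    aws rds modify-db-instance \
        --db-instance-identifier mydb \
        --multi-az \
        --apply-immediately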

replies(1): >>36812543 #
1. truetraveller ◴[] No.36812543[source]
I don't believe their RDS / EBS has 3x redundancy. With SSD, that would be super costly for them. But if that's correct, that would be incredible.
replies(1): >>36812731 #
2. marcinzm ◴[] No.36812731[source]
May not be 3x, but it is replicated, so even a total instance failure would not make you lose data:

>Amazon EBS volumes are designed to be highly available, reliable, and durable. At no additional charge to you, Amazon EBS volume data is replicated across multiple servers in an Availability Zone to prevent the loss of data from the failure of any single component. For more details, see the Amazon EBS Service Level Agreement.

https://aws.amazon.com/ebs/features/#Amazon_EBS_availability...
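
One caveat on the quoted guarantee: that replication stays within a single AZ, so it protects against component failure, not AZ loss. EBS snapshots (stored regionally) are the usual answer for the latter; a sketch, volume ID hypothetical:

    # Snapshot the volume for durability beyond the AZ
    aws ec2 create-snapshot \
        --volume-id vol-0123456789abcdef0 \
        --description "pre-maintenance backup"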