797 points burnerbob | 1 comments
spiderice ◴[] No.36809650[source]
There is now a response to the support thread from Fly[1]:

> Hi Folks,

> Just wanted to provide some more details on what happened here, both with the thread and the host issue.

> The radio silence in this thread wasn’t intentional, and I’m sorry if it seemed that way. While we check the forum regularly, sometimes topics get missed. Unfortunately this thread slipped by us until today, when someone saw it and flagged it internally. If we’d seen it earlier, we’d have offered more details then.

> More on what happened: We had a single host in the syd region go down, hard, with multiple issues. In short, the host required a restart, then refused to come back online cleanly. Once back online, it refused to connect with our service discovery system. Ultimately it required a significant amount of manual work to recover.

> Apps running multiple instances would have seen the instance on this host go unreachable, but other instances would have remained up and new instances could be added. Single instance apps on this host were unreachable for the duration of the outage. We strongly recommend running multiple instances to mitigate the impact of single-host failures like this.

> The main status page (status.fly.io) is used for global and regional outages. For single host issues like this one we post alerts on the status tab in the dashboard (the emergency maintenance message @south-paw posted). This was an abnormally long single-host failure and we’re reassessing how these longer-lasting single-host outages are communicated.

> It sucks to feel ignored when you’re having issues, even when it’s not intentional. Sorry we didn’t catch this thread sooner.

[1] https://community.fly.io/t/service-interruption-cant-destroy...

mrcwinn ◴[] No.36809725[source]
For what it’s worth, I left Fly because of this crap. At first my Fly machine web app had intermittent connection issues to a new production PG machine. Then my PG machine died. Hard. I lost all data. A restart didn’t work - it could not recover. I restored an older backup over at RDS and couldn’t be happier I left.
quickthrower2 ◴[] No.36810724[source]
Fly has been in my “try later” book since a year or two ago. I remember it was hard to deploy anything due to downtime, so I gave up. Sad that stuff like this still happens.

You shouldn’t need to set up multi-region Postgres yourself. They should have redundancy across at least two data centres within the region, and it should just work.

Hope they get some magic sauce to become better at this.

throwawaymaths ◴[] No.36811584[source]
> Hope they get some magic sauce to become better at this.

When I saw them describe their multiregion SQL replication architecture I thought "what crazy person thought this wouldn't eventually open up a spider's nest of distributed systems errors?"

api ◴[] No.36813480[source]
CockroachDB does this, but that's the result of over 10 years of heads-down, hard-ass engineering, and it's still slower than Postgres because distributed sync is not free. That means you have to provision it properly, with enough resources.

Their license would require a company like fly.io to pay them though, so I'm sure this resulted in fly.io instead trying to whip up an improvised infrastructure on the back of stock Postgres. I bet this cost them a whole lot more than paying CockroachDB would have, but devs have been conditioned that you should never ever pay for software even if it's the result of tons of deep engineering and solves massive brutal problems for you. I also bet there's some not-invented-here ego involved.

P.S. I don't work for CDB but I would absolutely consider them and we may end up using them at some point. They let you do a ton for free. They only charge for stuff you need if you get really really huge or if you are running a SaaS reselling DB services like fly.io would have been doing.