Fly.io Postgres cluster down for 3 days, no word from them about it

(webcache.googleusercontent.com)

797 points burnerbob | 1 comments | 20 Jul 23 23:42 UTC

spiderice ◴[21 Jul 23 03:25 UTC] No.36809650[source]▶

There is now a response to the support thread from Fly[1]:

> Hi Folks,

> Just wanted to provide some more details on what happened here, both with the thread and the host issue.

> The radio silence in this thread wasn’t intentional, and I’m sorry if it seemed that way. While we check the forum regularly, sometimes topics get missed. Unfortunately this one slipped by us until today, when someone saw it and flagged it internally. If we’d seen it earlier, we’d have offered more details then.

> More on what happened: We had a single host in the syd region go down, hard, with multiple issues. In short, the host required a restart, then refused to come back online cleanly. Once back online, it refused to connect with our service discovery system. Ultimately it required a significant amount of manual work to recover.

> Apps running multiple instances would have seen the instance on this host go unreachable, but other instances would have remained up and new instances could be added. Single instance apps on this host were unreachable for the duration of the outage. We strongly recommend running multiple instances to mitigate the impact of single-host failures like this.

> The main status page (status.fly.io) is used for global and regional outages. For single host issues like this one we post alerts on the status tab in the dashboard (the emergency maintenance message @south-paw posted). This was an abnormally long single-host failure and we’re reassessing how these longer-lasting single-host outages are communicated.

> It sucks to feel ignored when you’re having issues, even when it’s not intentional. Sorry we didn’t catch this thread sooner.

[1] https://community.fly.io/t/service-interruption-cant-destroy...

replies(10): >>36809693 #>>36809725 #>>36809824 #>>36809928 #>>36810269 #>>36810740 #>>36811025 #>>36812597 #>>36812956 #>>36813681 #

mrcwinn ◴[21 Jul 23 03:39 UTC] No.36809725[source]▶

>>36809650 #

For what it’s worth, I left Fly because of this crap. At first my Fly machine web app had intermittent connection issues to a new production PG machine. Then my PG machine died. Hard. I lost all data. A restart didn’t work - it could not recover. I restored an older backup over at RDS and couldn’t be happier I left.

replies(5): >>36809880 #>>36810018 #>>36810039 #>>36810724 #>>36814012 #

steve_adams_86 ◴[21 Jul 23 04:07 UTC] No.36809880[source]▶

>>36809725 #

I left digitalocean for fly because some of their tooling was excellent. I was pretty excited.

I’m back on digitalocean now. I’m not unhappy about it, they’re very solid. I don’t love some things about their services, but overall I’d highly recommend them to other developers.

I gave up on fly because I’d spontaneously be unable to automate deployments due to limited resources. Or I’d have previously happy deployments go missing with no automatic recovery. I didn’t realize this was happening to a number of my services until I started monitoring with 3rd party tools, and it became evident that I really couldn’t rely on them.

It’s a shame because I do like a lot of other things about them. Even for hobby work it didn’t seem worth the trouble. With digitalocean, everything “just works”. There’s no free tier, but the lower end of pricing means I can run several Go apps off of the same droplet for less than the price of a latte. It’s worth the sanity.

replies(4): >>36810127 #>>36810379 #>>36813660 #>>36813890 #

NicoJuicy ◴[21 Jul 23 05:44 UTC] No.36810379[source]▶

>>36809880 #

I moved from DO to Hetzner (cheaper), and I am happy about it.

replies(7): >>36810595 #>>36810697 #>>36810760 #>>36810809 #>>36810954 #>>36812172 #>>36813077 #

YetAnotherNick ◴[21 Jul 23 06:50 UTC] No.36810760[source]▶

>>36810379 #

Does anyone know how Hetzner's pricing is half of DO's yet it is profitable, while DO is loss-making with a 6% operating margin?

replies(6): >>36810793 #>>36811511 #>>36811571 #>>36811650 #>>36812116 #>>36812917 #

1. fxtentacle ◴[21 Jul 23 10:21 UTC] No.36812116[source]▶

>>36810760 #

I've been with them for a long time and my guesses would be:

1. Strict rules and strict customer verification. Crypto mining that wastes SSDs is not allowed. Portscans, mass emails, etc. are not allowed. They also don't offer GPUs to the general public because it has been abused in the past. You usually need to send in ID documents just to open an account. My guess is this allows them to avoid most bad actors and, thereby, waste less money on fraud.

2. Extremely long-term investments. They typically build their own hardware and then use it over 10 years. They have their own flea market where you can rent older server models for a steep discount. That means they will have a long time where the hardware is fully paid off and still generating revenue.

3. Great service. With a mid-sized company, I can call their technicians in the middle of the night. The fact that we could call them in case of a crisis has generated A LOT of good will. But I would be truly surprised if they didn't make a profit off those phone calls, as they charge roughly 4x the salary cost.

4. High-margin managed services. In addition to just the cheap servers, they also offer a managed service where they will do OS and security upgrades for you. It's roughly 2x the price of the server and it appears to be almost fully automated. I know some freelance web designers who will insist on using Hetzner Managed for deployment for their clients, because it is just so convenient. You effectively pass off all recurring maintenance for €300 a month and your client is happy to have an emergency phone number (see #3) in case the box goes down.
