797 points burnerbob | 4 comments
spiderice ◴[] No.36809650[source]
There is now a response to the support thread from Fly[1]:

> Hi Folks,

> Just wanted to provide some more details on what happened here, both with the thread and the host issue.

> The radio silence in this thread wasn’t intentional, and I’m sorry if it seemed that way. While we check the forum regularly, sometimes topics get missed. Unfortunately this thread slipped by us until today, when someone saw it and flagged it internally. If we’d seen it earlier, we’d have offered more details then.

> More on what happened: We had a single host in the syd region go down, hard, with multiple issues. In short, the host required a restart, then refused to come back online cleanly. Once back online, it refused to connect with our service discovery system. Ultimately it required a significant amount of manual work to recover.

> Apps running multiple instances would have seen the instance on this host go unreachable, but other instances would have remained up and new instances could be added. Single instance apps on this host were unreachable for the duration of the outage. We strongly recommend running multiple instances to mitigate the impact of single-host failures like this.

> The main status page (status.fly.io) is used for global and regional outages. For single host issues like this one we post alerts on the status tab in the dashboard (the emergency maintenance message @south-paw posted). This was an abnormally long single-host failure and we’re reassessing how these longer-lasting single-host outages are communicated.

> It sucks to feel ignored when you’re having issues, even when it’s not intentional. Sorry we didn’t catch this thread sooner.

[1] https://community.fly.io/t/service-interruption-cant-destroy...

replies(10): >>36809693 #>>36809725 #>>36809824 #>>36809928 #>>36810269 #>>36810740 #>>36811025 #>>36812597 #>>36812956 #>>36813681 #
oefrha ◴[] No.36812597[source]
I was confused about why support for a platform failure relies on a forum that employees may or may not check. After checking the docs[1], apparently you have to be on a paid plan (at least $29/mo) to access email support, so you may not have it even if you’re paying for resources.

I won’t be using it for side projects where I’m okay with paying $5-10/mo but don’t want to have three day outages.

[1] https://fly.io/docs/about/support/

replies(1): >>36813041 #
1. MuffinFlavored ◴[] No.36813041[source]
Forewarning: I am not being critical of fly.io nor their free support whatsoever when I say this.

From a technical perspective, could they have "been better"? I see their name a lot on HN so I know they are doing really cool + advanced things and this is probably some super small edge case that slipped through the cracks.

Could they have added some message / do we as the HN community feel they needed to be like "we're gonna add some extra logging/monitoring going forward so it won't happen again"?

By all means, they probably don't owe anybody in terms of stability + uptime guarantees when it comes to a free tier. Sh*t happens.

replies(3): >>36813087 #>>36813165 #>>36813910 #
2. riwsky ◴[] No.36813087[source]
They broke uptime for the paid tier, not just the free tier.

The relevance of paid/free is that free (and cheap paid) plans don’t get Fly support over email.

3. azemetre ◴[] No.36813165[source]
They may not owe anyone anything but over time these types of issues can cause a large reputation hit.

If I were just searching online or trying to find out what various communities think about Fly.io and saw several threads about major outages with poor communication, do you think I would use their services? It would be an immediate pass.

It takes a long time to build a reputation, and you can lose it instantly.

4. elderlydoofus ◴[] No.36813910[source]
FWIW: I am on the bottom tier of the paid plans ($29/mo) so I could get access to the email support, and even with that their response time is still not great.

I have an ongoing issue with one of my PG clusters where one of the nodes is failing, and all my attempts at fixing it have failed (mainly cloning one of the other machines to bring the cluster numbers back to normal).

I emailed my account’s support email mid Friday morning last week and did not hear back until this past Monday night.

Sucks, because like a lot of others in this thread I like what Fly is trying to do and am rooting for them, but IMO they should use a significant chunk of that funding they just received on hiring a ton of SREs and front line customer support.

EDIT: I should add, the past times I have emailed them the response time was good. It's just that this most recent wait was so egregious (3 days to get even an initial response!) that I'm bringing it up.