
797 points burnerbob | 7 comments
throwawaaarrgh ◴[] No.36813314[source]
There's a lot of bullshit in this HN thread, but here's the important takeaway:

- it seems their staff were working on the issue before customers noticed it.

- once paid support was emailed, it took many hours for them to respond.

- it took about 20 hours for an update from them on the downed host.

- they weren't updating the affected users about the downed host or about ways to recover.

- the status page was bullshit - it just showed everything green even though their own dashboard was telling customers they had emergency maintenance going on.

I get that due to the nature of their plans and architecture, downtime like this is guaranteed and normal. But communication this poor is going to lose you customers. Be like other providers, who spam me with emails whenever a host I'm on even feels ticklish. Then at least I can go do something for my own apps immediately.
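A minimal sketch of the "go do something immediately" part, for what it's worth: an external health check you run yourself, so you aren't waiting on anyone's status page. The URL, addresses, and SMTP relay below are placeholders, not anything a provider gives you.

    # Minimal external health-check sketch; endpoint, addresses, and SMTP
    # relay are placeholders -- adapt to whatever you actually run.
    import smtplib
    import urllib.request
    from email.message import EmailMessage

    APP_URL = "https://example-app.example.com/health"  # placeholder endpoint

    def check_and_alert():
        try:
            with urllib.request.urlopen(APP_URL, timeout=10) as resp:
                if resp.status == 200:
                    return  # app answered, nothing to do
                reason = f"unexpected status {resp.status}"
        except Exception as exc:  # timeout, DNS failure, connection refused...
            reason = repr(exc)

        msg = EmailMessage()
        msg["Subject"] = f"ALERT: {APP_URL} unhealthy ({reason})"
        msg["From"] = "alerts@example.com"  # placeholder
        msg["To"] = "oncall@example.com"    # placeholder
        msg.set_content("Host may be down; go check, fail over, or redeploy.")
        with smtplib.SMTP("smtp.example.com") as smtp:  # placeholder relay
            smtp.send_message(msg)

    if __name__ == "__main__":
        check_and_alert()  # run from cron every minute or two

Run it on a schedule from somewhere outside the provider in question and you hear about a dead host without waiting for their status page to catch up.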

replies(6): >>36814300 #>>36814376 #>>36814608 #>>36814689 #>>36816612 #>>36817532 #
seti0Cha ◴[] No.36814300[source]
Not a great summary from my perspective. Here's what I got out of it:

- Their free-tier support depended on staff noticing message board activity, and they didn't.

- Those experiencing outages were seeing the result of deploying in a non-HA configuration. Opinions differ as to whether they were properly aware that they were in that state.

- They had an unusually long outage for one particular server.

- Those points combined resulted in many people experiencing an unexplained prolonged outage.

- Their dashboard shows only regional and service outages, not individual servers being down. People did not realize this and so assumed it was a lie.

- Some silliness with Discourse tags caused people to think they were trying to hide the problems.

In short: bad luck, some bad procedures from a customer-management POV, and possibly some bad documentation resulted in a lot of smoke but not a lot of fire.

replies(2): >>36814431 #>>36814471 #
CSSer ◴[] No.36814471[source]
I'm surprised by your risk tolerance. If I had any cloud service at this level in my stack go down for three days, I'd start shopping for an alternative. This exceeds my threshold of acceptability even for non-HA requirements. After all, if I can't trust them with this, why would I ever consider giving them my HA business? Just based on napkin math, this could have meant a loss of nearly half a million dollars for us. Up until this point, I've looked at Fly.io's approach to PR and their business as unconventional but endearing. Now I'm beginning to look at them as unserious. I'm sorry if that sounds harsh. It's the cold truth.
replies(2): >>36815583 #>>36815910 #
1. tinco ◴[] No.36815910[source]
I think you're not exposed enough to the reality of hardware. There was no need for the host to come back online at all; I think it was a mistake for Fly.io to even attempt it. Just tell the customer the host was lost and offer them a new one (with a freshly zeroed volume attached). You rent a machine, it breaks, you get a new one.

If they're sad that they lost their data, it's their fault for running on a single host with no backup. By actually performing an (apparently) difficult recovery, they reinforced their customers' erroneous expectation that Fly.io is somehow responsible for the integrity of the data on any single host.
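To make "no backup" concrete, here's a rough sketch of the kind of off-host dump that would have covered this, assuming Postgres and an S3-compatible bucket; the connection string and bucket name are placeholders, not anything Fly.io ships:

    # Rough sketch: dump the database and copy it off the host, so losing
    # the host costs at most one backup interval. DB_URL and BUCKET are
    # placeholders.
    import datetime
    import subprocess
    import boto3

    BUCKET = "example-offsite-backups"                   # placeholder bucket
    DB_URL = "postgres://app:secret@localhost:5432/app"  # placeholder

    def backup_once():
        stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
        dump_path = f"/tmp/app-{stamp}.sql.gz"
        # plain-format pg_dump with gzip compression (-Z)
        subprocess.run(
            ["pg_dump", "--no-owner", "-Z", "6", "-f", dump_path, DB_URL],
            check=True,
        )
        boto3.client("s3").upload_file(dump_path, BUCKET, f"app/{stamp}.sql.gz")

    if __name__ == "__main__":
        backup_once()  # run from cron or any scheduled job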

replies(3): >>36817834 #>>36820919 #>>36824212 #
2. itake ◴[] No.36817834[source]
Is this the posture of other hosting providers? If not, it seems other hosting providers offer better quality of service.
replies(1): >>36818191 #
3. tinco ◴[] No.36818191[source]
I would think so; it's honestly strange to think about. The idea of having the node come back after it broke is a bit ridiculous to me. A node breaks, you delete it from your interface and provision a new one; the idea of even waiting 5 minutes for it to come up is strange. This whole conversation seems detached from how the cloud is supposed to operate, and has operated, for the past decade.
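To be concrete about that workflow, here's a toy reconcile loop. FakeCloud is a made-up in-memory stand-in for whatever provider API you actually use, purely to illustrate the shape of it:

    # Toy "cattle, not pets" sketch: delete broken nodes and provision fresh
    # ones. FakeCloud is a made-up stand-in for a provider API.
    import itertools
    from dataclasses import dataclass

    @dataclass
    class Node:
        id: str
        healthy: bool

    class FakeCloud:
        def __init__(self, nodes):
            self._nodes = list(nodes)
            self._ids = itertools.count(1)

        def list_nodes(self):
            return list(self._nodes)

        def destroy(self, node_id):
            self._nodes = [n for n in self._nodes if n.id != node_id]

        def provision(self):
            node = Node(id=f"node-{next(self._ids)}", healthy=True)
            self._nodes.append(node)
            return node

    def reconcile(cloud, desired_count):
        # Delete anything unhealthy, then top back up to the desired count.
        for node in cloud.list_nodes():
            if not node.healthy:
                cloud.destroy(node.id)   # don't wait for it to come back
        for _ in range(desired_count - len(cloud.list_nodes())):
            cloud.provision()            # fresh node, fresh (empty) volume

    if __name__ == "__main__":
        cloud = FakeCloud([Node("a", healthy=True), Node("b", healthy=False)])
        reconcile(cloud, desired_count=2)
        print([n.id for n in cloud.list_nodes()])  # ['a', 'node-1']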
4. CSSer ◴[] No.36820919[source]
In hindsight I wish I could edit, because my above comment was pretty trigger-happy and overly focused on the amount of downtime. It was colored by some existing preconceptions I had about Fly, and I'm honestly surprised it continues to be upvoted. When I made that comment I hadn't yet learned some of the bits you mention here at the end from another thread. Anyway, I tend to agree overall. I actually suggested Fly reconsider offering this configuration at all, given that they refer to it as a "single-node cluster", which is an oxymoron.
5. Dylan16807 ◴[] No.36824212[source]
They're not responsible for extreme data recovery, but (almost?) all of the customer data volumes on that server were completely intact. They damn well should be responsible for getting that data back to their customers, whether or not they get the server going again.

If you run off a single drive, and the drive dies, any resulting data loss is your fault. But not if something else dies.

replies(1): >>36824655 #
6. markonen ◴[] No.36824655[source]
I'm absolutely 100% certain that AWS (for example) wouldn't do that for you with the instance types that feature direct attached storage.
replies(1): >>36827975 #
7. Dylan16807 ◴[] No.36827975{3}[source]
Directly attached storage in AWS is a special niche that disappears when you so much as hibernate. And even then they talk about how disk failure loses the data but power failure won't.

This is much closer to EBS breaking. It happens sometimes, but if the data is easily accessible then it shouldn't get tossed.