
797 points | burnerbob | 2 comments
throwawaaarrgh ◴[] No.36813314[source]
There's a lot of bullshit in this HN thread, but here's the important takeaway:

- it seems their staff were working on the issue before customers noticed it.

- once paid support was emailed, it took many hours for them to respond.

- it took about 20 hours for an update from them on the downed host.

- they weren't updating the affected users about the downed host or about ways to recover.

- the status page was bullshit: it showed everything green even while their own dashboard was telling customers that emergency maintenance was going on.

I get that due to the nature of their plans and architecture, downtime like this is guaranteed and normal. But communication this poor is going to lose you customers. Be like other providers, who spam me with emails whenever a host I'm on even feels ticklish. Then at least I can go do something for my own apps immediately.

replies(6): >>36814300 #>>36814376 #>>36814608 #>>36814689 #>>36816612 #>>36817532 #
seti0Cha ◴[] No.36814300[source]
Not a great summary from my perspective. Here's what I got out of it:

- Their free-tier support depended on staff noticing message board activity, and they didn't.

- Those experiencing outages were seeing the result of deploying in a non-HA configuration. Opinions differ as to whether those users were properly aware they were in that state.

- They had an unusually long outage for one particular server.

- Those points combined resulted in many people experiencing an unexplained prolonged outage.

- Their dashboard shows only regional and service outages, not individual servers being down. People did not realize this and so assumed it was a lie.

- Some silliness with Discourse tags caused people to think they were trying to hide the problems.

In short: bad luck, some bad procedures on the customer-management side, and possibly some bad documentation added up to a lot of smoke but not much fire.

replies(2): >>36814431 #>>36814471 #
CSSer ◴[] No.36814471[source]
I'm surprised by your risk tolerance. If any cloud service at this level of my stack went down for three days, I'd start shopping for an alternative. That exceeds what I can accept even for non-HA requirements. After all, if I can't trust them with this, why would I ever give them my HA business? Napkin math for us puts an outage like this at a potential loss of nearly half a million dollars. Up to this point I've seen Fly.io's approach to PR and to their business as unconventional but endearing. Now I'm beginning to see them as unserious. I'm sorry if that sounds harsh; it's the cold truth.
replies(2): >>36815583 #>>36815910 #
1. mrkurt ◴[] No.36815583[source]
You're saying a single server failure is going to cost your business half a million dollars?

This was a server with local NVMe storage. The simplest thing to do would have been to just get rid of it, but we have quite a few free users with data they care about running on single-node Postgres (because it's cheaper). It seemed like a better idea to recover this thing.
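For anyone unsure which camp they're in, here's a minimal sketch of checking whether any streaming replicas are attached to your Postgres primary (zero attached generally means you're running single-node). This isn't Fly-specific; it assumes Python with psycopg2 and a PG_DSN environment variable pointing at the primary, both of which are just illustrative choices, not anything Fly provides.

    # check_replicas.py: report whether a Postgres primary has any
    # streaming replicas attached (zero replicas => single-node).
    import os
    import psycopg2  # assumes psycopg2 (or psycopg2-binary) is installed

    def count_replicas(dsn: str) -> int:
        # pg_stat_replication has one row per connected streaming replica.
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT count(*) FROM pg_stat_replication;")
                (replicas,) = cur.fetchone()
        return replicas

    if __name__ == "__main__":
        # PG_DSN is an illustrative name; point it at your primary.
        dsn = os.environ.get("PG_DSN", "postgresql://localhost:5432/postgres")
        n = count_replicas(dsn)
        if n == 0:
            print("No streaming replicas attached: looks single-node.")
        else:
            print(f"{n} streaming replica(s) attached.")

Caveat: run it against the primary, since that's where pg_stat_replication reports attached standbys.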

replies(1): >>36821027 #
2. CSSer ◴[] No.36821027[source]
No, it wouldn't, at least not given the details of this particular situation, because we wouldn't deploy that way. Parts of my comment above still hold, but I admit it was a bit impulsive of me; I hadn't yet learned all the details needed to make that judgment call. The number is right under slightly different circumstances, if you're asking, but it sounds like you were trying to prove a point. If so, you succeeded. I learned a bit later that what they were calling a cluster was a single server, and that's just... yeah.