
797 points | burnerbob | 1 comment
throwawaaarrgh | No.36813314
There's a lot of bullshit in this HN thread, but here's the important takeaway:

- it seems their staff were working on the issue before customers noticed it.

- once paid support was emailed, it took many hours for them to respond.

- it took about 20 hours for an update from them on the downed host.

- they weren't updating affected users about the downed host or about ways to recover.

- the status page was bullshit: it just showed everything green, even though their own dashboard was telling customers that emergency maintenance was going on.

I get that due to the nature of their plans and architecture, downtime like this is guaranteed and normal. But communication this poor is going to lose you customers. Be like other providers, who spam me with emails whenever a host I'm on even feels ticklish. Then at least I can go do something for my own apps immediately.

wgjordan | No.36814689
(Fly.io employee here)

To clarify, we posted this incident to the personalized status page [1] of every affected customer within 30 minutes of this single host going down, and marked it resolved on the status page once the host was back, ~47h later. Here's the timeline (UTC):

- 2023-07-17 16:19 - host goes down

- 2023-07-17 16:49 - issue posted to personalized status page

- 2023-07-19 15:00 - host is fixed

- 2023-07-19 15:17 - issue marked resolved on status page

[1] https://community.fly.io/t/new-status-page/11398

Jupe | No.36815073
Ouch?

The bad news is that I'd be out of a job if I chose your service in this instance. 47 hours is nearly two full days. For an entire cluster to be down for that long is just unacceptable. Rebuilding a cluster from the last-known-good backup should not take that long, unless there are PBs of data involved; dividing such large data stores into separate clusters/instances seems warranted. Solution archs should steer customers to multiple, smaller clusters (sharding) whenever possible. It is far better to have some customers impacted (or just some of your customers' customers) than have all impacted, in my not so humble opinion.

And, if the data size is smaller, you may want to trigger a full rebuild earlier in your DR workflows just as an insurance policy.
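To make the sharding suggestion concrete, here's a minimal sketch (plain Python, hypothetical cluster names, nothing Fly-specific): hash each customer onto one of several small clusters, so a single downed cluster only takes out a fraction of customers, and its smaller dataset restores faster from the last-known-good backup.

    import hashlib

    # Hypothetical cluster names: several small failure domains instead of one big one.
    CLUSTERS = ["pg-shard-0", "pg-shard-1", "pg-shard-2", "pg-shard-3"]

    def cluster_for(customer_id: str) -> str:
        """Deterministically map a customer to one of the smaller clusters."""
        digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
        return CLUSTERS[int(digest, 16) % len(CLUSTERS)]

    # If one shard's host dies, only ~1/4 of customers are affected instead of all of them.
    print(cluster_for("acme-corp"))

A static modulo hash like this reshuffles customers whenever the cluster count changes; a real setup would use consistent hashing or a lookup table, but the blast-radius argument is the same.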

The good news is that only a single cluster was impacted. When the "big boys" go down, everything is impacted... but customers don't really care about that.

Not sure whether the impacted customer had other instances that kept working for them?

TheDong | No.36815280
> The bad news is that I'd be out of a job if I chose your service in this instance. 47 hours is two full days.

There was one physical server down. That's it. They even brought it back.

I've had AWS delete more instances, including all data on their local NVMe storage, than I can count on my hands. Just in the last year.

Those instances didn't experience 47 hours downtime, they experienced infinite downtime, gone forever.

I guess by your standard I'd be fired for using AWS too.

But no, in reality, AWS deletes or migrates your instances all the time due to host hardware failure, and it's fine because if you know what you're doing, you have multiple instances across multiple AZs.

The same is true of fly. Sometimes underlying hardware fails (exactly like on AWS), and when that happens, you have to either have other copies of your app, or accept downtime.

I'll also add that the downtime is only 47 hours for you if you can't spin up a new copy on a separate fly host or AZ in the meantime.
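As a rough illustration of what having another copy buys you (hypothetical URLs, Python stdlib only, not tied to Fly's or AWS's APIs): probe each copy's health endpoint and use the first one that answers, so a single dead host degrades you rather than taking you fully down.

    import urllib.request
    import urllib.error

    # Hypothetical endpoints for two copies of the same app in different regions;
    # the names are made up for illustration.
    ENDPOINTS = [
        "https://myapp-ord.example.com/health",
        "https://myapp-iad.example.com/health",
    ]

    def first_healthy(endpoints=ENDPOINTS, timeout=2.0):
        """Return the first endpoint whose health check answers, or None if all are down."""
        for url in endpoints:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status == 200:
                        return url
            except (urllib.error.URLError, OSError):
                continue  # this copy (or the host under it) is unreachable; try the next
        return None

    if __name__ == "__main__":
        target = first_healthy()
        print(target or "all copies down: this is the 'accept downtime' case")

In practice you'd do this failover at the DNS or load-balancer layer rather than in the client, but the principle is the same: run more than one copy, or accept the downtime.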

Jupe | No.36818750
Since the post said "cluster", I assumed it was a set of instances with replicas and the like.

I've never experienced AWS killing nodes forever; at least not DB instances.