
797 points | burnerbob
throwawaaarrgh | No.36813314
There's a lot of bullshit in this HN thread, but here's the important takeaway:

- it seems their staff were working on the issue before customers noticed it.

- once paid support was emailed, it took many hours for them to respond.

- it took about 20 hours for an update from them on the downed host.

- they weren't updating the affected users about the downed host or ways to recover.

- the status page was bullshit: it showed everything green even though their own dashboard was telling customers they had emergency maintenance going on.

I get that due to the nature of their plans and architecture, downtime like this is guaranteed and normal. But communication this poor is going to lose you customers. Be like other providers, who spam me with emails whenever a host I'm on even feels ticklish. Then at least I can go do something for my own apps immediately.
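
If your provider won't push those emails, you can approximate them yourself. Here's a minimal sketch of that loop, assuming a hypothetical JSON status endpoint and a hypothetical alert webhook (neither is Fly.io's real API):

    import json
    import time
    import urllib.request

    # Both URLs are placeholders -- swap in your provider's actual status API
    # and your own alerting webhook. Neither endpoint below is real.
    STATUS_URL = "https://status.example-provider.com/api/components.json"
    ALERT_WEBHOOK = "https://hooks.example.com/notify"

    def fetch_components():
        """Fetch component states from the assumed status API."""
        with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
            return json.load(resp)["components"]

    def alert(message):
        """Post a plain-text alert to the assumed webhook."""
        payload = json.dumps({"text": message}).encode()
        req = urllib.request.Request(
            ALERT_WEBHOOK, data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=10)

    def main():
        degraded = set()
        while True:
            for comp in fetch_components():
                name, status = comp["name"], comp["status"]
                if status != "operational" and name not in degraded:
                    degraded.add(name)
                    alert(f"{name} is {status} -- start failing over your own apps")
                elif status == "operational":
                    degraded.discard(name)
            time.sleep(60)  # poll every minute

    if __name__ == "__main__":
        main()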

replies(6): >>36814300 #>>36814376 #>>36814608 #>>36814689 #>>36816612 #>>36817532 #
wgjordan | No.36814689
(Fly.io employee here)

To clarify, we posted this incident to the personalized status page [1] of all affected customers within 30 minutes of the single host going down, and marked it resolved on that status page once the host was restored ~47h later. Here's the timeline (UTC):

- 2023-07-17 16:19 - host goes down

- 2023-07-17 16:49 - issue posted to personalized status page

- 2023-07-19 15:00 - host is fixed

- 2023-07-19 15:17 - issue marked resolved on status page

[1] https://community.fly.io/t/new-status-page/11398

replies(2): >>36815073 #>>36816412 #
Jupe | No.36815073
Ouch?

The bad news is that I'd be out of a job if I chose your service in this instance. 47 hours is roughly two full days. For an entire cluster to be down for that long is just unacceptable. Rebuilding a cluster from the last-known-good backup should not take that long unless there are PBs of data involved; dividing such large data stores into separate clusters/instances seems warranted. Solution architects should steer customers toward multiple, smaller clusters (sharding) whenever possible. It is far better to have some customers impacted (or just some of your customers' customers) than to have all of them impacted, in my not-so-humble opinion.

And, if the data size is smaller, you may want to trigger a full rebuild earlier in your DR workflows just as an insurance policy.

The good news is that only a single cluster was impacted. When the "big boys" go down, everything is impacted... but customers don't really care about that.

Not sure if this impacted customer had other instances that were working for them?
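
To put the "many smaller clusters" argument in concrete terms, here's a rough sketch of hashing customers onto a fixed set of small clusters so that losing one cluster only hits a slice of them. The cluster names and counts are made up:

    import hashlib

    # Assumed topology: several small clusters instead of one big one.
    CLUSTERS = ["db-shard-01", "db-shard-02", "db-shard-03", "db-shard-04"]

    def cluster_for(customer_id: str) -> str:
        """Deterministically map a customer to one of the small clusters."""
        digest = hashlib.sha256(customer_id.encode()).hexdigest()
        return CLUSTERS[int(digest, 16) % len(CLUSTERS)]

    def blast_radius(customers, down_cluster):
        """List the customers affected when a single cluster goes down."""
        return [c for c in customers if cluster_for(c) == down_cluster]

    if __name__ == "__main__":
        customers = [f"cust-{i:04d}" for i in range(1000)]
        affected = blast_radius(customers, "db-shard-03")
        print(f"{len(affected)} of {len(customers)} customers hit by one cluster outage")

The trade-off is more frequent but smaller failures instead of rare, everyone-at-once ones, which is exactly the point being argued above.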

replies(3): >>36815213 #>>36815280 #>>36817710 #
makoz | No.36817710
Disclaimer: I work at AWS.

> Rebuilding a cluster from the last-known-good backup should not take that long

It's not even clear if that's the right thing to do as a service provider.

Let's say you host a database on some database service, and the entire host is lost. I don't think you want the service provider to restore automatically from the last backup, because that makes assumptions about how much data loss you can tolerate. If it silently resumes from the last backup, you're potentially missing a day of transactions you thought were there, with no warning, as opposed to knowing exactly what you lost from a hard break.
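
A toy illustration of why that call belongs to the customer: whether restoring from the last backup is acceptable depends on the data-loss window (RPO) that particular customer can tolerate, which only they know. The backup timestamp and RPO values below are invented; the host-loss time is taken from the timeline above:

    from datetime import datetime, timedelta, timezone

    def safe_to_auto_restore(last_backup_at: datetime,
                             host_lost_at: datetime,
                             customer_rpo: timedelta) -> bool:
        """True only if the writes made since the last backup fit inside the
        data-loss window this particular customer said they can tolerate."""
        potential_loss = host_lost_at - last_backup_at
        return potential_loss <= customer_rpo

    if __name__ == "__main__":
        last_backup = datetime(2023, 7, 16, 18, 0, tzinfo=timezone.utc)   # invented
        host_lost = datetime(2023, 7, 17, 16, 19, tzinfo=timezone.utc)    # from the timeline above

        # One customer can live with losing a day of writes; another cannot.
        print(safe_to_auto_restore(last_backup, host_lost, timedelta(hours=24)))    # True
        print(safe_to_auto_restore(last_backup, host_lost, timedelta(minutes=15)))  # False

A provider that restores automatically is in effect assuming every customer falls in the first camp.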

replies(1): >>36818722 #
Jupe | No.36818722
Restoring from backup doesn't mean you actually have to use it - just prepare it in case you need it. Since this can take time, starting such a restore early is an insurance policy. If there are snapshots to apply on top of the last-known-good backup, all the better.
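
A sketch of that "prepare it, don't necessarily use it" idea: start the restore the moment the incident opens, layer any newer snapshots on top, and leave the actual cutover as a separate decision. The restore and snapshot functions are stand-ins (simulated with sleeps) for whatever your tooling really does:

    import threading
    import time

    def restore_last_good_backup(target: str) -> None:
        """Stand-in for the slow part: rebuilding a replacement instance from
        the last-known-good backup. Simulated with a sleep here."""
        print(f"[restore] rebuilding {target} from last-known-good backup...")
        time.sleep(2)  # in reality this can take hours, which is why you start early

    def apply_incremental_snapshots(target: str) -> None:
        """Stand-in: replay any snapshots/WAL taken after that backup."""
        print(f"[restore] applying incremental snapshots to {target}...")
        time.sleep(1)

    def start_speculative_restore(target: str) -> threading.Thread:
        """Kick off the restore in the background as soon as the incident opens.
        If the dead host recovers first, the prepared copy is simply discarded;
        cutting traffic over to it remains a separate, human decision."""
        def work():
            restore_last_good_backup(target)
            apply_incremental_snapshots(target)
            print(f"[restore] {target} is warm and waiting -- not serving traffic yet")
        t = threading.Thread(target=work)
        t.start()
        return t

    if __name__ == "__main__":
        job = start_speculative_restore("standby-replacement-01")
        print("[incident] diagnosing the dead host in parallel...")
        job.join()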