
797 points | burnerbob | 10 comments
throwawaaarrgh ◴[] No.36813314[source]
There's a lot of bullshit in this HN thread, but here's the important takeaway:

- it seems their staff were working on the issue before customers noticed it.

- once paid support was emailed, it took many hours for them to respond.

- it took about 20 hours for an update from them on the downed host.

- they weren't updating the affected users about the downed host or about ways to recover.

- the status page was bullshit - just said everything was green even though they told customers in their own dashboard they had emergency maintenance going on.

I get that due to the nature of their plans and architecture, downtime like this is guaranteed and normal. But communication this poor is going to lose you customers. Be like other providers, who spam me with emails whenever a host I'm on even feels ticklish. Then at least I can go do something for my own apps immediately.
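
Rough sketch of what I mean by being able to go do something myself - nothing Fly-specific, just poll your own app's health endpoint and page yourself when it stops answering (the URL and the alert hook below are placeholders):

    import time
    import urllib.request

    HEALTH_URL = "https://my-app.example.com/healthz"  # placeholder endpoint

    def app_is_up(timeout=5):
        """True if the app answers its own health endpoint with a 200."""
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    while True:
        if not app_is_up():
            # swap in whatever paging/email hook you actually use
            print("health check failed - go look at the host yourself")
        time.sleep(60)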

replies(6): >>36814300 #>>36814376 #>>36814608 #>>36814689 #>>36816612 #>>36817532 #
wgjordan ◴[] No.36814689[source]
(Fly.io employee here)

To clarify, we posted this incident to the personalized status page [1] of all affected customers within 30 minutes of this single host going down, and marked it resolved there once the host was back, ~47h later. Here's the timeline (UTC):

- 2023-07-17 16:19 - host goes down

- 2023-07-17 16:49 - issue posted to personalized status page

- 2023-07-19 15:00 - host is fixed

- 2023-07-19 15:17 - issue marked resolved on status page

[1] https://community.fly.io/t/new-status-page/11398
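
(For reference, the ~47h is simply the gap between the first and last timestamps above, e.g. in Python:)

    from datetime import datetime

    host_down = datetime(2023, 7, 17, 16, 19)
    resolved = datetime(2023, 7, 19, 15, 17)
    print((resolved - host_down).total_seconds() / 3600)  # ~46.97 hours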

replies(2): >>36815073 #>>36816412 #
1. Jupe ◴[] No.36815073[source]
Ouch?

The bad news is that I'd be out of a job if I'd chosen your service in this instance. 47 hours is two full days. For an entire cluster to be down that long is just unacceptable. Rebuilding a cluster from the last-known-good backup should not take that long unless there are PBs of data involved; dividing such large data stores into separate clusters/instances seems warranted. Solution archs should steer customers to multiple, smaller clusters (sharding) whenever possible. It is far better to have some customers impacted (or just some of your customers' customers) than to have all of them impacted, in my not so humble opinion.
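
By sharding I don't mean anything fancy - even a stable hash of the customer ID onto N smaller clusters caps the blast radius of a single cluster failure at roughly 1/N of your customers. A toy sketch (the shard count and names are made up):

    import hashlib

    NUM_CLUSTERS = 8  # more, smaller clusters: one failure hits ~1/8 of customers

    def cluster_for(customer_id: str) -> int:
        """Stable customer -> cluster assignment; survives restarts and redeploys."""
        digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_CLUSTERS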

And, if the data size is smaller, you may want to trigger a full rebuild earlier in your DR workflows just as an insurance policy.

The good news is that only a single cluster was impacted. When the "big boys" go down, everything is impacted... but customers don't really care about that.

I'm not sure whether this impacted customer had other instances that were still working for them.

replies(3): >>36815213 #>>36815280 #>>36817710 #
2. mrkurt ◴[] No.36815213[source]
This was a single physical server running multiple VMs using local NVMe storage. It impacted a small fraction of customers.
3. TheDong ◴[] No.36815280[source]
> The bad news is that I'd be out of a job if I chose your service in this instance. 47 hours is two full days.

There was one physical server down. That's it. They even brought it back.

I've had AWS delete more instances, including all local NVMe store data, than I can count on my hands. Just in the last year.

Those instances didn't experience 47 hours downtime, they experienced infinite downtime, gone forever.

I guess by your standard I'd be fired for using AWS too.

But no, in reality, AWS deletes or migrates your instances all the time due to host hardware failure, and it's fine because if you know what you're doing, you have multiple instances across multiple AZs.

The same is true of fly. Sometimes underlying hardware fails (exactly like on AWS), and when that happens, you have to either have other copies of your app, or accept downtime.

I'll also add that the downtime is only 47 hours for you if you don't have the ability to spin up a new copy on a separate fly host or AZ in the meantime.
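
To make "multiple instances across multiple AZs" concrete, this is roughly it with boto3; the AMI, instance type, and AZ names are placeholders, not a recommendation:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    AMI = "ami-0123456789abcdef0"  # placeholder - use your own image
    AZS = ["us-east-1a", "us-east-1b", "us-east-1c"]

    for az in AZS:
        ec2.run_instances(
            ImageId=AMI,
            InstanceType="t3.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": az},  # one copy per AZ
        )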

replies(3): >>36815479 #>>36815526 #>>36818750 #
4. the_duke ◴[] No.36815479[source]
The core issue here is that fly doesn't offer distributed storage, only local disks.

Combine that with their tooling for setting up Postgres on top of single-node storage, and downtime like this and unhappy customers are a given.
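
Which is why it's worth checking that the Postgres you stood up actually has standbys attached before leaning on it - a single-node "cluster" on local NVMe is exactly the failure mode in this thread. A quick check with psycopg2 (the connection string is a placeholder):

    import psycopg2

    # placeholder DSN - point it at your primary
    conn = psycopg2.connect("postgresql://postgres:secret@my-db.internal:5432/postgres")
    with conn, conn.cursor() as cur:
        # standbys currently streaming from this primary; empty means single node
        cur.execute("SELECT client_addr, state, sync_state FROM pg_stat_replication")
        standbys = cur.fetchall()

    if not standbys:
        print("no replicas attached - losing this host means downtime and/or data loss")
    else:
        for addr, state, sync_state in standbys:
            print(f"standby {addr}: {state} ({sync_state})")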

5. yjftsjthsd-h ◴[] No.36815526[source]
When does AWS delete instances? Migrate, sure, and yes, local storage is supposed to be treated as disposable for that reason, but AFAIK only spot instances should be able to be destroyed outright.
replies(2): >>36815597 #>>36820037 #
6. TheDong ◴[] No.36815597{3}[source]
To quote from their docs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance...

> If your instance root device is an instance store volume, the instance is terminated, and cannot be used again.

See also AWS "Dedicated Hosts" and "Mac instances"; those have similar termination behavior.

The majority of the instances I've lost were due to that instance-store behavior.
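
If you're not sure whether that applies to you, the root device type is visible via the API - anything reporting "instance-store" is gone for good when its host dies. A boto3 sketch:

    import boto3

    ec2 = boto3.client("ec2")

    # flag instances whose root disk lives on the host itself
    for reservation in ec2.describe_instances()["Reservations"]:
        for inst in reservation["Instances"]:
            if inst["RootDeviceType"] == "instance-store":
                print(inst["InstanceId"], "- root is instance-store;"
                      " a host failure terminates it permanently")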

7. makoz ◴[] No.36817710[source]
Disclaimer: I work at AWS.

> Rebuilding a cluster from the last-known-good backup should not take that long

It's not even clear if that's the right thing to do as a service provider.

Let's say you host a database on some database service and the entire host is lost. I don't think you want the service provider to restore automatically from the last backup, because that makes assumptions about how much data loss you can tolerate. If it silently comes back from the last backup, you could be missing a day of transactions you thought were there, and they just quietly disappear, as opposed to knowing they were lost at a hard break.
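
That's also why (in RDS terms, since that's what I know) the restore is something the customer kicks off explicitly, choosing the recovery point they can tolerate, rather than the service silently swapping in last night's backup. The identifiers and timestamp below are made up for illustration:

    import boto3
    from datetime import datetime, timezone

    rds = boto3.client("rds")

    # the customer explicitly picks the recovery point they can live with
    rds.restore_db_instance_to_point_in_time(
        SourceDBInstanceIdentifier="prod-db",            # made-up name
        TargetDBInstanceIdentifier="prod-db-recovered",  # made-up name
        RestoreTime=datetime(2023, 7, 17, 16, 0, tzinfo=timezone.utc),
        # or UseLatestRestorableTime=True to take everything up to the failure
    )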

replies(1): >>36818722 #
8. Jupe ◴[] No.36818722[source]
Restoring from backup doesn't mean you actually have to use it - just prepare it in case you need it. Since a restore can take time, starting it early acts as an insurance policy. If there are snapshots to apply on top of the last-known-good backup, all the better.
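
To make that concrete in RDS terms (made-up identifiers), "prepare it but don't use it" is just restoring the newest snapshot into a standby that sits idle unless the broken primary never comes back:

    import boto3

    rds = boto3.client("rds")

    def start_insurance_restore(source_id: str, standby_id: str) -> None:
        """Restore the newest automated snapshot into a standby instance.
        Nothing is promoted or cut over here - the standby just sits ready."""
        snapshots = rds.describe_db_snapshots(
            DBInstanceIdentifier=source_id, SnapshotType="automated"
        )["DBSnapshots"]
        latest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])
        rds.restore_db_instance_from_db_snapshot(
            DBInstanceIdentifier=standby_id,
            DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
        )

    start_insurance_restore("prod-db", "prod-db-dr-standby")  # made-up names
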
9. Jupe ◴[] No.36818750[source]
Since the post said "cluster", I assumed it was a set of instances with replicas and the like.

I've never experienced AWS killing nodes forever; at least not DB instances.

10. yencabulator ◴[] No.36820037{3}[source]
The underlying problem is that Fly doesn't provide non-local, less-eager-to-disappear volumes.