797 points burnerbob | 52 comments
1. throwawaaarrgh ◴[] No.36813314[source]
There's a lot of bullshit in this HN thread, but here's the important takeaway:

- it seems their staff were working on the issue before customers noticed it.

- once paid support was emailed, it took many hours for them to respond.

- it took about 20 hours for an update from them on the downed host.

- they weren't updating the affected users about the downed host or about ways to recover.

- the status page was bullshit - it said everything was green even though, in their own dashboard, they told customers they had emergency maintenance going on.

I get that due to the nature of their plans and architecture, downtime like this is guaranteed and normal. But communication this poor is going to lose you customers. Be like other providers, who spam me with emails whenever a host I'm on even feels ticklish. Then at least I can go do something for my own apps immediately.

replies(6): >>36814300 #>>36814376 #>>36814608 #>>36814689 #>>36816612 #>>36817532 #
2. seti0Cha ◴[] No.36814300[source]
Not a great summary from my perspective. Here's what I got out of it:

- Their free tier support depended on noticing message board activity and they didn't.

- Those experiencing outages were seeing the result of deploying in a non-HA configuration. Opinions differ as to whether they were properly aware that they were in that state.

- They had an unusually long outage for one particular server.

- Those points combined resulted in many people experiencing an unexplained prolonged outage.

- Their dashboard shows only regional and service outages, not individual servers being down. People did not realize this and so assumed it was a lie.

- Some silliness with Discourse tags caused people to think they were trying to hide the problems.

In short: bad luck, some bad procedures from a customer-management POV, and possibly some bad documentation resulted in a lot of smoke but not a lot of fire.

replies(2): >>36814431 #>>36814471 #
3. malablaster ◴[] No.36814376[source]
> there’s a lot of bullshit

…proceeds to make a bunch of non-factual statements.

4. tptacek ◴[] No.36814431[source]
Apologies for repeating myself, but:

You get to a certain number of servers and the probability on any one day that some server somewhere is going to hiccup and bounce gets pretty high. That's what happened here: a single host in Sydney, one of many, had a problem.
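
(To put rough numbers on that, with made-up figures - say each host independently has a 0.1% chance of a bad day - a fleet of a few thousand hosts all but guarantees that some host, somewhere, has a problem on any given day. A back-of-the-envelope sketch, not our real failure rate or fleet size:)

  // Back-of-the-envelope: P(at least one host fails today) = 1 - (1-p)^n.
  // p and n are illustrative assumptions, not real fleet statistics.
  package main

  import (
      "fmt"
      "math"
  )

  func main() {
      p := 0.001 // assumed daily failure probability of a single host
      for _, n := range []float64{10, 100, 1000, 5000} {
          atLeastOne := 1 - math.Pow(1-p, n)
          fmt.Printf("%5.0f hosts -> %4.1f%% chance some host fails today\n", n, atLeastOne*100)
      }
  }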

When we have an incident with a single host, we update a notification channel for people with instances on that host. They are a tiny sliver of all our users, but of course that's cold comfort for them; they're experiencing an outage! That's what happened here: we did the single-host notification thing for users with apps on that Sydney host.

Normally, when we have a single-host incident, the host is back online pretty quickly. Minutes, maybe double-digit minutes if something gnarly happened. About once every 18 months or so, something worse than gnarly happens to a server (they're computers, we're not magic, all the bad things that happen to computers happen to us too). That's what happened here: we had an extended single-host outage, one that lasted over 12 hours.

(Specifically, if you're interested: somehow a containerd boltdb on that host got corrupted, so when the machine bounced, containerd refused to come back online. We use containerd as a cache for OCI container images backing flyd; if containerd goes down, no new machines can start on the host. It took a member of our team, also a containerd maintainer, several hours to do battlefield surgery on that boltdb to bring the host back up.)
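
(A minimal sketch of where that kind of surgery starts - just checking the boltdb's integrity with go.etcd.io/bbolt. The path is the usual containerd location, assumed here, and this is nowhere near the actual repair work:)

  // Minimal integrity check of a boltdb file using go.etcd.io/bbolt.
  // The path is an assumption about a typical containerd install.
  package main

  import (
      "fmt"
      "log"
      "time"

      bolt "go.etcd.io/bbolt"
  )

  func main() {
      db, err := bolt.Open("/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db",
          0600, &bolt.Options{ReadOnly: true, Timeout: 5 * time.Second})
      if err != nil {
          log.Fatalf("open: %v", err) // a corrupted file can already fail here
      }
      defer db.Close()

      err = db.View(func(tx *bolt.Tx) error {
          // Tx.Check walks every page and reports structural inconsistencies.
          for e := range tx.Check() {
              fmt.Println("corruption:", e)
          }
          return nil
      })
      if err != nil {
          log.Fatalf("view: %v", err)
      }
  }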

Now, as you can see from the fact that we were at the top of HN all night, there is a difference between a 5-minute single-host incident and a 12-hour single-host outage. Our runbook for single-host problems is tuned for the former. 12-hour single-host outages are pretty rare, and we probably want to put them on the global status page (I'm choosing my words carefully because we have an infra team and infra management and I'm not on it, and I don't want to speak for them or, worse, make commitments for them; all I can say is I get where people are coming from on this one).

replies(4): >>36814531 #>>36814688 #>>36815167 #>>36816132 #
5. CSSer ◴[] No.36814471[source]
I'm surprised by your risk tolerance. If I had any cloud service at this level in my stack go down for three days, I'd start shopping for an alternative. This exceeds the level of acceptability for me for even non-HA requirements. After all, if I can't trust them for this, why would I ever consider giving them my HA business? Just based on napkin math for us, this could've been a potential loss of nearly half a million dollars. Up until this point, I've looked at Fly.io's approach to PR and their business as unconventional but endearing. Now I'm beginning to look at them as unserious. I'm sorry if that sounds harsh. It's the cold truth.
replies(2): >>36815583 #>>36815910 #
6. CSSer ◴[] No.36814531{3}[source]
Why are your customers exposed to this? This sounds like a tough problem that I'm sympathetic to for you personally, but it sounds like there's no failover or appropriate redundancy in place to roll over to while you work to fix the problem.

edit: I hope this comment doesn't sound accusatory. At the end of the day I want everyone to succeed. I hope there's a silver lining to this in the post-mortem.

replies(1): >>36814624 #
7. 3oH2y869 ◴[] No.36814608[source]
I've personally had this experience with Fly on a personal project. My project went down but their status pages said everything was up. It's fine since it's a personal for-fun project, but for anything more serious I don't know if I'd be comfortable using them.
8. tptacek ◴[] No.36814624{4}[source]
The way to not be exposed to this is to run an HA configuration with more than one instance.

If you're running an app on Fly.io without local durable storage, then it's easy to fail over to another server. But durable storage on Fly.io is attached NVMe storage.

By far the most common way people use durable storage on Fly.io is with Postgres databases. If you're doing that on Fly.io, we automatically manage failover at the application layer: you run multiple instances, they configure themselves in a single-writer multi-reader cluster, and if the leader fails, a replica takes over.

We will let you run a single-instance Postgres "cluster", and people definitely do that. The downside to that configuration is, if the host you're on blows up, your availability can take a hit. That's just how the platform works.
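
(If you want to picture the single-writer/multi-reader shape, here's a rough generic-Postgres sketch of telling the writable leader from a read-only replica - not our actual failover machinery, and the connection strings are placeholders:)

  // Rough sketch: distinguish the writable leader from read-only replicas
  // using pg_is_in_recovery(). Hostnames and credentials are placeholders.
  package main

  import (
      "database/sql"
      "fmt"
      "log"

      _ "github.com/lib/pq"
  )

  // isReplica reports whether the instance behind dsn is in recovery (a replica).
  func isReplica(dsn string) (bool, error) {
      db, err := sql.Open("postgres", dsn)
      if err != nil {
          return false, err
      }
      defer db.Close()

      var inRecovery bool
      if err := db.QueryRow("SELECT pg_is_in_recovery()").Scan(&inRecovery); err != nil {
          return false, err
      }
      return inRecovery, nil
  }

  func main() {
      candidates := []string{
          "postgres://app@instance-1.internal:5432/app?sslmode=disable",
          "postgres://app@instance-2.internal:5432/app?sslmode=disable",
      }
      for _, dsn := range candidates {
          replica, err := isReplica(dsn)
          if err != nil {
              log.Printf("%s unreachable: %v", dsn, err) // a dead leader shows up here
              continue
          }
          fmt.Printf("%s: replica=%v\n", dsn, replica)
      }
  }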

replies(2): >>36814820 #>>36815463 #
9. seti0Cha ◴[] No.36814688{3}[source]
It seems to me like there's room for improving your customers' awareness around what is required for HA and how to tell when they are affected by a hardware issue. On the other hand, it may just be that the confusion is mostly amongst the casual onlookers, in which case you have my sympathies!
replies(1): >>36814971 #
10. wgjordan ◴[] No.36814689[source]
(Fly.io employee here)

To clarify, we communicated this incident to the personalized status page [1] of all affected customers within 30 minutes of this single host going down, and resolved the incident on the status page once it was resolved ~47h later. Here's the timeline (UTC):

- 2023-07-17 16:19 - host goes down

- 2023-07-17 16:49 - issue posted to personalized status page

- 2023-07-19 15:00 - host is fixed

- 2023-07-19 15:17 - issue marked resolved on status page

[1] https://community.fly.io/t/new-status-page/11398

replies(2): >>36815073 #>>36816412 #
11. CSSer ◴[] No.36814820{5}[source]
I see. Have you considered eliminating this configuration from your offering? It sounds like the terminology could confuse people, and they may be assuming that a host is something more than what it really is (a single machine). This kind of thing is difficult for those seeking to build managed services: people expect you to provide offerings that can't harm them when the cause is related to the service they're paying for, and it's difficult to figure out which sharp objects they understand and which ones they don't. People should know better, but if they did, would they need you?

If this sounds ludicrous, then I think I probably don't understand who Fly.io wants to be and that's okay. If I don't understand, however, you may want to take a look at your image and messaging to potentially recalibrate what kind of customers you're attracting.

replies(1): >>36815012 #
12. CoolCold ◴[] No.36814971{4}[source]
I'm not sure this will make much sense - the customers who DON'T WANT to be aware of what's required for HA (say, solo devs) are the ones choosing this type of hosting. Even if you put up educational articles, I'm unsure they will be read. Putting a BANNER IN RED LETTERS into the CLI output, plus a link to an article, might work, though.

What do you think?

replies(2): >>36815207 #>>36815219 #
13. TheDong ◴[] No.36815012{6}[source]
Plenty of people would rather take downtime than pay for redundancy, for example for a test database.

AWS RDS lets you spin up an RDS instance that costs 3x less and regularly has downtime (the 'single-AZ' one), quite similar to this.

Anyone who's used servers before knows "A single instance" is the same as "sometimes you might have downtime".

Computers aren't magic. Everyone from Heroku (you must have multiple dynos to be highly available) to EC2 (multiple instances across AZs) agrees that "a single machine is not redundant". I don't see how Fly's messaging is out of line with that. They don't tell you anywhere "our apps and machines are literally magic and will never fail".

replies(2): >>36816557 #>>36821067 #
14. Jupe ◴[] No.36815073[source]
Ouch?

The bad news is that I'd be out of a job if I'd chosen your service in this instance. 47 hours is two full days. For an entire cluster to be down for that long is just unacceptable. Rebuilding a cluster from the last-known-good backup should not take that long unless there are PBs of data involved; dividing such large data stores into separate clusters/instances seems warranted. Solution architects should steer customers to multiple, smaller clusters (sharding) whenever possible. It is far better to have some customers impacted (or just some of your customers' customers) than to have all of them impacted, in my not so humble opinion.

And, if the data size is smaller, you may want to trigger a full rebuild earlier in your DR workflows just as an insurance policy.

The good news is that only a single cluster was impacted. When the "big boys" go down, everything is impacted... but customers don't really care about that.

Not sure if this impacted customer had other instances that were working for them?

replies(3): >>36815213 #>>36815280 #>>36817710 #
15. kunley ◴[] No.36815167{3}[source]
> somehow a containerd boltdb on that host got corrupted, so when the machine bounced, containerd refused to come back online. We use containerd as a cache

Hey, even if I feel sympathetic about the course of unfortunate events, it's hard not to comment:

if you're using a cache, you should invalidate it on failure!

replies(1): >>36815351 #
16. wgjordan ◴[] No.36815207{5}[source]
This is exactly how it currently works:

  $ fly volumes create mydata
  Warning! Individual volumes are pinned to individual hosts.
  You should create two or more volumes per application.
  You will have downtime if you only create one.
  Learn more at https://fly.io/docs/reference/volumes/
  ? Do you still want to use the volumes feature? (y/N)
(and yes, the warning is already even in red letters too)
replies(1): >>36816705 #
17. mrkurt ◴[] No.36815213{3}[source]
This was a single physical server running multiple VMs using local NVMe storage. It impacted a small fraction of customers.
18. seti0Cha ◴[] No.36815219{5}[source]
I agree, articles tend not to get read by those who need them most. A warning from the CLI and a banner on the app management page with a link to a detailed explanation would seem like a good approach.

edit: sibling post shows there is such a message on the CLI. The only other thing I can think of is an "Are you sure you want to do this?" prompt, but in the end you can't reach everybody.

replies(2): >>36816844 #>>36816892 #
19. TheDong ◴[] No.36815280{3}[source]
> The bad news is that I'd be out of a job if I chose your service in this instance. 47 hours is two full days.

There was one physical server down. That's it. They even brought it back.

I've had AWS delete more instances, including all local NVMe store data, than I can count on my hands. Just in the last year.

Those instances didn't experience 47 hours downtime, they experienced infinite downtime, gone forever.

I guess by your standard I'd be fired for using AWS too.

But no, in reality, AWS deletes or migrates your instances all the time due to host hardware failure, and it's fine because if you know what you're doing, you have multiple instances across multiple AZs.

The same is true of fly. Sometimes underlying hardware fails (exactly like on AWS), and when that happens, you have to either have other copies of your app, or accept downtime.

I'll also add that the downtime is only 47 hours for you if you don't have the ability to spin up a new copy on a separate fly host or AZ in the meanwhile.

replies(3): >>36815479 #>>36815526 #>>36818750 #
20. tptacek ◴[] No.36815351{4}[source]
It's a read-through cache. This wasn't a cache invalidation issue. It's a systems-level state corruption problem that just happened to break a system used primarily as a cache.
replies(1): >>36821279 #
21. kevin_nisbet ◴[] No.36815463{5}[source]
Unless something has changed and I'm out of date, I think a piece of context here is fly postgres isn't really a managed service offering. From what I've seen fly does try to message this, but I think it's still easy for some subset of customers to miss that they're deploying an OSS component, maybe deployed a non-HA setup and forgot, and it's not the same as buying a database as a service.

So hopefully as fly.io gets more popular, there will be some compelling managed offerings. I saw comments at one point from the Neon CEO about a fly.io offering, but I'm not sure if that went anywhere. I'm sure customers can also use Crunchy or other offerings.

22. the_duke ◴[] No.36815479{4}[source]
The core issue here is that fly doesn't offer distributed storage, only local disks.

Combine that with them having tooling for setting up Postgres built on top of single node storage, and you have the downtime problems and unhappy customers as a given.

23. yjftsjthsd-h ◴[] No.36815526{4}[source]
When does AWS delete instances? Migrate, sure, and yes, local storage is supposed to be treated as disposable for that reason, but AFAIK only spot instances should be able to be destroyed outright.
replies(2): >>36815597 #>>36820037 #
24. mrkurt ◴[] No.36815583{3}[source]
You're saying a single server failure is going to cost your business half a million dollars?

This was a server with local NVMe storage. The simplest thing to do would have been to just get rid of it, but we have quite a few free users with data they care about running on single node Postgres (because it's cheaper). It seemed like a better idea to recover this thing.

replies(1): >>36821027 #
25. TheDong ◴[] No.36815597{5}[source]
To quote from their docs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance...

> If your instance root device is an instance store volume, the instance is terminated, and cannot be used again.

See also the aws "Dedicated Hosts" and "Mac Instances". Those also have similar termination behavior.

The majority of my instances lost are from the instance store thing.

26. tinco ◴[] No.36815910{3}[source]
I think you're not exposed enough to the reality of hardware. There was no need for the host to come back online at all. I think it was a mistake for Fly.io to even attempt it. Just tell the customer the host was lost and offer them a new one (with a freshly zeroed volume attached). You rent a machine, it breaks, you get a new one.

If they're sad that they lost their data, it's their fault for running on a single host with no backup. By actually performing an (apparently) difficult recovery, they reinforced their customers' erroneous expectation that Fly is somehow responsible for the integrity of the data on any single host.

replies(3): >>36817834 #>>36820919 #>>36824212 #
27. laweijfmvo ◴[] No.36816132{3}[source]
> over 12 hours

How much is over 12 hours? 12 hours and 10 minutes? 13 hours? 67 days?

28. throwawaaarrgh ◴[] No.36816412[source]
Dude. I don't sit at home refreshing status pages. Send me an e-mail.

That's how other [useful] providers notify their customers that one of their hosts went down unexpectedly. Linode will send me 6 emails when they need to reboot something. Even Oracle sends me notices about network blips. I believe I've gotten one from AWS, but I also know sometimes their gear gets stuck in a bad state and I didn't get a notification, which was super annoying because it took forever to figure out it was AWS's faulty state.

replies(1): >>36831966 #
29. remram ◴[] No.36816557{7}[source]
Single-AZ is not single-host though, and while a single AZ can go down in major events, it doesn't break because a single piece of hardware failed.
replies(1): >>36816666 #
30. tinco ◴[] No.36816612[source]
Haha, imagine what the AWS status page would look like if they had to update their global status page anytime a single host would go down in any region.

Fly.io messed up: they didn't want to be a Heroku clone, but their marketing and their polished user-experience design made it seem like they would be one anyway.

And as a reward now they have to deal with bottom of the barrel Heroku users that manage to do major damage to their brand whenever a single host goes down. Who would have predicted that corporate risk?

31. makoz ◴[] No.36816666{8}[source]
Sure, but isn't this more about risk tolerance at this point, and about how much your customers care? The responsibility should be on the customer's end. Running on EBS/RDS doesn't guarantee you won't lose data. If you care about it, you enable backups and test recovery.

Just because some customers are less fault tolerant than others doesn't mean we shouldn't offer options for people who don't have the same requirements or are willing to work around them.

32. CoolCold ◴[] No.36816705{6}[source]
Sounds like it hasn't helped already - no need even to guess. One of those moments when you have mixed feelings about being right.
33. CoolCold ◴[] No.36816844{6}[source]
Indeed
34. tptacek ◴[] No.36816892{6}[source]
There is an "Are you sure you want to do this?" prompt!
replies(1): >>36817856 #
35. WuxiFingerHold ◴[] No.36817532[source]
> There's a lot of bullshit in this HN thread

Then consider replying directly to the posts containing wrong information instead of making such a generalised accusation.

> I get that due to the nature of their plans and architecture, downtime like this is guaranteed and normal.

What other cloud providers have downtimes of 20 hours? There must be a lot to call this "guaranteed and normal".

Sadly, I've always felt a good amount of passive-aggressiveness in many of the HN threads where fly.io is involved.

36. makoz ◴[] No.36817710{3}[source]
Disclaimer: I work at AWS.

> Rebuilding a cluster from the last-known-good backup should not take that long

It's not even clear if that's the right thing to do as a service provider.

Let's say you host a database on some database service, and the entire host is lost. I don't think you want the service provider to restore automatically from the last backup, because that makes assumptions about what data loss you're tolerant of. If it just comes back from the last backup, suddenly you're potentially missing a day of transactions you thought were there, and they've silently disappeared, as opposed to knowing they disappeared because of a hard break.

replies(1): >>36818722 #
37. itake ◴[] No.36817834{4}[source]
Is this the posture of other hosting providers? If not, it seems other hosting providers offer better quality of service.
replies(1): >>36818191 #
38. seti0Cha ◴[] No.36817856{7}[source]
Make them type the phrase "I'm OK with downtimes of arbitrary length"!

I kid, seems like you guys did what you could.

39. tinco ◴[] No.36818191{5}[source]
I would think so; it's honestly strange to think about. The idea of having the node come back after it broke is a bit ridiculous to me. A node breaks, you delete it from your interface and provision a new one; the idea of even waiting 5 minutes for it to come up is strange. This whole conversation seems detached from how the cloud is supposed to operate, and has operated, for the past decade.
40. Jupe ◴[] No.36818722{4}[source]
Restoring from backup doesn't mean you actually have to use it - just prepare it in case you need it. Since this can take time, starting such a restore early would be an insurance policy, if needed. If there are snapshots to apply after the last-known-good backup, all the better.
41. Jupe ◴[] No.36818750{4}[source]
Since the post said "cluster", I assumed it was a set of instances with replicas and the like.

I've never experienced AWS killing nodes forever; at least not DB instances.

42. yencabulator ◴[] No.36820037{5}[source]
The underlying problem is that Fly doesn't provide non-local, less-eager-to-disappear, volumes.
43. CSSer ◴[] No.36820919{4}[source]
In hindsight I wish I could edit, because my above comment was pretty trigger-happy and overly focused on the amount of downtime. It was colored by some existing preconceptions I had about Fly, and I'm honestly surprised it continues to be upvoted. When I made this comment I hadn't yet learned some of the bits you mention here at the end from another thread. Anyway, I tend to agree overall. I actually suggested Fly reconsider even offering this configuration, given that they refer to it as a "single-node cluster", which is an oxymoron.
44. CSSer ◴[] No.36821027{4}[source]
No, it wouldn't - at least not given the contextual details of this situation, because we wouldn't do that. Honestly, parts of my above comment hold, but I admit it was a bit impulsive of me in the moment because I hadn't yet learned all of the details necessary to make that judgment call. That number is right under slightly different circumstances, if you're asking, but it sounds like you were trying to prove a point. If that's true, you succeeded. I learned a bit later that what they were calling a cluster was a single server, and that's just... yeah.
45. CSSer ◴[] No.36821067{7}[source]
I don't disagree. I was latching onto the idea that people are running single-node "clusters". Whatever it is, it isn't a cluster.
46. kunley ◴[] No.36821279{5}[source]
What I meant is that if the compromised host was unable to use the broken boltdb cache, the cache should have been zeroed and repopulated. Would that rebuild really have taken as many hours as trying to fix the boltdb did?

Btw, I am happy I keep only small amounts of data in any of my bolt databases...

replies(1): >>36821769 #
47. tptacek ◴[] No.36821769{6}[source]
This isn't a boltdb we designed. It's just containerd. I am probably not doing the outage justice, because "blitz and repopulate" is a time-honored strategy here.
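
(For the unfamiliar, "blitz and repopulate" here would mean roughly: stop containerd, move its state directory aside, restart, and let images get pulled again on demand. A hypothetical sketch, with paths and commands assumed for a generic Linux host - not what we actually did:)

  // Hypothetical "blitz and repopulate" sketch: discard containerd's local
  // state instead of repairing it, and let images re-download on demand.
  // Paths and service names are assumptions about a generic Linux host.
  package main

  import (
      "fmt"
      "log"
      "os"
      "os/exec"
      "time"
  )

  func run(name string, args ...string) {
      if out, err := exec.Command(name, args...).CombinedOutput(); err != nil {
          log.Fatalf("%s %v: %v\n%s", name, args, err, out)
      }
  }

  func main() {
      backup := fmt.Sprintf("/var/lib/containerd.broken-%d", time.Now().Unix())

      run("systemctl", "stop", "containerd")
      // Keep the corrupted state around for later inspection instead of deleting it.
      if err := os.Rename("/var/lib/containerd", backup); err != nil {
          log.Fatal(err)
      }
      run("systemctl", "start", "containerd") // containerd recreates fresh, empty state
      // From here every OCI layer the host needs gets pulled again - the cost
      // you pay for skipping the surgery.
  }
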
48. Dylan16807 ◴[] No.36824212{4}[source]
They're not responsible for extreme data recovery, but (almost?) all of the customer data volumes on that server were completely intact. They damn well should be responsible for getting that data back to their customers, whether or not they get the server going again.

If you run off a single drive, and the drive dies, any resulting data loss is your fault. But not if something else dies.

replies(1): >>36824655 #
49. markonen ◴[] No.36824655{5}[source]
I'm absolutely 100% certain that AWS (for example) wouldn't do that for you with the instance types that feature direct attached storage.
replies(1): >>36827975 #
50. Dylan16807 ◴[] No.36827975{6}[source]
Directly attached storage in AWS is a special niche that disappears when you so much as hibernate. And even then they talk about how disk failure loses the data but power failure won't.

This is much closer to EBS breaking. It happens sometimes, but if the data is easily accessible then it shouldn't get tossed.

51. CameronNemo ◴[] No.36831966{3}[source]
How do you know emails weren't sent in addition to the status page changes?
replies(1): >>36838724 #
52. throwawaaarrgh ◴[] No.36838724{4}[source]
The whole point of this HN thread is that customers weren't getting regular updates. If they had been, they wouldn't be on a random community forum trying to get support's attention.