797 points burnerbob | 490 comments
1. burnerbob ◴[] No.36808377[source]
Fly have tried to hush this by making the thread [1] private to anyone not logged in.

One quote from thread:

> This is the second time I’ve had this kind of issue with Fly, where my service just goes down, Fly reports everything healthy, and there’s literally no information and nothing I can really do other than wait and hope it comes back up sometime

Another user:

> We had four machines (app + Postgres for staging and production) running yesterday, and three of the four (including both databases) are still down and can’t be accessed. I can replicate the issues others have mentioned here.

> This is our company’s external API app and so the issue broke all of our integrations.

> Our team ended up setting up a new project in fly to spin up an instance to keep us going which took a couple of hours (backfilling environment variables and configuration etc, not a bad test of our DR ability).

> There is no way I can find to get the data from the db machines. Thank goodness this isn’t our main production db and we were able to reverse engineer what we needed into there.

> Very keen to hear what’s happening with this and why after so many hours there’s no more info or updates.

Another user:

> As an aside, it’s kind of a kick in the teeth to see the status page for our organization reporting no incidents - the same page that lists our apps as under maintenance and inaccessible!

Another user:

> I’m feeling very lucky that none of our paid production apps or databases are affected currently (only our development environment is), but also really surprised that the issue has been ongoing for 17 hours now with no status page update, no notifications (beyond betterstack letting us know it was down) and one note on the app with not much info as to what’s going on.

> It really worries me what would happen if it was one of our paid production instances that was affected - the data we’re working with can’t simply be ‘recovered’ later, it’d just get dropped until service resumed or we migrated to another region to get things running again

> Keen to know what’s wrong and what’s being done about it

Full thread (as at time of HN post; more has been added since): https://pastebin.com/ebmCSZkC

Someone tweeted Fly CEO: https://twitter.com/SouthPawNZ/status/1682181533673857024

[1] https://community.fly.io/t/service-interruption-cant-destroy...

2. DumbStarbucks ◴[] No.36809439[source]
You unfortunately get what you pay for.

AWS is more expensive than God, but I'll be damned if you can't have a throat to choke in less than 10 minutes whenever something like this happens.

replies(4): >>36809597 #>>36809932 #>>36810207 #>>36810334 #
3. dcchambers ◴[] No.36809492[source]
I like fly.io a lot and I want them to succeed. They're doing challenging work...things break.

Have to admit it's disappointing to hear about the lack of communication from them, especially when it's something the CEO specifically called out that they wanted to fix in his big reliability post to the community back in March.

https://community.fly.io/t/reliability-its-not-great/11253#s...

replies(2): >>36809640 #>>36809652 #
4. mrcwinn ◴[] No.36809493[source]
Don't get me started with Fly — especially postgres machines. In my experience, a really nice idea with poor support and unreliable infrastructure.
5. sharts ◴[] No.36809526[source]
The thing about product marketing is that it is almost always greatly exaggerated at best and borderline (if not outright) lies at worst.
6. xyzzy_plugh ◴[] No.36809571[source]
I think fly.io is pretty incredible but I can't help but feel they're doomed to follow in heroku's footsteps (unclear if good or bad). They've built some pretty wild stuff and I can't help but wonder if they're overcooking the ocean instead of just solving problems for their users.

Durable and available storage are all they really need to draw me away from big cloud providers but this combined with their answer to S3 being "use S3 or run minio" means I'll never take them seriously.

This is a bad look folks, not sure how you can walk back days of silence and hiding threads. Just open an issue and talk to your users.

replies(2): >>36809844 #>>36809908 #
7. marban ◴[] No.36809579[source]
They're less humble in communicating other things https://fly.io/blog/we-raised-a-bunch-of-money/
replies(2): >>36809630 #>>36809870 #
8. aledalgrande ◴[] No.36809594[source]
Wondering if for small/bootstrapped projects there's any alternative people suggest? Fly has a nice UX and accessible prices, but it's unstable at best. I use the big clouds at work, but for personal they are $$$. Also I want to keep devops tending asymptotically to zero.
replies(6): >>36809634 #>>36809757 #>>36809923 #>>36809986 #>>36810231 #>>36817346 #
9. orangepurple ◴[] No.36809597[source]
AWS support replies back to your messages when they feel like it. Their support is just as shady but they have better uptime for sure
replies(2): >>36809668 #>>36809933 #
10. mrcwinn ◴[] No.36809630[source]
I could not agree more. When I read this my immediate thought was — all that money, and none spent on product marketing or copywriting. Oof.
11. gowthamgts12 ◴[] No.36809634[source]
Although I have never used them, you can explore railway.app. It is the closest to fly.io, and I have never heard any bad things about it.

I personally at the moment use digitalocean without any issues, but there's always the maintenance overhead of managing a server yourself.

replies(2): >>36809768 #>>36812801 #
12. mrcwinn ◴[] No.36809640[source]
Yes, this. It's tough when you've already played your "we messed up but we're making it right" card, and then you continue to not have it right.
replies(1): >>36809864 #
13. spiderice ◴[] No.36809650[source]
There is now a response to the support thread from Fly[1]:

> Hi Folks,

> Just wanted to provide some more details on what happened here, both with the thread and the host issue.

> The radio silence in this thread wasn’t intentional, and I’m sorry if it seemed that way. While we check the forum regularly, sometimes topics get missed. Unfortunately this thread slipped by us until today, when someone saw it and flagged it internally. If we’d seen it earlier, we’d have offered more details then.

> More on what happened: We had a single host in the syd region go down, hard, with multiple issues. In short, the host required a restart, then refused to come back online cleanly. Once back online, it refused to connect with our service discovery system. Ultimately it required a significant amount of manual work to recover.

> Apps running multiple instances would have seen the instance on this host go unreachable, but other instances would have remained up and new instances could be added. Single instance apps on this host were unreachable for the duration of the outage. We strongly recommend running multiple instances to mitigate the impact of single-host failures like this.

> The main status page (status.fly.io) is used for global and regional outages. For single host issues like this one we post alerts on the status tab in the dashboard (the emergency maintenance message @south-paw posted). This was an abnormally long single-host failure and we’re reassessing how these longer-lasting single-host outages are communicated.

> It sucks to feel ignored when you’re having issues, even when it’s not intentional. Sorry we didn’t catch this thread sooner.

[1] https://community.fly.io/t/service-interruption-cant-destroy...

replies(10): >>36809693 #>>36809725 #>>36809824 #>>36809928 #>>36810269 #>>36810740 #>>36811025 #>>36812597 #>>36812956 #>>36813681 #
14. jssjr ◴[] No.36809652[source]
I appreciate the honest feedback. We could have done better communicating about the problem. We've been marking single host failures in the dashboard for affected users and using our status page to reflect things like platform and regional issues, but there's clearly a spot in the middle where the status we're communicating and actual user experience don't line up.

We've been adding a ton more hardware lately to stay ahead of capacity issues and as you would expect this means the volume of hardware-shaped failures has increased even though the overall failure probability has decreased. There's more we can do to help users avoid these issues, there's more we can do to speed up recovery, and there's more we can do to let you know when you're impacted.

All this feedback matters. We hear it even when we drop the ball communicating.

replies(1): >>36809791 #
15. erulabs ◴[] No.36809668{3}[source]
FWIW, our aws enterprise support reps are available 24/7 and usually respond within a few minutes.

But again, you get what you pay for.

replies(1): >>36810069 #
16. gowthamgts12 ◴[] No.36809693[source]
> While we check the forum regularly, sometimes topics get missed. Unfortunately this thread slipped by us until today, when someone saw it and flagged it internally.

If it really got missed, then I don't understand how the thread was made private to only logged-in users?

replies(3): >>36810248 #>>36810251 #>>36810285 #
17. mrcwinn ◴[] No.36809725[source]
For what it’s worth, I left Fly because of this crap. At first my Fly machine web app had intermittent connection issues to a new production PG machine. Then my PG machine died. Hard. I lost all data. A restart didn’t work - it could not recover. I restored an older backup over at RDS and couldn’t be happier I left.
replies(5): >>36809880 #>>36810018 #>>36810039 #>>36810724 #>>36814012 #
18. reustle ◴[] No.36809757[source]
I’m quite happy with https://render.com after leaving Heroku
replies(4): >>36809955 #>>36810014 #>>36810190 #>>36811406 #
19. Fire-Dragon-DoL ◴[] No.36809768{3}[source]
I wish digitalocean offered decent pricing for spaces (s3). Unfortunately it starts at $5, which is an enormous price for storing 70 small images. Still, s3 would greatly simplify my server management by moving state entirely outside the server (managed database + managed object storage).
replies(2): >>36809782 #>>36809841 #
20. bongobingo1 ◴[] No.36809782{4}[source]
> price for storing 70 small images

Do you have to use an object store in that case? Or does it have to be separate from whatever application instance?

replies(1): >>36810063 #
21. skullone ◴[] No.36809791{3}[source]
What hardware are you buying? Across tens of thousands of physical nodes in my environment, only a few would have "fatal" enough problems that required manual intervention per year. Yes, we had hundreds of drives die a year, and some ECC RAM would exceed error thresholds, but downtime on any given node was rare (aside from patching, but we'd just live migrate KVM instances around as needed).
replies(1): >>36810147 #
22. SadTrombone ◴[] No.36809810[source]
Incredibly unimpressed at fly.io staff for hiding/making private the downtime forum support thread.
23. xx__yy ◴[] No.36809823[source]
According to their Status page it's all resolved: https://status.flyio.net/
replies(1): >>36810093 #
24. bongobingo1 ◴[] No.36809824[source]
Seems like the OP should have made a HN thread in the first place instead of posting to community.stri^H^H^H^Hfly.io
replies(2): >>36811005 #>>36811202 #
25. iampims ◴[] No.36809841{4}[source]
You could use Cloudflare R2, it's pretty cheap overall.
replies(1): >>36810068 #
26. unmole ◴[] No.36809844[source]
> use S3 or run minio

Is using Cloudflare R2 not an option?

replies(1): >>36810289 #
27. pech0rin ◴[] No.36809852[source]
I really want to love Fly.io. It's super easy to get set up and use, but to be honest I don't think anyone should be building mission-critical applications on their service. I ended up migrating everything over to AWS (which I reallllly didn't want to do) because:

* Frequent machines not working, random outages, builds not working

* Support wasn't responsive, didn't read my questions (kept asking same questions over and over again) -- I paid for a higher tier specifically for support.

* General lack of features (can't add sidecars, hard to integrate with external monitoring solutions)

* Lack of documentation -- For the happy path it's good, but for any edge cases the documentation is really lacking.

Anyway, for hobby projects it's fine and nice. I still host a lot of personal projects there. But I had to move my company's infrastructure off of it because it ended up costing us too much time/frustration, etc. I really had high hopes going into it as I had read it was a spiritual successor of sorts to Heroku, which was an amazing service in its day, but I don't think it's there yet.

replies(5): >>36809906 #>>36810044 #>>36810185 #>>36810377 #>>36815196 #
28. jdkfoo ◴[] No.36809864{3}[source]
Hosting service that cannot get basics right after a decade plus of solving these problems as an industry.

Are we even trying or just repeating ourselves because we don’t know what else to do?

How can the entire industry keep making the same basic errors?

“Let’s keep it simp… ohh nope we invented a Turing complete language and customer service is terri… wait do we have customer service?”

I get the world turning against SaaS lately.

Computers are so fast now, enthusiasts would be better served DIY; put a beige box in a local colo, use one of the big 3 for big business.

This is just starting to look disreputable and disrespectful to humanity itself putting such resources into one time bomb after another.

replies(3): >>36810449 #>>36812099 #>>36813156 #
29. windexh8er ◴[] No.36809870[source]
I tried Fly once, but, at the end of the day it seemed way too expensive for what it was and the completeness of the vision. And then I started to see the complaints in random corners of the Internet.

I don't read their blog regularly but I always thought they had great content. But not after reading this.

The irony: "What people actually wanted to talk about, though? Databases."

...but apparently not when they are the problem behind said databases?

replies(1): >>36810091 #
30. steve_adams_86 ◴[] No.36809880{3}[source]
I left digitalocean for fly because some of their tooling was excellent. I was pretty excited.

I’m back on digitalocean now. I’m not unhappy about it, they’re very solid. I don’t love some things about their services, but overall I’d highly recommend them to other developers.

I gave up on fly because I’d spontaneously be unable to automate deployments due to limited resources. Or I’d have previously happy deployments go missing with no automatic recovery. I didn’t realize this was happening to a number of my services until I started monitoring with 3rd party tools, and it became evident that I really couldn’t rely on them.

It’s a shame because I do like a lot of other things about them. Even for hobby work it didn’t seem worth the trouble. With digitalocean, everything “just works”. There’s no free tier, but the lower end of pricing means I can run several Go apps off of the same droplet for less than the price of a latte. It’s worth the sanity.

replies(4): >>36810127 #>>36810379 #>>36813660 #>>36813890 #
31. steve_adams_86 ◴[] No.36809906[source]
My experience was the same. I stopped using it for hobby projects recently when I had two consecutive days of being unable to build anything. The same stuff that built the week before, built fine locally, then eventually built on fly again — just, inexplicable downtime with no word from support.

Their free tier is very generous. You can get a lot happening and stay under their billing threshold. But, I like to get stuff done. I have a family. I code in my spare time very rarely, and I need a service that’ll let me just build my goddamn project. This was a small static site built by Node, so nothing spectacular happening.

I do wish them the best though. They have an excellent product in their tooling, and if they could stabilize their infrastructure I’d love to try them again.

replies(1): >>36811434 #
32. yowlingcat ◴[] No.36809908[source]
At least I could rely on Heroku in production. I've wanted to give Fly.io a try but this gives me pause. I really do miss the Heroku DX whenever I'm putzing around with the increasing complexity of AWS.
replies(1): >>36810196 #
33. q7xvh97o2pDhNrh ◴[] No.36809923[source]
Maybe just pick up 3 chonky EC2 boxes, set up iptables on each of them, have each one run a containerized version of your code that gets built and deployed from CI every time you push to Github, slap an ALB in front of it all, and call it a day?

And if you need state, then spin up a little RDS with your favorite SQL flavor of choice?

The CI deploy script could even bake in little health-checks so you can do rolling deploys with zero downtime. Depending on how fancy you wanted to get with your shell scripting, you could probably even make 1 of your 3 boxes a canary without too much trouble.

I'm realizing I haven't thought about this in a long time, since nowadays I just get to use the fancy stuff at work. Kind of a fun thought experiment!
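The rolling deploy with baked-in health checks described in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: `rolling_deploy`, `deploy`, and `is_healthy` are hypothetical names, and the two callables stand in for whatever a CI script would really do (for example, an SSH command that restarts a container and an HTTP probe). Treating the first host as the canary is one possible reading of the idea, not something the comment specifies.

```python
import time
from typing import Callable, List, Sequence


def rolling_deploy(
    hosts: Sequence[str],
    deploy: Callable[[str], None],      # hypothetical: pushes the new build to one host
    is_healthy: Callable[[str], bool],  # hypothetical: probes one host after the deploy
    retries: int = 3,
    delay_s: float = 0.0,
) -> List[str]:
    """Deploy to hosts one at a time, halting at the first failed health check.

    Because hosts are updated sequentially, the first host acts as a canary:
    if it never becomes healthy, the remaining hosts keep the old version.
    """
    done: List[str] = []
    for host in hosts:
        deploy(host)
        for _ in range(retries):
            if is_healthy(host):
                break
            time.sleep(delay_s)  # give the service time to start before re-probing
        else:
            # No probe succeeded: stop here so the untouched hosts keep serving.
            raise RuntimeError(f"{host} failed its health check; updated so far: {done}")
        done.append(host)
    return done
```

With three boxes behind a load balancer, a failure on the first box leaves the other two still serving the previous build, which is what makes the deploy effectively zero-downtime in this sketch.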

replies(1): >>36810004 #
34. emmelaich ◴[] No.36809928[source]
The irony or perhaps the tragedy of building a low friction service is that you have to have experts on the lower level high friction stuff.

I would hope that after a couple of hours' downtime, they'd bring up a fresh machine with Ansible or whatever. Hardware or an AWS/GCP VM.

replies(1): >>36810229 #
35. mike_d ◴[] No.36809932[source]
> I'll be damned if you can't have a throat to choke in less than 10 minutes whenever something like this happens

That is a hell of a generous description for a person who sits in your Slack instance and responds with "I have escalated to the team internally and am waiting to hear back on confirmation if this is an issue."

Moving a Level 1 support engineer closer to the customer doesn't give them more information, it just reduces the latency to getting a non-answer.

36. sho ◴[] No.36809933{3}[source]
No love for AWS, but this isn't true, at least for larger deploys. If you're running enough with them that you have an account manager, they are very good indeed. You can have someone, someone good, on the phone within minutes and they will stay on the line until the issue is sorted.

I recall an incident at my old company where we were under DDOS, it was getting through cloudflare and saturating LBs in some complicated manner (don't recall the exact details) which made it hard for us to fix ourselves. They were on the phone with us for hours, well past midnight their time, helping us sort it out. The downtime sucked, but I was certainly impressed with their truly excellent support.

37. pmarreck ◴[] No.36809948[source]
Reliability is everything. Why aren’t they monitoring their own machines (real or virtual) and getting fire alarms when there’s an outage?
replies(2): >>36810652 #>>36810746 #
38. iamyatin ◴[] No.36809955{3}[source]
I second render.com. I switched from fly.io to Render.com after seeing a few of my instances getting bottlenecked and crashing. Now the same service runs smoothly on render.com without any crashes. Didn't dig any deeper but somehow the resource management is better with render.com
39. sho ◴[] No.36809986[source]
Honestly these days I am leaning towards this approach: https://github.com/mrsked/mrsk/

It's all just docker.

replies(1): >>36810010 #
40. aledalgrande ◴[] No.36810004{3}[source]
The system you describe is quite the monthly bill, off the top of my head.
replies(2): >>36810166 #>>36810401 #
41. aledalgrande ◴[] No.36810010{3}[source]
Nah I don't wanna be responsible for running a control plane. I just wanna focus on the app, that's all.
replies(1): >>36810298 #
42. aledalgrande ◴[] No.36810014{3}[source]
I'll give them a run thanks!
43. ◴[] No.36810018{3}[source]
44. siquick ◴[] No.36810032[source]
We tried to migrate all our staging environments to Fly last year but it was the flakiest experience I’ve had on any PaaS. Pushing simple containers up would fail 70-80% of the time with no useful error messages and nonexistent support. It’s a weird company that seems great until you actually use them.
45. pier25 ◴[] No.36810039{3}[source]
So you didn't have a HA setup with multiple machines and volumes?
replies(1): >>36810268 #
46. rendaw ◴[] No.36810044[source]
Half the critical info for using their services is buried in some thread in the forum (posted by an employee). How bad is their documentation pipeline that they can't with similar effort get that same info in the documentation? Requests to put stuff in the docs go ignored.

The answer to _any_ usage related forum question should be:

1. It's in the documentation <here> (maybe I just added it)

2. If you're left with any confusion, let me know and I'll update the documentation to resolve it

replies(1): >>36810088 #
47. KingOfCoders ◴[] No.36810059[source]
Sad how this behaviour drags down LiteFS. I don't trust a company to build a database with that kind of culture.
48. Fire-Dragon-DoL ◴[] No.36810063{5}[source]
I don't have to use an object store, but using the filesystem makes setting up a server more costly: if I delete the instance, the data is gone. A volume kinda offsets this, but it's way less portable and accessible only by one instance at a time.

The peace of mind of managed is nice, all I have to think about is running the app, without having to deal with making sure db and files don't get lost

replies(1): >>36810481 #
49. Fire-Dragon-DoL ◴[] No.36810068{5}[source]
I did not realize they have an s3 compatible service
50. eropple ◴[] No.36810069{4}[source]
I was working for a pretty big early AWS customer--one that had realized that for the low low price of all your money you could make DynamoDB scale to some truly massive numbers--and one time when we were having trouble around noon Eastern, a colleague called up our TAM. As he told it, the TAM sounded half-asleep, so my colleague asked if everything was alright.

"I'm in Hawaii on my honeymoon and my backup missed your call, so it escalated."

I probably wouldn't have answered the phone. Granted, that's why I don't do that job. But I have always had a real appreciation for the good TAMs ever since.

replies(2): >>36810163 #>>36810261 #
51. KingOfCoders ◴[] No.36810078[source]
I wanted to give Fly.io a try in my next project but not with this operational culture. I regret telling my CTO clients about Fly.io as the next big thing in operations.
52. ◴[] No.36810088{3}[source]
53. cschmatzler ◴[] No.36810091{3}[source]
Their blog is great, because they invested heavily in perception from the outside. Coming from the Elixir world, them hiring Chris McCord (creator of Phoenix) and sponsoring a ton of open source projects slapping on their logo, seemed great at first, but when it comes to actually deploying stuff to production and day 2 operations (monitoring is so much more difficult than it should be, and troubleshooting tools are lacking) they are way behind. I can imagine them getting lots of hobby projects on board due to free tier and day 1 impression, but that won’t win over enterprises.
54. reustle ◴[] No.36810093[source]
According to the Status page, there was never an issue to begin with
55. danielvaughn ◴[] No.36810127{4}[source]
I adore DO. They’re seriously underrated. I love how they’ll just give you a server and say here, have at it. No abstractions, no fancy crap, just get out of my way and let me do my thing.
replies(10): >>36810554 #>>36810628 #>>36810638 #>>36812302 #>>36813142 #>>36813668 #>>36814283 #>>36823458 #>>36827607 #>>36834710 #
56. justinclift ◴[] No.36810147{4}[source]
Maybe there needs to be a better "burn in" test setup for their new hardware, just to catch mistakes in the build prep and/or catch bad hardware?
replies(1): >>36810305 #
57. justinclift ◴[] No.36810163{5}[source]
Wonder if that marriage lasted though? ;)
58. justinclift ◴[] No.36810166{4}[source]
You can do the same thing using Hetzner dedicated hosts fairly cheaply:

https://www.hetzner.com/dedicated-rootserver/matrix-ax

59. asaddhamani ◴[] No.36810185[source]
Curious to know, have you tried Render? What is the successor to Heroku in your eyes?
replies(3): >>36810504 #>>36811045 #>>36811655 #
60. donutshop ◴[] No.36810187[source]
Well that's not gonna fly.
61. meesterdude ◴[] No.36810190{3}[source]
I've also had success with render.com so far! Been running an app & DB for $14/mo for almost 6 months and it's been solid.
62. danjac ◴[] No.36810196{3}[source]
For hobby projects - where I dare not touch AWS for fear of going bankrupt from a misconfigured service - I found the sweet spot to be Dokku on top of a Hetzner or Digital Ocean instance. It provides a Heroku like interface on top of cheap hosting, and is fine where you don't expect to scale very much.
63. joecool1029 ◴[] No.36810207[source]
I had one situation where a Hetzner dedi didn't come back up on a reboot. Their dedis are cheap, this one is like $40ish/mo?

Opened a ticket and support had it back up again within about 10 minutes, turned out to be a failed CPU fan which caused an overheat condition and made it so the system wouldn't complete the boot. They swapped the fan and it came up. It's the only failure I've had in years of dealing with them and was just impressed how quickly a physical failure event like that got handled.

replies(1): >>36810370 #
64. ps ◴[] No.36810229{3}[source]
> I would hope that after a couple of hours downtime, they'd bring up a fresh machine with Ansible or whatever.

It is not just about a fresh machine which hopefully sits in each datacenter. I can imagine they needed the clone of the system due to the design of the fly.io service and that's where the "fun" begins.

65. danjac ◴[] No.36810231[source]
I use Dokku on top of Hetzner for my hobby projects - hosting is super cheap, for a little extra I can add a mounted volume for storage, and if the project outgrows a single server I can always just break out of Dokku and use some Docker containers behind a load balancer.

If you are outside of Europe, Digital Ocean or Linode may work better for you.

replies(3): >>36810263 #>>36810369 #>>36812847 #
66. ◴[] No.36810248{3}[source]
67. p-e-w ◴[] No.36810251{3}[source]
Whoa, what? That's a much bigger red flag than the downtime itself.
replies(1): >>36810657 #
68. throwaway220033 ◴[] No.36810259[source]
The worst thing about Fly is, when something goes wrong, it's not just one thing, there's a bunch of things broken at the same time and their status page will show everything green.

Their typical response is either silence or so casual ("oh this is what happens when we deploy on Friday"). The product looks amazing but it's just a nice package around the most unreliable hosting service I've ever used.

You can't just keep breaking people's work once a week, make them spend their weekend nights trying to bring back their stuff, and give these "we could have done better" answers. This is an excuse for exceptions, not patterns.

replies(2): >>36811046 #>>36811783 #
69. silisili ◴[] No.36810261{5}[source]