
797 points burnerbob | 2 comments
dcchambers ◴[] No.36809492[source]
I like fly.io a lot and I want them to succeed. They're doing challenging work...things break.

Have to admit it's disappointing to hear about the lack of communication from them, especially when it's something the CEO specifically called out that they wanted to fix in his big reliability post to the community back in March.

https://community.fly.io/t/reliability-its-not-great/11253#s...

replies(2): >>36809640 #>>36809652 #
jssjr ◴[] No.36809652[source]
I appreciate the honest feedback. We could have done better communicating about the problem. We've been marking single host failures in the dashboard for affected users and using our status page to reflect things like platform and regional issues, but there's clearly a spot in the middle where the status we're communicating and actual user experience don't line up.

We've been adding a ton more hardware lately to stay ahead of capacity issues and as you would expect this means the volume of hardware-shaped failures has increased even though the overall failure probability has decreased. There's more we can do to help users avoid these issues, there's more we can do to speed up recovery, and there's more we can do to let you know when you're impacted.
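
Made-up numbers, just to illustrate the fleet-size effect (not our actual figures):

    # hypothetical fleet: triple the hosts, lower per-host failure rate
    hosts_before, p_before = 1_000, 0.005   # 0.5% monthly failure chance
    hosts_after,  p_after  = 3_000, 0.003   # 0.3% monthly failure chance
    print(hosts_before * p_before)  # ~5 expected host failures per month
    print(hosts_after * p_after)    # ~9 expected host failures per month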

All this feedback matters. We hear it even when we drop the ball communicating.

replies(1): >>36809791 #
skullone ◴[] No.36809791[source]
What hardware are you buying? Across tens of thousands of physical nodes in my environment, only a few per year would have problems "fatal" enough to require manual intervention. Yes, we had hundreds of drives die a year and some ECC RAM would exceed error thresholds, but downtime on any given node was rare (aside from patching, and even then we'd just live-migrate KVM instances around as needed).
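
The live-migration piece is easy to script against libvirt, by the way. A rough sketch in Python (hostnames and VM name are placeholders, not anyone's real setup):

    import libvirt

    # placeholders: move guest-vm-01 from host-a to host-b without downtime
    src = libvirt.open("qemu+ssh://host-a.example/system")
    dst = libvirt.open("qemu+ssh://host-b.example/system")
    dom = src.lookupByName("guest-vm-01")

    flags = (libvirt.VIR_MIGRATE_LIVE                # keep the guest running during the copy
             | libvirt.VIR_MIGRATE_PERSIST_DEST      # define the guest on the target host
             | libvirt.VIR_MIGRATE_UNDEFINE_SOURCE)  # drop the definition on the source
    dom.migrate(dst, flags, None, None, 0)           # returns the new domain handle on success
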
replies(1): >>36810147 #
justinclift ◴[] No.36810147[source]
Maybe there needs to be a better "burn in" test setup for their new hardware, just to catch mistakes in the build prep and/or bad hardware?
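
Something along these lines maybe - a rough sketch of a burn-in pass (the tools are real, but the devices and durations are placeholders, not anything fly.io has described):

    import subprocess, sys

    DISKS = ["/dev/nvme0n1", "/dev/nvme1n1"]  # placeholder device list

    # hammer CPU and memory for an hour; a nonzero exit means something broke
    subprocess.run(
        ["stress-ng", "--cpu", "0", "--vm", "4", "--vm-bytes", "75%",
         "--timeout", "1h", "--metrics-brief"],
        check=True,
    )

    # then ask each disk for its overall SMART health verdict
    for disk in DISKS:
        if subprocess.run(["smartctl", "-H", disk]).returncode != 0:
            sys.exit(f"{disk} failed SMART health check")
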
replies(1): >>36810305 #
skullone ◴[] No.36810305[source]
Not that nothing will fail - but some manufacturers have really good fault management, monitoring, alerting, etc. Even the simplest shit like SNMP with a few custom MIBs from the vendor helps (and some vendors do it better than others). Facilities and vendors that lend a good hand with remote hands are also nice, should your remote management infrastructure fail. Out-of-band, full-featured management cards with all the trimmings work so well, and some do good Redfish BMC/JSON/API stuff on top of the usual SNMP and other nice built-in easy buttons. Today's tooling with bare metal and KVM can work around faults quite seamlessly. There are even good NVMe RAID options if you absolutely must have your local box with mirrored data protection, and 10/40/100Gbps cards with a good libvirt setup will migrate large VMs in mere minutes, resuming on the remote end with barely a 1ms blip.
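
The Redfish side is just HTTP too, e.g. a rough sketch of polling a BMC for system health (address and credentials are obviously placeholders):

    import requests

    BMC = "https://10.0.0.42"       # placeholder BMC address
    AUTH = ("admin", "password")    # placeholder credentials

    # /redfish/v1/Systems is the standard collection; each member carries a
    # Status object with Health = OK / Warning / Critical
    systems = requests.get(f"{BMC}/redfish/v1/Systems", auth=AUTH, verify=False).json()
    for member in systems["Members"]:
        system = requests.get(f"{BMC}{member['@odata.id']}", auth=AUTH, verify=False).json()
        print(system["Id"], system["Status"]["Health"])
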
replies(2): >>36810358 #>>36810561 #
1. justinclift ◴[] No.36810561[source]
Good point. :)

I'm still wondering about their hardware acceptance/qualification though, prior to it being deployed. ;)

replies(1): >>36810672 #
2. skullone ◴[] No.36810672[source]
Yah presumably they put stuff through it's paces and give everything good fit and finish before running workloads. But failures do happen either way