
797 points by burnerbob | 8 comments
dcchambers ◴[] No.36809492[source]
I like fly.io a lot and I want them to succeed. They're doing challenging work...things break.

Have to admit it's disappointing to hear about the lack of communication from them, especially when it's something the CEO specifically called out that they wanted to fix in his big reliability post to the community back in March.

https://community.fly.io/t/reliability-its-not-great/11253#s...

replies(2): >>36809640 #>>36809652 #
jssjr ◴[] No.36809652[source]
I appreciate the honest feedback. We could have done better communicating about the problem. We've been marking single host failures in the dashboard for affected users and using our status page to reflect things like platform and regional issues, but there's clearly a spot in the middle where the status we're communicating and actual user experience don't line up.

We've been adding a ton more hardware lately to stay ahead of capacity issues and as you would expect this means the volume of hardware-shaped failures has increased even though the overall failure probability has decreased. There's more we can do to help users avoid these issues, there's more we can do to speed up recovery, and there's more we can do to let you know when you're impacted.

All this feedback matters. We hear it even when we drop the ball communicating.

replies(1): >>36809791 #
skullone ◴[] No.36809791[source]
What hardware are you buying? Across tens of thousands of physical nodes in my environment, only a few per year would have problems "fatal" enough to require manual intervention. Yes, we had hundreds of drives die a year and some ECC RAM would exceed error thresholds, but downtime on any given node was rare (aside from patching, and for that we'd just live migrate KVM instances around as needed).
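
For illustration, the "live migrate KVM instances around" step can be a few lines with the libvirt Python bindings. This is a minimal sketch, assuming passwordless SSH between hypervisors; the VM name and host URIs are placeholders, not details from this thread:

```python
# Minimal live-migration sketch using the libvirt Python bindings.
# The domain name and destination URI below are hypothetical.
import libvirt

VM_NAME = "guest-01"                                # hypothetical domain name
DEST_URI = "qemu+ssh://node-b.example.com/system"   # hypothetical target host

src = libvirt.open("qemu:///system")   # source hypervisor (local)
dst = libvirt.open(DEST_URI)           # destination hypervisor

dom = src.lookupByName(VM_NAME)

# VIR_MIGRATE_LIVE keeps the guest running while its memory is copied over.
dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)   # bandwidth=0 -> no cap

print(f"{VM_NAME} is now running on {DEST_URI}")
src.close()
dst.close()
```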
replies(1): >>36810147 #
justinclift ◴[] No.36810147{3}[source]
Maybe there needs to be a better "burn in" test setup for their new hardware, just to catch mistakes in the build prep and/or bad hardware?
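
A burn-in pass along these lines is often just sustained load plus a check that nothing complained afterwards. A rough sketch, assuming stress-ng and smartmontools are installed; the device list, log keywords, and 24h duration are placeholder choices:

```python
# Rough burn-in sketch: load the box for a while, then check that nothing
# complained. Device names and the 24h duration are placeholders.
import subprocess

DISKS = ["/dev/nvme0n1", "/dev/nvme1n1"]   # hypothetical device list

def run(cmd):
    print("+", " ".join(cmd))
    return subprocess.run(cmd, capture_output=True, text=True)

# Exercise CPU and memory for 24 hours (all CPUs, 75% of RAM).
run(["stress-ng", "--cpu", "0", "--vm", "2", "--vm-bytes", "75%",
     "--timeout", "24h", "--metrics-brief"])

# SMART overall health for each drive.
for disk in DISKS:
    result = run(["smartctl", "-H", disk])
    if "PASSED" not in result.stdout and "OK" not in result.stdout:
        print(f"FAIL: {disk} did not pass SMART health check")

# Scan the kernel log for machine-check / ECC complaints accumulated
# during the burn-in.
dmesg = run(["dmesg", "--level=err,crit"]).stdout.lower()
for keyword in ("mce", "machine check", "ecc", "i/o error"):
    if keyword in dmesg:
        print(f"FAIL: '{keyword}' found in kernel log")
```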
replies(1): >>36810305 #
1. skullone ◴[] No.36810305{4}[source]
Not that nothing will fail - but some manufacturers have really good fault management, monitoring, alerting, etc. Even the simplest shit like SNMP with a few custom MIBs from the vendor helps (and some vendors do it better than others). Facilities and vendors that lend a good hand with remote hands are also nice, should your remote management infrastructure fail. But out-of-band, full-featured management cards with all the trimmings work so well. Some do good Redfish BMC/JSON/API stuff on top of the usual SNMP and other nice built-in easy buttons. And today's bare-metal and KVM tooling makes working around faults quite seamless. There are even good NVMe RAID options if you absolutely must have a local box with mirrored data protection, and 10/40/100Gbps cards with a good libvirt setup will migrate large VMs in mere minutes, resuming on the remote end with barely a 1ms blip.
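
As a concrete example of the Redfish side, polling a BMC for overall system health is a couple of HTTP calls. A hedged sketch: the BMC address and credentials are placeholders, and while the paths follow the DMTF Redfish schema, vendors differ in the details:

```python
# Sketch of polling a BMC's Redfish API for system health. The BMC address
# and credentials are placeholders; the resource paths follow the standard
# /redfish/v1/Systems layout, but vendor implementations vary.
import requests

BMC = "https://10.0.0.50"            # hypothetical out-of-band address
AUTH = ("admin", "password")         # hypothetical credentials

session = requests.Session()
session.auth = AUTH
session.verify = False               # many BMCs ship self-signed certs

# Enumerate the systems exposed by this BMC (usually just one).
systems = session.get(f"{BMC}/redfish/v1/Systems").json()
for member in systems.get("Members", []):
    system = session.get(f"{BMC}{member['@odata.id']}").json()
    status = system.get("Status", {})
    print(system.get("SerialNumber"), system.get("Model"),
          "Health:", status.get("Health"),      # e.g. "OK", "Warning", "Critical"
          "PowerState:", system.get("PowerState"))
```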
replies(2): >>36810358 #>>36810561 #
2. timc3 ◴[] No.36810358[source]
Could you expand your answer to list vendors which you would recommend?
replies(1): >>36810645 #
3. justinclift ◴[] No.36810561[source]
Good point. :)

I'm still wondering about their hardware acceptance/qualification prior to deployment, though. ;)

replies(1): >>36810672 #
4. skullone ◴[] No.36810645[source]
"it depends". Dell is fairly good overall, on-site techs are outsourced subcontractors a lot so that can be a mixed bag, pushy sales. Supermicro is good on a budget, not quite mature full fault management or complete SNMP or redfish, they can EOL a new line of gear suddenly.
replies(1): >>36811260 #
5. skullone ◴[] No.36810672[source]
Yeah, presumably they put stuff through its paces and give everything a good fit and finish before running workloads. But failures do happen either way.
6. justinclift ◴[] No.36811260{3}[source]
Have you come across Fujitsu PRIMERGY servers before?

https://www.fujitsu.com/global/products/computing/servers/pr...

I used to use them a few years ago in a local data centre, and they were pretty good back then.

They don't seem to be widely known, though.

replies(1): >>36817359 #
7. skullone ◴[] No.36817359{4}[source]
Have not - looks nice though. Around here, you'll mostly encounter Dell/Supermicro/HP/Lenovo. I actually find Dell to have achieved the lowest "friction" for deployments. You can get device manifests before the gear even ships, including MAC addresses, serials, out-of-band NIC MACs, etc. We pre-stage our configurations based on this and have everything ready to go (rack location/RU, switch ports, PDUs, DHCP/DNS). We literally just plug it all up and power on, and our tools take care of the rest without any intervention. Just verify the serial number of the server and stick it in the right rack unit, done.
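
To make the pre-staging step concrete: one way to turn a manifest like that into DHCP reservations is a short script run before the gear arrives. A sketch only; the CSV column names, IP plan, and dnsmasq output are assumptions about one possible setup, not Dell's actual export format:

```python
# Sketch of pre-staging from a vendor-supplied device manifest: emit dnsmasq
# DHCP reservations for the host NIC and the out-of-band (iDRAC) NIC.
# Column names and the IP plan are assumptions, not Dell's real export format.
import csv

with open("manifest.csv") as f, open("dnsmasq-hosts.conf", "w") as out:
    for i, row in enumerate(csv.DictReader(f)):
        hostname = f"node-{row['ServiceTag'].lower()}"   # e.g. node-abc1234
        host_ip = f"10.10.0.{10 + i}"                    # simple sequential plan
        oob_ip = f"10.10.1.{10 + i}"
        # dnsmasq syntax: dhcp-host=<MAC>,<hostname>,<IP>
        out.write(f"dhcp-host={row['MacAddress']},{hostname},{host_ip}\n")
        out.write(f"dhcp-host={row['IdracMac']},{hostname}-oob,{oob_ip}\n")
```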
replies(1): >>36824904 #
8. justinclift ◴[] No.36824904{5}[source]
> You can get device manifests before the gear even ships, including MAC addresses, serials, out of band NIC MAC, etc.

That does sound pretty useful.

So for yourselves, you rack them, then run hardware qualification tests?