797 points by burnerbob | 18 comments
1. dcchambers ◴[] No.36809492[source]
I like fly.io a lot and I want them to succeed. They're doing challenging work...things break.

Have to admit it's disappointing to hear about the lack of communication from them, especially when it's something the CEO specifically called out that they wanted to fix in his big reliability post to the community back in March.

https://community.fly.io/t/reliability-its-not-great/11253#s...

replies(2): >>36809640 #>>36809652 #
2. mrcwinn ◴[] No.36809640[source]
Yes, this. It's tough when you've already played your "we messed up but we're making it right" card, and then you still don't get it right.
replies(1): >>36809864 #
3. jssjr ◴[] No.36809652[source]
I appreciate the honest feedback. We could have done better communicating about the problem. We've been marking single host failures in the dashboard for affected users and using our status page to reflect things like platform and regional issues, but there's clearly a spot in the middle where the status we're communicating and actual user experience don't line up.

We've been adding a ton more hardware lately to stay ahead of capacity issues and as you would expect this means the volume of hardware-shaped failures has increased even though the overall failure probability has decreased. There's more we can do to help users avoid these issues, there's more we can do to speed up recovery, and there's more we can do to let you know when you're impacted.
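To make that concrete with illustrative numbers (these are assumptions, not Fly.io's actual figures): a lower per-host failure rate spread across a much larger fleet still means more absolute failures.

    # Illustrative only: made-up rates and fleet sizes, not Fly.io's numbers.
    old_hosts, old_rate = 1_000, 0.02    # 2% annual per-host failure rate
    new_hosts, new_rate = 3_000, 0.015   # better per-host reliability, much larger fleet
    print(old_hosts * old_rate, "->", new_hosts * new_rate)  # 20.0 -> 45.0 expected failed hosts/year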

All this feedback matters. We hear it even when we drop the ball communicating.

replies(1): >>36809791 #
4. skullone ◴[] No.36809791[source]
What hardware are you buying? Across tens of thousands of physical nodes in my environment, only a few per year would have problems fatal enough to require manual intervention. Yes, we had hundreds of drives die a year and some ECC RAM would exceed error thresholds, but downtime on any given node was rare (aside from patching, and we'd just live migrate KVM instances around as needed).
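For anyone curious, that kind of evacuation is only a few lines against libvirt. A minimal sketch, with hypothetical host and guest names:

    # Minimal live-migration sketch using the libvirt Python bindings.
    # "src-host", "dst-host" and "web-01" are hypothetical placeholders.
    import libvirt

    src = libvirt.open("qemu+ssh://src-host/system")   # source hypervisor
    dst = libvirt.open("qemu+ssh://dst-host/system")   # destination hypervisor

    dom = src.lookupByName("web-01")                   # running guest to evacuate
    # VIR_MIGRATE_LIVE copies memory while the guest keeps running;
    # VIR_MIGRATE_PERSIST_DEST saves the guest config on the destination.
    dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE | libvirt.VIR_MIGRATE_PERSIST_DEST, None, None, 0)

    src.close()
    dst.close()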
replies(1): >>36810147 #
5. jdkfoo ◴[] No.36809864[source]
A hosting service that can't get the basics right after the industry has spent a decade-plus solving these problems.

Are we even trying or just repeating ourselves because we don’t know what else to do?

How can the entire industry keep making the same basic errors?

“Let’s keep it simp… ohh nope we invented a Turing complete language and customer service is terri… wait do we have customer service?”

I get the world turning against SaaS lately.

Computers are so fast now that enthusiasts would be better served by DIY: put a beige box in a local colo, and use one of the big 3 for big business.

This is just starting to look disreputable, and disrespectful to humanity itself, to keep putting such resources into one time bomb after another.

replies(3): >>36810449 #>>36812099 #>>36813156 #
6. justinclift ◴[] No.36810147{3}[source]
Maybe there needs to be a better "burn in" test setup for their new hardware, just to catch mistakes in the build prep and/or weed out bad hardware?
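Something like the sketch below is often enough to flush out obviously bad boxes before they take workloads. It assumes stress-ng is installed and the kernel exposes EDAC error counters; the one-hour duration is an arbitrary placeholder.

    # Rough burn-in sketch: hammer CPU and memory, then check ECC error counters.
    import glob
    import subprocess

    def ecc_errors():
        """Sum corrected/uncorrected ECC error counts across memory controllers."""
        total = 0
        for path in glob.glob("/sys/devices/system/edac/mc/mc*/[cu]e_count"):
            with open(path) as f:
                total += int(f.read().strip())
        return total

    before = ecc_errors()
    # Load all CPUs and most of RAM for an hour; tune workers/duration for real burn-in.
    subprocess.run(["stress-ng", "--cpu", "0", "--vm", "2",
                    "--vm-bytes", "75%", "--timeout", "1h"], check=True)
    after = ecc_errors()

    if after > before:
        raise SystemExit(f"FAIL: {after - before} new ECC errors during burn-in")
    print("PASS: no new ECC errors")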
replies(1): >>36810305 #
7. skullone ◴[] No.36810305{4}[source]
Not that nothing will fail - but some manufacturers have really good fault management, monitoring, alerting, etc. Even the simplest shit like SNMP with a few custom MIBs from the vendor helps (and some vendors do it better than others). Facilities and vendors that lend a good hand with remote hands are also nice if your remote management infrastructure should fail. But out-of-band, full-featured management cards with all the trimmings work so well. Some do good Redfish BMC/JSON/API stuff too, on top of the usual SNMP and other nice built-in easy buttons. And today's tooling with bare metal and KVM can work around faults quite seamlessly. There are even good NVMe RAID options if you absolutely must have your local box with mirrored data protection, and 10/40/100Gbps cards with a good libvirt setup can migrate large VMs in mere minutes, resuming on the remote end with barely a 1ms blip.
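As a rough illustration of how little the Redfish side takes (the BMC address, credentials, and self-signed-cert handling here are all placeholder assumptions):

    # Sketch: poll a BMC's Redfish API for overall system health.
    import requests

    BMC = "https://10.0.0.50"          # out-of-band management IP (hypothetical)
    AUTH = ("admin", "changeme")       # placeholder credentials
    # verify=False assumes the factory self-signed cert hasn't been replaced yet.

    systems = requests.get(f"{BMC}/redfish/v1/Systems", auth=AUTH, verify=False).json()
    for member in systems["Members"]:
        system = requests.get(f"{BMC}{member['@odata.id']}", auth=AUTH, verify=False).json()
        status = system.get("Status", {})
        print(system.get("Id"), status.get("Health"), status.get("State"))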
replies(2): >>36810358 #>>36810561 #
8. timc3 ◴[] No.36810358{5}[source]
Could you expand your answer to list vendors which you would recommend?
replies(1): >>36810645 #
9. tetha ◴[] No.36810449{3}[source]
The thing is, running a good SaaS service requires quite a bit of staff with hard operational skills and a lot of manpower. You know, the kinda stuff people always call useless, zero-value-add, blockers, and something to automate away entirely.

Sure, we have most of the day-to-day grunt work for our applications automated. But good operations is just more than that. It's about maintaining control over your infrastructure on one hand, and making sure your customers feel informed and safe about their data and systems on the other. This is hard and takes lots of experience to do well, as well as manpower.

And yes, that's entirely a soft skill. You end up with questions such as: Should we elevate this issue to an outage on the status page? To a degree you'd be scaring other customers: "Oh no, yellow status page. Something terrible must be happening!" At the same time you're communicating to the affected customers just how seriously you're taking their issues: "It's a thing on the status page after an initial misjudgement - sorry for that." We have many discussions like that during degradations and outages.

replies(1): >>36810508 #
10. jdkfoo ◴[] No.36810508{4}[source]
Patronizing to assume this is obscure wisdom at this juncture.

Scared customers seems a bit… puerile? In a Sunday school way? Are we not adults capable of rational discourse?

“Why is line not go up!!” still? Just continues to smell like busy work in deference to a politically mandated hallucination.

11. justinclift ◴[] No.36810561{5}[source]
Good point. :)

I'm still wondering about their hardware acceptance/qualification though, prior to it being deployed. ;)

replies(1): >>36810672 #
12. skullone ◴[] No.36810645{6}[source]
"it depends". Dell is fairly good overall, on-site techs are outsourced subcontractors a lot so that can be a mixed bag, pushy sales. Supermicro is good on a budget, not quite mature full fault management or complete SNMP or redfish, they can EOL a new line of gear suddenly.
replies(1): >>36811260 #
13. skullone ◴[] No.36810672{6}[source]
Yah, presumably they put stuff through its paces and give everything a good fit and finish before running workloads. But failures do happen either way.
14. justinclift ◴[] No.36811260{7}[source]
Have you come across Fujitsu PRIMERGY servers before?

https://www.fujitsu.com/global/products/computing/servers/pr...

I used to use them a few years ago in a local data centre, and they were pretty good back then.

They don't seem to be widely known about though.

replies(1): >>36817359 #
15. FridgeSeal ◴[] No.36812099{3}[source]
> Computers are so fast now,

Agreed

> enthusiasts would be better served DIY; put a beige box in a local colo

I mean, like, can I provision a zero ops bit of compute from <mystery colo provider> for $20/month?

Edit: looked up colo providers in my city: "get started in 24 hours, pick a rack and amperage, schedule a call now." Yeaaah, no. This is why people use cloud providers instead.

16. api ◴[] No.36813156{3}[source]
I just got gigabit bidirectional fiber at home, and honestly, if I were doing personal stuff or very early bootstrapping I'd just host from here with a good UPS. No, it wouldn't be data-center reliability, but it'd work at least until things were ready to move to something more resilient.

You can pay for a business-class fiber link too. It's about twice as expensive, but it comes with guaranteed outage response times, which is really what you pay for.

17. skullone ◴[] No.36817359{8}[source]
Have not - looks nice though. Around here you'll mostly only encounter Dell/Supermicro/HP/Lenovo. I actually find Dell to have achieved the lowest "friction" for deployments. You can get device manifests before the gear even ships, including MAC addresses, serials, out-of-band NIC MACs, etc. We pre-stage our configurations based on this and have everything ready to go (rack location/RU, switch ports, PDUs, DHCP/DNS). We literally just plug it all up and power on, and our tools take care of the rest without any intervention. Just verify the serial number of the server and stick it in the right rack unit, done.
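That pre-staging can be as small as turning the manifest into DHCP reservations; a sketch with made-up CSV column names and dnsmasq-style output:

    # Sketch: turn a shipping manifest (CSV with hypothetical column names) into
    # dnsmasq host reservations so new servers boot straight into provisioning.
    import csv

    with open("manifest.csv") as f, open("dnsmasq-hosts.conf", "w") as out:
        for row in csv.DictReader(f):
            # e.g. serial=ABC1234, bmc_mac=aa:bb:cc:dd:ee:01, rack=r12, ru=17
            hostname = f"{row['rack']}-u{row['ru']}-{row['serial'].lower()}"
            out.write(f"dhcp-host={row['bmc_mac']},{hostname}\n")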
replies(1): >>36824904 #
18. justinclift ◴[] No.36824904{9}[source]
> You can get device manifests before the gear even ships, including MAC addresses, serials, out of band NIC MAC, etc.

That does sound pretty useful.

So for yourselves, you rack them and then run hardware qualification tests?