Reliability: It’s not great

1. throwawaaarrgh ◴[07 Mar 23 04:27 UTC] No.35051550[source]▶

I've been doing reliability stuff for near two decades. The one thing I am sure of is there is no way to just engineer your way to reliability. That is to say, no person, no matter how smart, can just invent some whizbang engineering thing and suddenly you have reliability.

Reliability is a thing that grows, like a plant. You start out with a new system or piece of software. It's fragile, small, weak. It is threatened by competing things and literal bugs and weather and the soil it's grown in and more. It needs constant care. Over time it grows stronger, and can eventually fend for itself pretty well. Sometimes you get lucky and it just grows fine by itself. And sometimes 50 different things conspire to kill it. But you have to be there monitoring it, finding the problems, learning how to prevent them. Every garden is a little different.

It doesn't matter what a company like Fly does technology-wise. It takes time and care and churning. Eventually they will be reliable. But the initial process takes a while. And every new piece of tech they throw in is another plant in the garden.

So the good news is, they can become really reliable. But the bad news is, it doesn't come fast, and the more new plants they put in the ground, the more concerns there are to address before the garden is self-sustaining.

replies(7): >>35051647 #>>35052736 #>>35052993 #>>35053029 #>>35053323 #>>35056046 #>>35056972 #

2. mrkurt ◴[07 Mar 23 04:41 UTC] No.35051647[source]▶

>>35051550 (TP) #

This is an excellent description.

3. IshKebab ◴[07 Mar 23 07:54 UTC] No.35052736[source]▶

>>35051550 (TP) #

Maybe so but there are definitely technology choices that have vastly different "initial reliability". What would you expect to be more reliable - a bash script or a Rust program?

I'm not that familiar with setting up global network infrastructure but I imagine there are similar choices that can vastly affect initial reliability.

replies(1): >>35052975 #

4. capableweb ◴[07 Mar 23 08:27 UTC] No.35052975[source]▶

>>35052736 #

I'd say it depends more on the person rather than the technology.

A master in bash will build a more reliable API (in bash no less!) than a beginner in Rust, simply because of experience and knowing their way around the tools they're using. Newer/different technologies won't simply solve a problem unless the person has some sort of domain knowledge of said problem.

replies(1): >>35054704 #

5. zamnos ◴[07 Mar 23 08:29 UTC] No.35052993[source]▶

>>35051550 (TP) #

There are no silver bullets for whole-system reliability, but high-availability clustered databases were this whizbang thing that greatly improved the reliability of your database, back in the day. It didn't come cheap, and there were growing pains, but sometimes the available technology does make a difference.

6. unxdfa ◴[07 Mar 23 08:33 UTC] No.35053029[source]▶

>>35051550 (TP) #

You can make sensible assumptions that result in engineering gains though. Step around the problems not through them.

For example, I have learned that the first step to reliability is removing as many HashiCorp products from your stack as possible. Appears I am not the only one.

replies(1): >>35055780 #

7. TheDong ◴[07 Mar 23 09:20 UTC] No.35053323[source]▶

>>35051550 (TP) #

> The one thing I am sure of is there is no way to just engineer your way to reliability. That is to say, no person, no matter how smart, can just invent some whizbang engineering thing and suddenly you have reliability.

It seems true for Fly's problem space, but in many problem spaces there really are easy engineering solutions to reliability problems.

For a very easy example, I once worked on a Rails app that crashed frequently and managed 5 req/s at best. It turns out the app only loaded static data from hardcoded JSON files on disk and templated that into stuff. In other words, it was a static site. Replacing it with an actual static site + nginx and a CDN instantly fixed all reliability issues for that website forever, and made it easier to maintain the content to boot.

replies(1): >>35053748 #

8. machinawhite ◴[07 Mar 23 10:28 UTC] No.35053748[source]▶

>>35053323 #

I'm actually surprised such a simple app would have such bad performance and crash at all?

replies(1): >>35063977 #

9. throwawaaarrgh ◴[07 Mar 23 12:48 UTC] No.35054704{3}[source]▶

>>35052975 #

I agree. Like, you could build two houses: one out of sticks, the other brick. Depending on who's building it, if they've not built a house out of brick before, it's very easy to make a mistake. Versus someone who's been building stick houses forever will get it right the first time.

Also, brand new software in general is like a new hybrid plant. How does it behave in this environment compared to other plants? Does it attract more bugs? Does it need different care? We don't know yet; it's new.

And even for an old well known plant, if the gardener hasn't gotten to know it yet, it's easy to make a mistake with its care. But a well known plant with a gardener who's grown it before is the most likely to work without issue.

replies(1): >>35061281 #

10. jen20 ◴[07 Mar 23 14:36 UTC] No.35055780[source]▶

>>35053029 #

If you’ve been using them in ways explicitly called out as not per the design goals, then sure, removing any piece of technology will help you. I’m guessing that is not your actual problem though.

replies(1): >>35059894 #

11. bennetthi ◴[07 Mar 23 14:59 UTC] No.35056046[source]▶

>>35051550 (TP) #

Nicely said. I remember AWS outages (S3, EBS, and RDS) in the early 2010s when their products were younger. But given time to improve each has become more and more resilient.

12. HPsquared ◴[07 Mar 23 16:09 UTC] No.35056972[source]▶

>>35051550 (TP) #

Good engineering helps reliability but doesn't guarantee it.

Bad engineering causes bad reliability.

13. unxdfa ◴[07 Mar 23 19:22 UTC] No.35059894{3}[source]▶

>>35055780 #

I would not assume that HashiCorp products necessarily meet the design goals, if I'm honest. Consul and Vagrant have been absolute shits, and Vault adds more complexity and unreliability to the problem domain and has a net negative ROI. I like the idea of their products but the reality is very different.

14. IshKebab ◴[07 Mar 23 21:11 UTC] No.35061281{4}[source]▶

>>35054704 #

> Depending on who's building it, if they've not built a house out of brick before, it's very easy to make a mistake. Versus someone who's been building stick houses forever will get it right the first time.

Decent analogy. And of course if you have people of vaguely similar skill levels then the brick house is going to be way more robust. Which was my point.

15. TheDong ◴[08 Mar 23 01:28 UTC] No.35063977{3}[source]▶

>>35053748 #

I don't think the fact that it did effectively:

    data_1 = `cat ./data1.json | grep "city" | awk ....`
    data_2 = `cat ./data2.json | grep "city" | awk ....`

was exactly helping it to perform well. I'm sure rewriting the Rails app to load all the data at startup, rather than reading each file via several hundred subshells on each request, would have made it perform well enough.
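
A rough sketch of what "load all the data at startup" could look like in Ruby. This is not the original app's code: the `StaticData` class, the `city` lookup, and the throwaway file standing in for `./data1.json` are all made up for illustration.

```ruby
require "json"
require "tmpdir"

# Hypothetical store: parses every *.json file in a directory once, at boot,
# and keeps the results in memory.
class StaticData
  def initialize(dir)
    @data = Dir.glob(File.join(dir, "*.json")).to_h { |path|
      # Key each parsed file by its basename, e.g. "data1" for data1.json.
      [File.basename(path, ".json"), JSON.parse(File.read(path)).freeze]
    }.freeze
  end

  # Per-request lookup: a hash read, with no subshell and no disk access.
  def city(name)
    @data.fetch(name).fetch("city")
  end
end

# Demo: write a temporary file, load it, then let the directory be deleted.
# The lookup afterwards still works because the data already lives in memory.
store = Dir.mktmpdir do |dir|
  File.write(File.join(dir, "data1.json"), JSON.generate("city" => "Lisbon"))
  StaticData.new(dir)
end

puts store.city("data1")
```

In a real Rails app the equivalent would probably live in an initializer or a constant, but the idea is the same: pay the parsing cost once per process instead of once per request.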

However, pretty much no matter how well or poorly the Rails site is built, a static site will be easier to run reliably.