1226 points bishopsmother | 5 comments
pyentropy No.35048126
Almost half of the issues are caused by their use of HashiCorp products.

As someone who has started tons of Consul clusters, analyzed tons of Terraform states, developed providers, and written an HCL parser, I must say this:

HashiCorp built a brand of consistent design & docs, security, strict configuration, distributed-algos-made-approachable... but at its core, it's a very fragile ecosystem. The only benefit of HashiCorp headaches is that you will quickly learn Golang while reading some obscure github.com/hashicorp/blah/blah/file.go :)

replies(2): >>35048318 #>>35049109 #
tptacek No.35048318
We are asking HashiCorp products to do things they were not designed to do, in configurations that they don't expect to be deployed in. Take a step back, and the idea of a single global namespace bound up with Raft consistency for a fleet deployed in dozens of regions, providing near-real-time state propagation, is just not at all reasonable. Our state propagation needs are much closer to those of a routing protocol than a distributed key-value database.

I have only positive things to say about every HashiCorp product I've worked with since I got here.

replies(3): >>35048609 #>>35049327 #>>35050286 #
pyentropy No.35048609
I respect that. Can you elaborate a bit on the routing protocol thing? I assume you used WAN gossip?

I love the simplicity of fly.io & wish you all the best improving Fly's reliability!

replies(2): >>35048792 #>>35048795 #
tptacek No.35048792
If you've ever implemented IS-IS or OSPF before, like 80% of the work is "LSP flooding", which is just the process that gets updates about available links from one end of the network to another as fast as possible without drowning the links themselves in update messages. Flooding algorithms don't build consensus, unlike Raft quorums, which intrinsically have a centralized set of authorities that keep a single source of truth for all the valid updates.
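
To make the flooding part concrete, here's a rough sketch of the idea in Go (purely illustrative, with invented names; nothing like OSPF's or fly.io's actual code). Each node keeps the highest sequence number it has accepted per origin, and re-forwards only strictly newer advertisements to every neighbor except the one it heard them from:

    // Purely illustrative flooding sketch; not real OSPF or fly.io code.
    package flood

    // Update is one advertisement from a single origin node.
    type Update struct {
        Origin string         // node that produced the advertisement
        Seq    uint64         // monotonically increasing per origin
        Links  map[string]int // neighbor -> cost (the payload being flooded)
    }

    // Node holds its link-state database and its direct neighbors.
    type Node struct {
        Name      string
        Neighbors []*Node
        seen      map[string]uint64 // highest Seq accepted per origin
        DB        map[string]Update // origin -> latest advertisement
    }

    // Handle accepts an update from a neighbor and floods it onward.
    func (n *Node) Handle(from string, u Update) {
        if n.seen == nil {
            n.seen = map[string]uint64{}
            n.DB = map[string]Update{}
        }
        if u.Seq <= n.seen[u.Origin] {
            return // stale or duplicate: the flood stops here
        }
        n.seen[u.Origin] = u.Seq
        n.DB[u.Origin] = u
        for _, nb := range n.Neighbors {
            if nb.Name == from {
                continue // never flood back toward the sender
            }
            nb.Handle(n.Name, u)
        }
    }

No quorum and no leader anywhere in that loop; every reachable node just ends up holding the latest advertisement from each origin.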

An OSPF router uses those updates to build a forwarding table with a single-source shortest-path-first routine, but there's nothing to say that you couldn't instead use the same notion of publishing weighted advertisements of connectivity to, for instance, build a table to map incoming HTTP requests to backends that can field them.
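
The second half of the analogy is just as mechanical: instead of feeding the flooded database into Dijkstra, you fold each host's weighted advertisements into a per-app backend table and take the cheapest candidate per request. Again a hedged sketch with invented types, not how any real proxy does it:

    // Illustrative only: fold flooded per-host advertisements into a
    // forwarding table mapping each app to its candidate backends.
    package routes

    import "sort"

    // Advertisement is what each worker floods: the apps it can serve and
    // a weight per app (think load, or RTT from the edge taking the request).
    type Advertisement struct {
        Host string
        Apps map[string]int // app -> weight
    }

    // Backend is one routable target in the forwarding table.
    type Backend struct {
        Host   string
        Weight int
    }

    // BuildTable folds the latest advertisement from every host into a
    // per-app list of backends, sorted so the best candidate comes first.
    func BuildTable(ads []Advertisement) map[string][]Backend {
        table := map[string][]Backend{}
        for _, ad := range ads {
            for app, w := range ad.Apps {
                table[app] = append(table[app], Backend{Host: ad.Host, Weight: w})
            }
        }
        for app := range table {
            bs := table[app]
            sort.Slice(bs, func(i, j int) bool { return bs[i].Weight < bs[j].Weight })
        }
        return table
    }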

The point is, if you're going to do distributed consensus, you've got a dilemma: either you're going to have the Ents moot in a single forest, close together, and round trip updates from across the globe in and out of that forest (painfully slow to get things in and out of the cluster), or you're going to try to have them moot long distance (painfully slow to have the cluster converge). The other thing you can do, though, is just sidestep this: we really don't have the Raft problem at all, in that different hosts on our network do not disagree with each other about whether they're running particular apps; if worker-sfu-ord-1934 says it's running an instance of app-4839, I pretty much don't give a shit if worker-sfu-maa-382a says otherwise; I can just take ORD's word for it.

That's the intuition behind why you'd want to do something like SWIM update propagation rather than Raft for a global state propagation scheme.
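
What makes gossip safe for this particular problem is that the merge rule is trivial: each host is the sole authority for "what am I running", stamped with its own version counter, so reconciling two gossiped views is per-origin last-writer-wins and never needs a quorum to arbitrate. A hypothetical sketch of that rule (invented types, not any real library's API):

    // Hypothetical merge rule with invented types; not any real library.
    package gossipstate

    // HostState is one host's self-reported view of what it is running.
    type HostState struct {
        Host      string
        Version   uint64          // bumped by the host itself on every change
        Instances map[string]bool // app instance ID -> running?
    }

    // Merge folds a batch of gossiped states into a local view, keeping
    // only the newest report from each origin host.
    func Merge(view map[string]HostState, incoming []HostState) {
        for _, s := range incoming {
            if cur, ok := view[s.Host]; !ok || s.Version > cur.Version {
                view[s.Host] = s // ORD's word about ORD beats anything SJC says
            }
        }
    }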

But if you're just doing service discovery for a well-bounded set of applications (like you would be if you were running engineering for a single large company and their internal apps), Raft gives you some handy tools you might reasonably take advantage of --- a key-value store, for instance. You're mostly in a single data center anyways, so you don't have the long-distance-Entmoot problem. And HashiCorp's tools will federate out across multiple data centers; the constraints you inherit by doing that federation mostly don't matter for a single company's engineering, but they're extremely painful if you're servicing an unbounded set of customer applications and providing each of them a single global picture of their deployments.
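
For that bounded, single-company case the payoff is concrete: within one datacenter the Raft quorum hands you a strongly consistent key-value store essentially for free. A minimal sketch using the official Consul Go client (github.com/hashicorp/consul/api) as I understand its KV API, with invented key names:

    // Minimal sketch of Consul's KV API via the official Go client, as I
    // understand it; key names are invented for illustration.
    package main

    import (
        "fmt"
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        // Talks to the local Consul agent (default http://127.0.0.1:8500).
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }
        kv := client.KV()

        // Write: the datacenter's Raft quorum orders and replicates this.
        _, err = kv.Put(&api.KVPair{Key: "apps/app-4839/desired", Value: []byte("3")}, nil)
        if err != nil {
            log.Fatal(err)
        }

        // Read it back.
        pair, _, err := kv.Get("apps/app-4839/desired", nil)
        if err != nil {
            log.Fatal(err)
        }
        if pair != nil {
            fmt.Printf("%s = %s\n", pair.Key, pair.Value)
        }
    }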

Or we're just holding it wrong. Also a possibility.

replies(1): >>35049938 #
1. dastbe No.35049938
this doesn't paint a full picture of your options, as there's nothing that stops you from having zonal/regional consensus and then replication across regions/long-range topologies for global distribution.
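
as a hedged sketch of that shape (invented interfaces, not any real library): writes commit through a consensus group whose members all live in one region, and committed entries then get shipped asynchronously to peer regions, so you pay local commit latency but still get global distribution.

    // hypothetical shape only; these interfaces are not a real library.
    package regional

    // Entry is one committed state change, ordered within its home region.
    type Entry struct {
        Region string
        Index  uint64
        Data   []byte
    }

    // RegionalConsensus is the per-region quorum (e.g. a Raft group whose
    // members all sit in one region, so commit latency stays local).
    type RegionalConsensus interface {
        Propose(data []byte) (Entry, error) // blocks until committed locally
        Committed() <-chan Entry            // stream of locally committed entries
    }

    // Replicator pushes committed entries to other regions with no
    // cross-region quorum on the write path.
    type Replicator interface {
        Ship(e Entry) error
    }

    // Pump forwards everything the local quorum commits out to peer
    // regions. Error handling and retries are elided.
    func Pump(c RegionalConsensus, peers []Replicator) {
        for e := range c.Committed() {
            for _, p := range peers {
                go p.Ship(e) // async: global distribution, local commit latency
            }
        }
    }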

to be pithy about it, going full-bore gossip protocol is like going full-bore blockchain: it solves a problem, introduces a lot of much more painful problems, and the original problem would've been solved much more neatly with a little bit of centralization.

replies(1): >>35051105 #
2. tptacek No.35051105
I don't disagree that there are opportunities to introduce topology. I do disagree that there are opportunities to benefit from distributed consensus. If a server in ORD is down, it doesn't matter what some server in SJC says it's hosting; all the ORD instances of all the apps on that server are down. If that same ORD server is up, it doesn't matter what any server says it's running; it's authoritative for what it's running.

Of course, OSPF has topology and aggregation, too.

At any rate: I didn't design the system we're talking about.

replies(1): >>35051629 #
3. dastbe No.35051629
> I do disagree that there are opportunities to benefit from distributed consensus

there are some benefits around static stability and grey failure, but sure, whatever. the important bit is to have clear paths of aggregation and dissemination in your system.

that being said

> it doesn't matter what some server in SJC says it's hosting

it kind of does matter, doesn't it? assuming that server in SJC is your forwarding proxy that does your global load balancing, what that server is aware of is highly relevant to what global actions you can take safely.

replies(1): >>35051718 #
4. tptacek No.35051718
My point is just that there isn't a consensus algorithm that needs to get run to know which of the two proposals to accept.
replies(1): >>35052147 #
5. injinj No.35052147
It doesn't need a Raft consensus algorithm, but Corrosion does converge to a consensus, doesn't it? In the OSPF example, that does need to converge to a state that is consistent and replicated on all the routers, otherwise loops and drops will occur. I'm curious whether any convergence benchmark has been done that compares Raft to Corrosion.