1226 points bishopsmother | 24 comments

pyentropy ◴[] No.35048126[source]
Almost half of the issues are caused by their use of HashiCorp products.

As someone that has started tons of Consul clusters, analyzed tons of Terraform states, developed providers and wrote a HCL parser, I must say this:

HashiCorp built a brand of consistent design & docs, security, strict configuration, distributed-algos-made-approachable... but at its core, it's a very fragile ecosystem. The only benefit of HashiCorp headaches is that you will quickly learn Golang while reading some obscure github.com/hashicorp/blah/blah/file.go :)

replies(2): >>35048318 #>>35049109 #
1. tptacek ◴[] No.35048318[source]
We are asking HashiCorp products to do things they were not designed to do, in configurations that they don't expect to be deployed in. Take a step back, and the idea of a single global namespace bound up with Raft consistency for a fleet deployed in dozens of regions, providing near-real-time state propagation, is just not at all reasonable. Our state propagation needs are much closer to those of a routing protocol than a distributed key-value database.

I have only positive things to say about every HashiCorp product I've worked with since I got here.

replies(3): >>35048609 #>>35049327 #>>35050286 #
2. pyentropy ◴[] No.35048609[source]
I respect that. Can you elaborate a bit on the routing protocol thing? I assume you used WAN gossip?

I love the simplicity of fly.io & wish you all the best improving Fly's reliability!

replies(2): >>35048792 #>>35048795 #
3. tptacek ◴[] No.35048792[source]
If you've ever implemented IS-IS or OSPF before, like 80% of the work is "LSP flooding", which is just the process that gets updates about available links from one end of the network to another as fast as possible without drowning the links themselves in update messages. Flooding algorithms don't build consensus, unlike Raft quorums, which intrinsically have a centralized set of authorities that keep a single source of truth for all the valid updates.
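Roughly, in Go, a toy sketch of that flooding behavior (every type and name here is invented for illustration, not taken from any real router or from Consul): a node accepts an update only if it's newer than what it has already seen from that originator, records it, and re-floods it to its neighbors, so duplicates die out instead of drowning the links.

    package main

    import "fmt"

    // Update is a link-state-style advertisement: which node originated it,
    // a monotonically increasing sequence number, and an opaque payload.
    type Update struct {
        Origin  string
        Seq     uint64
        Payload string
    }

    // Node floods updates to its neighbors, keeping only the highest
    // sequence number seen per originator.
    type Node struct {
        Name      string
        Neighbors []*Node
        seen      map[string]uint64 // origin -> highest seq accepted
        State     map[string]string // origin -> latest payload
    }

    func NewNode(name string) *Node {
        return &Node{Name: name, seen: map[string]uint64{}, State: map[string]string{}}
    }

    // Flood accepts an update if it is newer than anything seen from that
    // origin, then re-floods it to every neighbor; stale or duplicate
    // updates are dropped, which is what keeps flooding cheap.
    func (n *Node) Flood(u Update) {
        if u.Seq <= n.seen[u.Origin] {
            return // already have this update or something newer
        }
        n.seen[u.Origin] = u.Seq
        n.State[u.Origin] = u.Payload
        for _, nb := range n.Neighbors {
            nb.Flood(u)
        }
    }

    func main() {
        a, b, c := NewNode("a"), NewNode("b"), NewNode("c")
        a.Neighbors = []*Node{b}
        b.Neighbors = []*Node{a, c}
        c.Neighbors = []*Node{b}

        a.Flood(Update{Origin: "a", Seq: 1, Payload: "link up"})
        fmt.Println(c.State["a"]) // "link up" reached c with no central authority involved
    }

There's no quorum anywhere in that: every node ends up holding every originator's latest advertisement, but nobody has to agree on an ordering first.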

An OSPF router uses those updates to build a forwarding table with a single-point shortest path first routine, but there's nothing to say that you couldn't instead use the same notion of publishing weighted advertisements of connectivity to, for instance, build a table to map incoming HTTP requests to backends that can field them.
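To make that concrete, a hypothetical continuation of the same sketch (again, the names are invented): given whatever weighted advertisements have been flooded in, picking a backend for an app is just a local scan for the cheapest host claiming to run it, the way SPF picks the cheapest route.

    package main

    import "fmt"

    // Advertisement is a hypothetical flooded record: host X says it can
    // serve app Y at some cost (load, distance, link weight, whatever).
    type Advertisement struct {
        Host   string
        App    string
        Weight int
    }

    // PickBackend scans the locally known advertisements and returns the
    // lowest-weight host claiming to run the app, much as SPF picks the
    // cheapest route from flooded link-state data.
    func PickBackend(ads []Advertisement, app string) (string, bool) {
        best, bestWeight, found := "", 0, false
        for _, ad := range ads {
            if ad.App != app {
                continue
            }
            if !found || ad.Weight < bestWeight {
                best, bestWeight, found = ad.Host, ad.Weight, true
            }
        }
        return best, found
    }

    func main() {
        ads := []Advertisement{
            {Host: "worker-ord-1", App: "app-4839", Weight: 12},
            {Host: "worker-maa-7", App: "app-4839", Weight: 40},
        }
        host, ok := PickBackend(ads, "app-4839")
        fmt.Println(host, ok) // worker-ord-1 true
    }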

The point is, if you're going to do distributed consensus, you've got a dilemma: either you're going to have the Ents moot in a single forest, close together, and round trip updates from across the globe in and out of that forest (painfully slow to get things in and out of the cluster), or you're going to try to have them moot long distance (painfully slow to have the cluster converge). The other thing you can do, though, is just sidestep this: we really don't have the Raft problem at all, in that different hosts on our network do not disagree with each other about whether they're running particular apps; if worker-sfu-ord-1934 says it's running an instance of app-4839, I pretty much don't give a shit if worker-sfu-maa-382a says otherwise; I can just take ORD's word for it.
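That property is easy to state in code. A toy version (hypothetical types; the worker names are the ones above): each worker is the sole authority for its own entry, so merging gossiped reports is per-origin last-write-wins and never needs a quorum.

    package main

    import "fmt"

    // Report is one worker's own account of which app instances it runs,
    // with a version so that worker's newer reports replace its older ones.
    type Report struct {
        Worker    string
        Version   uint64
        Instances []string
    }

    // Catalog is a node's view of the fleet. Merging needs no consensus:
    // for any worker's entry, that worker's latest word wins, and what
    // anyone else claims about it is ignored.
    type Catalog map[string]Report

    func (c Catalog) Merge(r Report) {
        if cur, ok := c[r.Worker]; ok && cur.Version >= r.Version {
            return // we already have that worker's latest word
        }
        c[r.Worker] = r
    }

    func main() {
        view := Catalog{}
        // worker-sfu-ord-1934 says it's running an instance of app-4839...
        view.Merge(Report{Worker: "worker-sfu-ord-1934", Version: 7, Instances: []string{"app-4839"}})
        // ...and an older, conflicting rumor relayed via worker-sfu-maa-382a is simply dropped.
        view.Merge(Report{Worker: "worker-sfu-ord-1934", Version: 3, Instances: nil})
        fmt.Println(view["worker-sfu-ord-1934"].Instances) // [app-4839]
    }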

That's the intuition behind why you'd want to do something like SWIM update propagation rather than Raft for a global state propagation scheme.

But if you're just doing service discovery for a well-bounded set of applications (like you would be if you were running engineering for a single large company and their internal apps), Raft gives you some handy tools you might reasonably take advantage of --- a key-value store, for instance. You're mostly in a single data center anyways, so you don't have the long-distance-Entmoot problem. And HashiCorp's tools will federate out across multiple data centers; the constraints you inherit by doing that federation mostly don't matter for a single company's engineering, but they're extremely painful if you're servicing an unbounded set of customer applications and providing each of them a single global picture of their deployments.
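For that single-company case, the kind of thing the Raft-backed store buys you looks roughly like this, using the github.com/hashicorp/consul/api Go client against a local agent (a sketch: the key and service names are made up, and error handling is minimal):

    package main

    import (
        "fmt"
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        // Talk to the local Consul agent; the servers behind it form a
        // single Raft quorum, typically within one data center.
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        // Raft buys you a consistent key-value store...
        kv := client.KV()
        if _, err := kv.Put(&api.KVPair{Key: "config/billing/db-url", Value: []byte("postgres://...")}, nil); err != nil {
            log.Fatal(err)
        }
        pair, _, err := kv.Get("config/billing/db-url", nil)
        if err != nil {
            log.Fatal(err)
        }
        if pair != nil {
            fmt.Println(string(pair.Value))
        }

        // ...and a service catalog for a well-bounded set of internal apps.
        reg := &api.AgentServiceRegistration{Name: "billing", Port: 8080}
        if err := client.Agent().ServiceRegister(reg); err != nil {
            log.Fatal(err)
        }
    }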

Or we're just holding it wrong. Also a possibility.

replies(1): >>35049938 #
4. ◴[] No.35048795[source]
5. otterley ◴[] No.35049327[source]
Well, why did you do that? If you’d asked them whether this was a supported configuration or intended purpose, they’d have said no; and anyone who had experience deploying Consul at large scale would have told you the same.

There is truly no compression algorithm for experience.

replies(2): >>35049708 #>>35055005 #
6. mixmastamyk ◴[] No.35049708[source]
I don't think he personally designed the first implementation. But in any case, understanding of complex topics comes in waves.

Many times I've had to read all the docs then use a system for several months before the epiphany hits me.

replies(2): >>35052309 #>>35055742 #
7. dastbe ◴[] No.35049938{3}[source]
this doesn't paint a full picture of your options, as there's nothing that stops you from having zonal/regional consensus and then replication across regions/long-range topologies for global distribution.
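sketched very loosely (the consensus machinery is stubbed out and the types are made up): each region runs its own quorum and commits locally, then ships committed entries to other regions asynchronously, so nothing ever waits on a global round trip.

    package main

    import "fmt"

    // RegionalStore stands in for a per-region consensus group (say, a
    // Raft cluster scoped to one region). Commit is assumed to return
    // only once the regional quorum has accepted the entry.
    type RegionalStore struct {
        Region string
        Log    []string
    }

    func (s *RegionalStore) Commit(entry string) {
        s.Log = append(s.Log, entry) // placeholder for a real quorum write
    }

    // Replicate ships already-committed entries to other regions after
    // the fact; cross-region readers see them eventually, but no global
    // quorum is ever required.
    func Replicate(from *RegionalStore, to []*RegionalStore) {
        for _, dst := range to {
            dst.Log = append(dst.Log, from.Log...)
        }
    }

    func main() {
        ord := &RegionalStore{Region: "ord"}
        sjc := &RegionalStore{Region: "sjc"}

        ord.Commit("app-4839 placed on worker-ord-1") // strong within the region
        Replicate(ord, []*RegionalStore{sjc})         // eventual across regions

        fmt.Println(sjc.Log)
    }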

to be pithy about it, going full-bore gossip protocol is like going full-bore blockchain: solves a problem, introduces a lot of much more painful problems, and would've been solved much more neatly with a little bit of centralization.

replies(1): >>35051105 #
8. pcthrowaway ◴[] No.35050286[source]
Are there any plans to make Corrosion open source? Or are you able to talk at all about the technologies/patterns used to create it? I feel like service discovery is still ripe for disruption
replies(1): >>35050290 #
9. tptacek ◴[] No.35050290[source]
Yeah, we'll for sure talk about it more some other time. Mostly today we want to talk about how we were sucking ass at customer comms.
replies(2): >>35052066 #>>35052539 #
10. tptacek ◴[] No.35051105{4}[source]
I don't disagree that there are opportunities to introduce topology. I do disagree that there are opportunities to benefit from distributed consensus. If a server in ORD is down, it doesn't matter what some server in SJC says it's hosting; all the ORD instances of all the apps on that server are down. If that same ORD server is up, it doesn't matter what any server says it's running; it's authoritative for what it's running.

Of course, OSPF has topology and aggregation, too.

At any rate: I didn't design the system we're talking about.

replies(1): >>35051629 #
11. dastbe ◴[] No.35051629{5}[source]
> I do disagree that there are opportunities to benefit from distributed consensus

there are some benefits to static stability and grey failure, but sure, whatever. the important bit is to have clear paths of aggregation and dissemination in your system.

that being said

> it doesn't matter what some server in SJC says it's hosting

it kind of does matter, doesn't it? assuming that server in SJC is your forwarding proxy that does your global load balancing, what that server is aware of is highly relevant to what global actions you can take safely.

replies(1): >>35051718 #
12. tptacek ◴[] No.35051718{6}[source]
My point is just that there isn't a consensus algorithm that needs to get run to know which of the two proposals to accept.
replies(1): >>35052147 #
13. pcthrowaway ◴[] No.35052066{3}[source]
Definitely looking forward to it!
14. injinj ◴[] No.35052147{7}[source]
It doesn't need a Raft consensus algorithm, but Corrosion does converge to a consensus, doesn't it? In the OSPF example, that does need to converge to a state that is consistent and replicated on all the routers, otherwise loops and drops will occur. I'm curious whether any convergence benchmark has been done that compares Raft to Corrosion.
15. otterley ◴[] No.35052309{3}[source]
I also think there’s this tendency in the industry to want to solve problems on your own, without help from outsiders, even if they know the problem space better than you do, and even if they’d gladly help (often for free) if asked. It’s especially worrisome when it’s powering a key workload that is essential to the functioning of your business. Sometimes it’s because you might not know whom to consult or recruit, but in this case, the vendor was known.
16. wferrell ◴[] No.35052539{3}[source]
Can you please write a blog post or book with "Sucking ass at customer comms" as the title ;)

I say this as someone that loves fly :)

17. bovermyer ◴[] No.35055005[source]
This feels unnecessarily antagonistic. "If you were experienced, you would have made the right decision, _obviously_."

Did Fly.io kick your puppy or something?

replies(1): >>35055735 #
18. otterley ◴[] No.35055735{3}[source]
I can see how it would be interpreted that way, and I apologize if it came across as such, but that wasn’t my intent. See my other comment below. What I’m really saying is that we need to be better about engaging subject matter experts early on when we are selecting technologies to power core business functions; and I think it’s a good illustration of why we need to continue to hire experienced people at startups.
replies(1): >>35061294 #
19. JeremyNT ◴[] No.35055742{3}[source]
This is especially true for scaling. A solution that works great for your current deployment may be completely unworkable for 2x your current deployment.

You just won't know until you fall off the cliff. The armchair quarterback can opine that you should have just hired experts in XYZ domains from the start to design robust systems that can scale to arbitrary sizes, but most orgs don't need to scale to arbitrary sizes so this is highly likely to be wasted effort.

replies(1): >>35055829 #
20. otterley ◴[] No.35055829{4}[source]
While I largely agree with you, this isn’t one of those cases. If Fly wasn’t supposed to scale in due course to this size, it probably wouldn’t have been funded. If your business model is predicated on you scaling, yes, you should hire appropriately in anticipation of that.

Besides, I’m not even necessarily talking about hiring here - even consulting would have been sufficient to avoid this catastrophe.

replies(1): >>35073725 #
21. bovermyer ◴[] No.35061294{4}[source]
That's a fair point... but at the same time, we shouldn't hold off on starting something just because we don't have perfect information.
22. mixmastamyk ◴[] No.35073725{5}[source]
Yes, although it's rarely possible to know which bottlenecks will hurt the most up front. Unless you've done the same thing before, which is not the case with anyone pushing boundaries.

Basically this is an argument about so-called premature optimization. Good to have issues now, while the customers are mostly enthusiasts. My guess is this bump will be forgotten in five years. And it's not like AWS et al. don't have occasional outages that they learn from.

replies(1): >>35110742 #
23. otterley ◴[] No.35110742{6}[source]
Consul has been around for close to 9 years now, and people have in fact tried to use Consul in the very same way Fly did, in many different businesses and industries, with similarly failing outcomes. HashiCorp knows this and almost certainly would have counseled against it if asked.
replies(1): >>35139705 #
24. mixmastamyk ◴[] No.35139705{7}[source]
Insert Donald Rumsfeld quote about un/known un/knowns.