Using Erlang hot code updates

(underjord.io)


jhgg ◴[19 Nov 24 23:43 UTC] No.42189283[source]▶

When I worked at Discord, we used BEAM hot code loading pretty extensively, and built a bunch of tooling around it to apply and track hot-patches to nodes (which in turn could update the code on >100M processes in the system). It allowed us to deploy hot-fixes in minutes (a full-tilt deploy could complete in a matter of seconds) to our stateful real-time system, rather than the usual ~hour-long deploy cycle. We generally only used it for "emergency" updates though.

The tooling would let us patch multiple modules at a time. It basically wrapped `:rpc.call/4` and `Code.eval_string/1` to propagate the update across the cluster, which is to say the hot-patch was deployed entirely over Erlang's built-in distribution.
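A minimal sketch of what a wrapper along those lines might look like, assuming the patch is shipped as Elixir source text. `HotPatchSketch` and `apply_patch/1` are hypothetical names for illustration, not Discord's actual tooling; only `:rpc.call/4`, `Code.eval_string/1`, `Node.self/0` and `Node.list/0` are real calls.

```elixir
defmodule HotPatchSketch do
  # Hypothetical helper, not the tooling described above.
  # `source` is Elixir source text that defines one or more modules.
  def apply_patch(source) when is_binary(source) do
    # Node.list/0 returns the nodes currently connected over Erlang
    # distribution; the local node is added so it is patched too.
    for node <- [Node.self() | Node.list()] do
      # :rpc.call/4 runs Code.eval_string/1 on the target node. Evaluating
      # a `defmodule` there compiles and loads the new module version.
      {node, :rpc.call(node, Code, :eval_string, [source])}
    end
  end
end
```

Calling `HotPatchSketch.apply_patch/1` with the source of a `defmodule` returns one `{node, result}` pair per node, where a failed call shows up as `{:badrpc, reason}`. Real tooling would presumably also track which patches were applied where, which this sketch leaves out.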

replies(2): >>42189462 #>>42191479 #

stouset ◴[20 Nov 24 07:17 UTC] No.42191479[source]▶

>>42189283 #

Can someone explain how this is not genuinely terrifying from a security perspective?

replies(3): >>42191535 #>>42191565 #>>42192955 #

1. nelsonic ◴[20 Nov 24 07:26 UTC] No.42191535[source]▶

>>42191479 #

Where is the security problem? All code commits and builds can still be signed. All of this is just a more efficient way of deploying changes without dropping existing connections.

Are you suggesting that hot code replacement is somehow an attack vector? Ericsson has been using this method for decades on critical infrastructure to patch switches without dropping live calls/connections. It works.

No need to fear Erlang/BEAM.

replies(1): >>42191567 #

2. stouset ◴[20 Nov 24 07:33 UTC] No.42191567[source]▶

>>42191535 (TP) #

My interpretation of the GP was that a code change in one node can be automagically propagated out to a cluster of participating Erlang nodes.

As a security person, this seems inherently dangerous. I asked why it is safe because I presume I’m missing something, given that I've never heard of exploitation in the wild.

replies(2): >>42192109 #>>42197220 #

3. badpenny ◴[20 Nov 24 09:18 UTC] No.42192109[source]▶

>>42191567 #

Why is it any more dangerous than a conventional update, which also needs to be propagated?

replies(1): >>42195485 #

4. stouset ◴[20 Nov 24 16:25 UTC] No.42195485{3}[source]▶

>>42192109 #

A conventional update takes place out of band.

If someone were to exploit a running Erlang process, the description of this feature sounds to me like they would have access to code paths that allow pushing new code to other Erlang processes on cooperating nodes.

replies(1): >>42200275 #

5. toast0 ◴[20 Nov 24 19:22 UTC] No.42197220[source]▶

>>42191567 #

An Erlang dist cluster has no barriers between connected nodes. But a multithreaded application has no barriers between its threads either.

If someone can exploit one Erlang node, they can easily take over the cluster. But in a more typical horizontally scaled system, usually if they can get into one node, they can get into all the other nodes running the same software the same way.

Security wise, I think of the whole cluster as one unit. There's no meaningful way to separate it, so it's just one thing. Best not to let anyone in who can't be trusted, because either they have access or they don't; there's no limited access.

But given that, you may as well push code updates over dist in a straightforward way: it's possible anyway, so it may as well be straightforward.

6. vermilingua ◴[21 Nov 24 01:56 UTC] No.42200275{4}[source]▶

>>42195485 #

Yes, but if they can exploit one process they can exploit any of the other nodes anyway, so there's nothing to be gained but a bit of convenience.
