
267 points | lawik | 1 comment
toast0 ◴[] No.42188480[source]
> Both have described hot code updates as something that people should learn and use. I imagine Whatsapp’s initial engineering crew would agree. They did pretty well.

Yeah. Hot loading is clearly better than anything else when you've got a million clients connected and you want to make a code change. Of course, we didn't have any of these fancy 'release' tools; we just used GNU Make to rsync the code to prod and run erlc. Then you can grab a debug shell and run l(module). (We did write utilities to see what code was modified, and to provide the right incantations so we wouldn't load a module if doing so would kill processes.)
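For readers who haven't done this, here is a minimal sketch of what such a "see what changed, and refuse to load if it would kill processes" helper might look like. It is not the actual WhatsApp tooling; the module name `safe_reload` and its function names are made up for illustration. It relies only on the standard `code` module: `code:modified_modules/0` (OTP 20+), `code:soft_purge/1`, and `code:load_file/1`. The difference from the shell's `l(module)` is that `l/1` does a hard purge, which kills any process still running the old version, while `soft_purge` just returns false in that case.

```erlang
%% Hypothetical helper module, assumed for illustration only.
-module(safe_reload).
-export([modified/0, reload/1, reload_all/0]).

%% Loaded modules whose .beam file on disk differs from the loaded code.
modified() ->
    code:modified_modules().

%% Reload one module, but refuse if any process still runs its old code.
reload(Module) ->
    case code:soft_purge(Module) of
        true ->
            %% No process depends on old code, so loading cannot kill anything.
            case code:load_file(Module) of
                {module, Module} -> ok;
                {error, Reason} -> {error, Reason}
            end;
        false ->
            %% A hard purge (what l/1 does) would kill those processes.
            {error, old_code_in_use}
    end.

%% Try to reload everything that changed on disk; report per-module results.
reload_all() ->
    [{M, reload(M)} || M <- modified()].
```

From a debug shell, `safe_reload:modified().` lists what would be reloaded, and `safe_reload:reload_all().` returns a list of `{Module, Result}` pairs so you can see which modules were skipped because old code was still in use.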

replies(1): >>42188817 #
rybosome ◴[] No.42188817[source]
> Hot loading is clearly better than anything else when you've got a million clients connected and you want to make a code change.

In the contexts in which I’ve worked, this was solved by issuing a command to the server to enter a lame-duck mode and stop accepting new connections, then restarting the process with updated code after all existing connections ended.

This worked in our case because connections had a TTL with a “reasonable” limit; it couldn’t have been more than an hour. We could always wait it out.

I suppose hot reloading is more necessary when you have connections without a set TTL.
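The lame-duck flow described above can be sketched as a small state tracker. This is only an illustration under assumed names (the module `lame_duck` and all of its functions are hypothetical), written as pure functions over a map so that the logic is easy to follow; a real server would keep this state in a process and hook it into its listener.

```erlang
%% Hypothetical sketch of lame-duck draining; names are illustrative only.
-module(lame_duck).
-export([new/0, accept/1, close/1, drain/1, can_restart/1]).

%% Start in normal serving mode with no open connections.
new() ->
    #{mode => serving, conns => 0}.

%% New connections are accepted only while serving.
accept(#{mode := serving, conns := N} = S) ->
    {ok, S#{conns := N + 1}};
accept(#{mode := draining} = S) ->
    {rejected, S}.

%% An existing connection ended (for example, its TTL expired).
close(#{conns := N} = S) when N > 0 ->
    S#{conns := N - 1}.

%% Operator command: enter lame-duck mode and stop accepting connections.
drain(S) ->
    S#{mode := draining}.

%% Safe to restart with new code once draining and nothing is connected.
can_restart(#{mode := draining, conns := 0}) -> true;
can_restart(_) -> false.
```

With a connection TTL, `can_restart/1` is guaranteed to become true within one TTL of calling `drain/1`; without a TTL there is no such bound, which is where hot loading becomes more attractive.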

replies(1): >>42189144 #
1. toast0 ◴[] No.42189144[source]
That way works, but it means you're spending that much more time on a deploy.

For a small change, you can hot load your change and be done in minutes. This means you can push several small changes in an hour, which helps get things done rapidly.

It's also really nice to be able to see that the change works as expected (or not) at full load right away. If you've got to wait for connections to accumulate on the new server, that takes longer without hot load too.

Some changes can't be effectively hot loaded[1], and for those you do need to do something to kick out users and let them reconnect elsewhere, and you could do all your updates that way, but it means a lot more client time spent reconnecting.

On the one-hour TTL: sometimes that's reasonable, but sometimes it's really not. Someone downloading a large file on a slow connection is better served by letting the download continue to trickle for hours than by forcing them to reconnect and resume. A real-time call is better served by letting it run until the participants are done. For someone on a low-end phone, staying connected for as long as they can is probably better than forcing a reconnect where they'll need to generate new ephemeral keys and do a key exchange.

[1] At the very least, BEAM updates and kernel changes are much more easily done by restarting. But not all userspace Erlang changes are easy to make hot loadable, either.