169 points hunvreus | 3 comments
mystraline ◴[] No.43654294[source]
Different proposal:

Let's say we have 2 Linux machines. Identical hardware, identical libs.

I'd like to run a simple program on one machine and then, mid-calculation, transfer the running program to the other machine.

Is this doable?

replies(5): >>43654408 #>>43654455 #>>43654466 #>>43654749 #>>43655094 #
1. toast0 ◴[] No.43654749[source]
A search for 'linux process live migration' turns up at least one repo that claims to have done it, and a bunch of other interesting things.

For a very simple program, with limited I/O, it's not too hard; especially if you don't mind a significant pause to move. Difficulty comes when you have FDs to migrate and if you need to reduce the pausing. If you need to keep FDs to the filesystem or the program will load/store to the filesystem periodically, you'd need to do a filesystem migration too... If you need to keep FDs for network sockets, you've got to transfer those somehow.

If it's just stdin/out/err, you could probably do the migration in userspace; keeping the pid constant adds some difficulty (but maybe you don't need that either).

Minimal pausing involves letting the program run on the initial machine while you copy memory, setting pages to read-only so you can catch writes, and only pausing execution on the initial machine once the copy is substantially finished. If there's still a significant number of modified pages to copy when you pause, you can start execution on the new machine anyway, as long as the modified pages are marked unavailable there. If the background copy delivers them before they're used, great; if not, the program has to block until the modified data comes through.

Probably you do this on two nearby machines with fast networking, and the program doesn't have a lot of writes all over memory, so the pause should be short.
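
The copy-while-running scheme described above can be sketched as a toy simulation. This is only an illustration, not how any real tool does it: memory is modelled as a dict of page number to value, write-protect faults are modelled as a per-round list of writes, and the names `precopy_migrate`, `writes_per_round`, and `stop_threshold` are made up for the example.

```python
def precopy_migrate(source, writes_per_round, stop_threshold=2, max_rounds=10):
    """Toy pre-copy migration.

    source:            dict page_no -> value, the running program's memory
    writes_per_round:  list of lists of (page_no, value); the writes the
                       program performs while each copy round is in flight
    Returns (dest, rounds, pages_copied_while_paused).
    """
    dest = {}
    dirty = set(source)            # first round: every page still needs copying
    writes = iter(writes_per_round)
    rounds = 0

    # Keep copying while the program runs, until the dirty set is small.
    while len(dirty) > stop_threshold and rounds < max_rounds:
        to_copy, dirty = dirty, set()
        for page in to_copy:
            dest[page] = source[page]
        # The program was still running during the copy. In a real system
        # the pages are read-only, so each write faults and is recorded.
        for page, value in next(writes, []):
            source[page] = value
            dirty.add(page)
        rounds += 1

    # Pause the program and copy whatever is still dirty.
    paused = len(dirty)
    for page in dirty:
        dest[page] = source[page]
    return dest, rounds, paused


memory = {page: 0 for page in range(8)}
dest, rounds, paused = precopy_migrate(
    memory, [[(0, 1), (1, 1), (2, 1)], [(0, 2)]]
)
print(rounds, paused, dest == memory)
```

The number of pages left for the paused phase is what determines the pause length, which is why a program that writes all over memory is the hard case. The variant where the new machine starts before the last pages arrive would replace the final loop with on-demand fetching.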

replies(2): >>43655532 #>>43656452 #
2. dilyevsky ◴[] No.43655532[source]
If you're talking about CRIU, then it's not just a claim; it actually does work well in production. I know Google was using it in prod on their internal systems, and probably many others do too. It can even migrate TCP connections for you via the socket repair API (TCP_REPAIR) in Linux.
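
For anyone who wants to try it, a rough sketch of what the invocation might look like, written as Python helpers that only build the command lines and do not run anything. The helper names, the pid, and the `/tmp/ckpt` directory are made up for the example; the flags are standard `criu` options, but check them against your installed version. CRIU typically needs root, and the image directory has to be copied to the second machine (e.g. with rsync) between the two steps.

```python
def criu_dump_cmd(pid, images_dir, shell_job=True, tcp=True):
    """Build the argv for checkpointing a process tree with `criu dump`."""
    cmd = ["criu", "dump", "-t", str(pid), "-D", images_dir]
    if shell_job:
        cmd.append("--shell-job")        # process was started from a terminal
    if tcp:
        cmd.append("--tcp-established")  # checkpoint live TCP connections
    return cmd


def criu_restore_cmd(images_dir, shell_job=True, tcp=True):
    """Build the argv for restoring from the same images with `criu restore`."""
    cmd = ["criu", "restore", "-D", images_dir]
    if shell_job:
        cmd.append("--shell-job")
    if tcp:
        cmd.append("--tcp-established")
    return cmd


# Hypothetical pid and directory, for illustration only.
print(" ".join(criu_dump_cmd(1234, "/tmp/ckpt")))
print(" ".join(criu_restore_cmd("/tmp/ckpt")))
```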
3. wang_li ◴[] No.43656452[source]
>...keep FDs for network sockets, you've got to transfer those somehow.

And if you have any shared memory segments, semaphores, or message queues, you have to drag along a bunch of other processes.