Let's say we have 2 Linux machines. Identical hardware, identical libs.
I'd like to run a simple program on one machine and then, mid-calculation, transfer the running program to the other machine.
Is this doable?
Instead of keeping it in RAM, the program's state could be saved in a DB, and the execution environment could resume from the previous state when restarted.
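A minimal sketch of that idea, assuming the "calculation" is a simple running sum and the DB is a local SQLite file (the table layout and the names `open_store`, `load_state`, `save_state` and `run` are made up for illustration). To move the work, you would copy the database file to the other machine and start the same program there.

```python
import sqlite3


def open_store(path):
    """Open (or create) the checkpoint database; it holds a single row."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoint ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), "
        "next_i INTEGER NOT NULL, total INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def load_state(conn):
    """Return (next_i, total), or a fresh state if nothing was saved yet."""
    row = conn.execute(
        "SELECT next_i, total FROM checkpoint WHERE id = 1"
    ).fetchone()
    return row if row else (0, 0)


def save_state(conn, next_i, total):
    """Overwrite the single checkpoint row and commit it."""
    conn.execute(
        "INSERT OR REPLACE INTO checkpoint (id, next_i, total) VALUES (1, ?, ?)",
        (next_i, total),
    )
    conn.commit()


def run(conn, limit, stop_at=None):
    """Sum the integers below `limit`, checkpointing after every step.

    `stop_at` only simulates the process being interrupted mid-calculation.
    """
    next_i, total = load_state(conn)
    while next_i < limit:
        if stop_at is not None and next_i >= stop_at:
            return None  # interrupted; the state is already in the DB
        total += next_i
        next_i += 1
        save_state(conn, next_i, total)
    return total
```

Committing after every step is slow but keeps the sketch simple; a real program would checkpoint less often and accept redoing a little work after a restart.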
1. Halt the process (SIGSTOP comes to mind)
2. Create a copy of the running program and /proc/$pid - which will also include memory and mmap details
3. Transfer everything to the other machine
4. Load memory, somehow spawn a new process with the info from /proc/$pid we saved, mmap the loaded memory into it
5. Continue the process on the new machine (SIGCONT)
Let me admit that I do not have the slightest clue how to achieve step 4. I wonder if a systemd namespace could make things easier.
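Steps 1 and 2 can at least be sketched from userspace. The following is a rough illustration only: it stops a process and lists the writable regions from /proc/$pid/maps as a first estimate of what would have to be transferred. Treating only writable regions as state is a simplifying assumption, the function names are invented for this example, and step 4 (rebuilding the process on the other side) is not attempted.

```python
import os
import signal


def parse_maps_line(line):
    """Parse one line of /proc/<pid>/maps into a dict.

    Format: address-range perms offset dev inode [pathname]
    """
    fields = line.split(None, 5)
    start, end = (int(part, 16) for part in fields[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": fields[1],
        "offset": int(fields[2], 16),
        "path": fields[5].strip() if len(fields) > 5 else "",
    }


def regions_to_copy(maps_text):
    """Return the writable regions (heap, stack, data segments, ...).

    Assumption: read-only file-backed regions can be re-mapped from the
    identical binaries and libs on the target, so they are skipped here.
    """
    regions = [parse_maps_line(l) for l in maps_text.splitlines() if l.strip()]
    return [r for r in regions if "w" in r["perms"]]


def freeze_and_list(pid):
    """Step 1: stop the process. Step 2 (partially): list what to copy."""
    os.kill(pid, signal.SIGSTOP)
    with open(f"/proc/{pid}/maps") as f:
        return regions_to_copy(f.read())
```

Summing `end - start` over the result gives an upper bound on the memory that would need to cross the network; registers, signal state and open files are not covered by this listing.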
For a very simple program with limited I/O, it's not too hard, especially if you don't mind a significant pause during the move. The difficulty comes when you have FDs to migrate or need to reduce the pause. If you need to keep FDs to the filesystem, or the program will load/store to the filesystem periodically, you'd need to do a filesystem migration too... If you need to keep FDs for network sockets, you've got to transfer those somehow.
If it's just stdin/stdout/stderr, you could probably do the migration in userspace, with some extra difficulty if you need to keep the PID constant (but maybe you don't need that either).
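To see which of those cases applies to a given process, you can look at what its FDs point to. This is a small sketch that reads the symlinks in /proc/$pid/fd and sorts them into rough categories; the category names and both function names are my own, not any standard classification.

```python
import os


def classify_fd_target(target):
    """Roughly classify the symlink target of a /proc/<pid>/fd entry."""
    if target.startswith("socket:"):
        return "socket"      # would need the connection moved somehow
    if target.startswith("pipe:"):
        return "pipe"        # the other end lives in another process
    if target.startswith("anon_inode:"):
        return "anon_inode"  # eventfd, epoll, timerfd, ...
    if target.startswith("/dev/"):
        return "device"      # terminals such as /dev/pts/0, /dev/null, ...
    if target.startswith("/"):
        return "file"        # needs the same file on the target machine
    return "other"


def list_fds(pid="self"):
    """Map fd number -> (target, category) for a process (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the fd was closed between listdir and readlink
        result[int(name)] = (target, classify_fd_target(target))
    return result
```

If `list_fds(pid)` shows only fds 0, 1 and 2 pointing at a terminal or a regular file, you are in the easy case described above.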
Minimal pausing involves letting the program keep running on the initial machine while you copy its memory, setting pages to read-only so you can catch writes, and only pausing the program once the copy is substantially finished. If there's still a significant amount of modified pages to copy over when you pause, you can start execution on the new machine anyway, as long as the modified pages are marked unavailable there. If the background copy delivers them before they're used, great; if not, the program has to block until the modified data comes through.
You'd probably do this between two nearby machines with fast networking, and if the program doesn't scatter writes all over memory, the pause should be short.
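The copy-while-running part can be illustrated with a toy simulation. Memory here is just a dict of page number to content, and the writes the program makes during each copy round are handed in as data; in a real implementation they would be caught through the read-only page protection. The function name, the `threshold` and `max_rounds` parameters are assumptions for this sketch, and the variant where the target starts running before all pages have arrived is not modelled.

```python
def precopy_migrate(source, writes_per_round, threshold=2, max_rounds=10):
    """Simulate iterative copying of memory pages while the program runs.

    source: dict page -> content (mutated by the simulated program)
    writes_per_round: list of dicts {page: new_content}, the writes the
        still-running program makes while each copy round is in flight
    Returns (dest, rounds, pages_copied_while_paused).
    """
    dest = {}
    dirty = set(source)  # first round: every page still has to be copied
    writes = iter(writes_per_round)
    rounds = 0

    # Keep copying in the background until few enough pages remain dirty.
    while len(dirty) > threshold and rounds < max_rounds:
        to_copy, dirty = dirty, set()
        for page in to_copy:
            dest[page] = source[page]
        # The program kept running during the copy: record what it wrote.
        for page, content in next(writes, {}).items():
            source[page] = content
            dirty.add(page)
        rounds += 1

    # Program is paused now: copy the remaining dirty pages, then resume
    # on the target. The pause length is proportional to len(dirty).
    for page in dirty:
        dest[page] = source[page]
    return dest, rounds, len(dirty)
```

With 8 pages, a first round during which 4 pages are rewritten and a second round during which 1 page is rewritten, only that single page has to be copied while the program is paused.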