We clone a running VM in 2 seconds (2022)

1. londons_explore ◴[11 Apr 25 14:08 UTC] No.43653973[source]▶

Unmentioned: there are serious security issues with cloning the memory of code that wasn't designed for it.

For example, an SSL library might have pre-calculated the random nonce for the next incoming SSL connection.

If you clone the VM containing a process using that library, both child VMs will now use the same nonce. Some crypto is 100% broken open if a nonce is reused.
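To make the nonce-reuse failure concrete, here is a minimal sketch using a toy stream cipher built from SHA-256 in counter mode. The cipher, the key, the nonce, and the messages are all assumptions made up for illustration; no real SSL library works exactly like this. What carries over is the property of any XOR-based stream construction: two messages encrypted under the same key and nonce leak the XOR of their plaintexts.

```python
import hashlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce and a block counter."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    """XOR the plaintext with the keystream (toy cipher, illustration only)."""
    return xor(plaintext, keystream(key, nonce, len(plaintext)))


key = b"k" * 32
nonce = b"n" * 12  # the pre-calculated value both clones inherited

p1 = b"transfer 100 to alice"  # sent by clone A
p2 = b"transfer 999 to carol"  # sent by clone B, same length as p1

c1 = encrypt(key, nonce, p1)
c2 = encrypt(key, nonce, p2)

# An eavesdropper who sees only the ciphertexts learns p1 XOR p2,
# because the identical keystream cancels out.
leak = xor(c1, c2)

# If the eavesdropper knows or guesses one plaintext, the other follows.
recovered = xor(leak, p1)
```

With distinct nonces the two keystreams would differ and the XOR of the ciphertexts would reveal nothing useful.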

replies(7): >>43654026 #>>43654396 #>>43654513 #>>43654702 #>>43654894 #>>43655157 #>>43657321 #

2. generalizations ◴[11 Apr 25 14:13 UTC] No.43654026[source]▶

>>43653973 (TP) #

Sounds like it would simply be inappropriate to clone & use a VM that's assuming its data is unique. This would also be true of other conditions, e.g. if you needed to spoof a MAC or IPv6 address & picked one randomly.

replies(1): >>43654077 #

3. londons_explore ◴[11 Apr 25 14:17 UTC] No.43654077[source]▶

>>43654026 #

The problem is modern software is so fiendishly complicated there almost certainly is stuff like that in the code. The question is where, and does it matter?

replies(1): >>43654228 #

4. generalizations ◴[11 Apr 25 14:32 UTC] No.43654228{3}[source]▶

>>43654077 #

And the last question is, can the parts with stuff like that be extracted from the rest and run separately?

5. hypeatei ◴[11 Apr 25 14:46 UTC] No.43654396[source]▶

>>43653973 (TP) #

> might have pre-calculated the random nonce

Isn't this still a concern even if you're not pre-calculating way ahead of time? If you generate it when needed, it could still catch you at the wrong time (e.g. right before encryption, but right after nonce generation)

replies(1): >>43654654 #

6. sunshinekitty ◴[11 Apr 25 14:56 UTC] No.43654513[source]▶

>>43653973 (TP) #

GCP’s ‘live migrations’ have been doing this for close to a decade or more. Must not be that big of a problem.

replies(2): >>43654524 #>>43657289 #

7. londons_explore ◴[11 Apr 25 14:57 UTC] No.43654524[source]▶

>>43654513 #

It isn't a problem if you guarantee only one child of the clone lives on - which GCP does.

replies(1): >>43654845 #

8. zamadatix ◴[11 Apr 25 15:07 UTC] No.43654654[source]▶

>>43654396 #

Unless your encryption and transport protocols are 100% stateless, only 1 connection will actually be able to form, even if you duplicate the machine during connection creation.

The problem with pre-computing a bunch and keeping them in memory is that brand new connections made after cloning would use the same list of nonces.
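A rough simulation of that difference, with a hypothetical `Vm` class standing in for a process's memory and `copy.deepcopy` standing in for the snapshot. Both are assumptions for illustration; a real clone copies RAM, not Python objects.

```python
import copy
import os


class Vm:
    """Stand-in for process memory inside a VM (hypothetical)."""

    def __init__(self, precompute: int):
        # Nonces generated ahead of time and kept in memory.
        self.pool = [os.urandom(12) for _ in range(precompute)]

    def next_nonce(self) -> bytes:
        # Serve from the pre-computed pool first, then generate on demand.
        if self.pool:
            return self.pool.pop(0)
        return os.urandom(12)


parent = Vm(precompute=3)

# "Clone" the VM twice; each copy carries the same pre-computed pool.
clone_a = copy.deepcopy(parent)
clone_b = copy.deepcopy(parent)

# The first three new connections on each clone reuse identical nonces.
shared = [clone_a.next_nonce() == clone_b.next_nonce() for _ in range(3)]

# Once the pool is drained, nonces come from os.urandom and diverge.
fresh_matches = clone_a.next_nonce() == clone_b.next_nonce()
```

Note that in this sketch `os.urandom` is served by the host kernel, which is not part of the copy. In a real cloned VM the on-demand path is only safe if the guest's own RNG state is refreshed after the clone.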

9. hedora ◴[11 Apr 25 15:10 UTC] No.43654702[source]▶

>>43653973 (TP) #

I was about to say you were being paranoid, then I read the article. It hadn’t occurred to me that anyone would be so reckless!

The proposed workflow involves cloning your dev environment and sharing it with the internet.

At most places, that’s equivalent to publishing your production keys, or at least github credentials.

Even for open source projects where confidentiality doesn’t matter, there are issues like using cargo/npm/etc keys to launch supply chain attacks.

Your nonce attack is harder to pull off, but more devastating if the attacker can man in the middle things like dependency downloads.

10. matt-p ◴[11 Apr 25 15:20 UTC] No.43654845{3}[source]▶

>>43654524 #

How do we know that isn't enforced here too?

replies(1): >>43655491 #

11. perching_aix ◴[11 Apr 25 15:23 UTC] No.43654894[source]▶

>>43653973 (TP) #

I don't really follow, what's the issue with that? The two nodes will encrypt using the same key, so they can snoop at each other's traffic that they send out? Doesn't sound that big of a deal per se.

replies(2): >>43655173 #>>43655673 #

12. CompuIves ◴[11 Apr 25 15:45 UTC] No.43655157[source]▶

>>43653973 (TP) #

Yes, that's right. The Firecracker team has written a fantastic doc about this as well: https://github.com/firecracker-microvm/firecracker/blob/main....

It's important to refresh entropy immediately after clone. Still, there can be code that didn't assume it could be cloned (even though there's always been `fork`, of course). Because of this, we don't live clone across workspaces for unlisted/private sandboxes and limit the use case to dev envs where no secrets are stored.
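As a sketch of what "refresh entropy after clone" could look like in userspace: the class below keeps a generation number it last saw and reseeds when that number changes. The `CloneAwareRng` name, the `read_generation` callback, and the `host` dictionary are all hypothetical, and this is not taken from the linked Firecracker doc; it only illustrates the general idea of detecting a clone and discarding inherited RNG state.

```python
import copy
import hashlib
import os


class CloneAwareRng:
    """Toy userspace RNG that reseeds when a generation counter changes."""

    def __init__(self, read_generation):
        # read_generation is a hypothetical hook that returns a value
        # the hypervisor bumps whenever the VM is restored from a snapshot.
        self._read_generation = read_generation
        self._seen_generation = read_generation()
        self._state = os.urandom(32)

    def random_bytes(self, n: int = 16) -> bytes:
        """Return up to 32 bytes, reseeding first if a clone was detected."""
        current = self._read_generation()
        if current != self._seen_generation:
            # State was inherited from a snapshot: throw it away.
            self._state = os.urandom(32)
            self._seen_generation = current
        # Ratchet the state forward, then derive the output from it.
        self._state = hashlib.sha256(b"next" + self._state).digest()
        return hashlib.sha256(b"out" + self._state).digest()[:n]


host = {"generation": 1}  # stand-in for the hypervisor-provided counter
rng = CloneAwareRng(lambda: host["generation"])

# Two clones made from the same snapshot share identical RNG state.
clone_a = copy.deepcopy(rng)
clone_b = copy.deepcopy(rng)
same_without_signal = clone_a.random_bytes() == clone_b.random_bytes()

# After the generation is bumped, each clone reseeds independently.
host["generation"] = 2
differ_after_signal = clone_a.random_bytes() != clone_b.random_bytes()
```

This only protects code that asks the RNG again after the clone. Values already drawn and cached before the snapshot, such as a pre-calculated nonce, stay duplicated.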

13. Rygian ◴[11 Apr 25 15:46 UTC] No.43655173[source]▶

>>43654894 #

A nonce is not a key, it's a piece of random data that is meant to be used at most once.

If an attacker sees valid nonces on a VM, and knows of another VM sharing the same nonces, then your crypto on both* VMs becomes vulnerable to replay attacks.

*read: all

replies(2): >>43655417 #>>43656303 #

14. nodesocket ◴[11 Apr 25 16:08 UTC] No.43655417{3}[source]▶

>>43655173 #

How would a replay attack work in production assuming multiple VMs share a nonce?

replies(1): >>43655794 #

15. jsnell ◴[11 Apr 25 16:15 UTC] No.43655491{4}[source]▶

>>43654845 #

Because their main selling point is to run the copies concurrently with the original.

16. londons_explore ◴[11 Apr 25 16:30 UTC] No.43655673[source]▶

>>43654894 #

Reusing a nonce often allows the entire world to decrypt or MITM the data.

17. saagarjha ◴[11 Apr 25 16:40 UTC] No.43655794{4}[source]▶

>>43655417 #

You record the traffic going to one VM and send it to another, which will now accept it because the nonce is the same.
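One way this could play out, assuming a simple challenge-response login where the server hands out pre-computed random challenges and the client answers with an HMAC over the challenge. The `Server` class and the protocol are hypothetical and chosen only to illustrate the replay; `copy.deepcopy` stands in for the VM clone.

```python
import copy
import hashlib
import hmac
import os


class Server:
    """Toy challenge-response server with pre-computed challenges."""

    def __init__(self, key: bytes):
        self.key = key
        self.challenges = [os.urandom(16) for _ in range(4)]
        self.pending = None

    def issue_challenge(self) -> bytes:
        self.pending = self.challenges.pop(0)
        return self.pending

    def verify(self, response: bytes) -> bool:
        # Each challenge may be answered once on this server.
        if self.pending is None:
            return False
        expected = hmac.new(self.key, self.pending, hashlib.sha256).digest()
        self.pending = None
        return hmac.compare_digest(expected, response)


key = os.urandom(32)
original = Server(key)
clone = copy.deepcopy(original)  # the cloned VM shares the challenge list

# A legitimate client, which knows the key, logs in to the original VM.
challenge = original.issue_challenge()
recorded = hmac.new(key, challenge, hashlib.sha256).digest()
accepted_on_original = original.verify(recorded)

# The same server rejects the recorded response a second time.
replay_on_original = original.verify(recorded)

# The attacker never learns the key, only the recorded response.
# The clone issues the same challenge, so the old response is valid.
clone_challenge = clone.issue_challenge()
replay_on_clone = clone.verify(recorded)
```

The single-use check works per server, which is exactly what cloning defeats: each clone believes the challenge is still unused.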

18. trollied ◴[11 Apr 25 17:30 UTC] No.43656303{3}[source]▶

>>43655173 #

“Number ONCE”. NONCE. Indeed.

19. oceanplexian ◴[11 Apr 25 19:03 UTC] No.43657289[source]▶

>>43654513 #

Live migration on VMware was a thing before Google even had a cloud service.

replies(1): >>43657602 #

20. dietr1ch ◴[11 Apr 25 19:06 UTC] No.43657321[source]▶

>>43653973 (TP) #

A neat use case for cloning is not truly duplicating a machine, but moving it from a host that is about to go offline to another one.

There are caveats in the network though, as packets targeting the old address need to be re-routed until all connections target the new machine.

21. tanelpoder ◴[11 Apr 25 19:34 UTC] No.43657602{3}[source]▶

>>43657289 #

VMware even has a vSphere Fault Tolerance product that creates a "live shadow instance" of a VM that mirrors the primary virtual machine (with up to 4 vCPUs). So you can do a quick failover in case of an "immediate planned" failover case, but apparently even when the primary DB goes down. I guess this might work when some external system (like a storage array) goes down in the primary, you can just switch to the other VM (with latest memory/CPU state) and replay that I/O there and keep going... But if there's a hard crash of the primary, if it actually does work, then they must be doing lots of reasoning about internal state change ordering & external device side-effect (somewhat like Antithesis, but for a different purpose). Back in the day, they supported only uniprocessor VMs (with something called vLockstep) and later up to 4 vCPUs with something called Fast Checkpointing.

I've always wanted to test this out for fun, but by now 15 years have gone by and I've never got to it...

https://www.vmware.com/products/cloud-infrastructure/vsphere...

replies(1): >>43657915 #

22. umachin ◴[11 Apr 25 20:01 UTC] No.43657915{4}[source]▶

>>43657602 #

VMware has also had a patent on live VM cloning (called VMfork) for quite a few years now. I worked on the team that built related features. The feature itself was in the desktop product. https://blogs.vmware.com/euc/2016/02/horizon-7-view-instant-...

Live migration had some very cool demos. They would have an intensive workload such as a game playing and cause a crash and the VM would resume with 0 buffering.