Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

jacquesm ◴[31 Oct 25 20:43 UTC] No.45776483[source]▶

>>45774086 (OP) #

If any rust designers are lurking about here: what made you decide to go for the async design pattern instead of the actor pattern, which - to me at least - seems so much cleaner and so much harder to get wrong?

Ever since I started using Erlang it felt like I had finally found 'the right way'. Before then I did a lot of work with sockets and asynchronous worker threads, and even though that usually worked as advertised, it had a large number of really nasty pitfalls which the actor model seemed to - effortlessly - sidestep.

So I'm seriously wondering what the motivation was. I get why JS uses async: there isn't any other way there, and by the time they added async it was too late to change the fundamentals of the language to such a degree. But Rust was a clean slate.

replies(5): >>45776498 #>>45776569 #>>45776637 #>>45776798 #>>45777596 #

raggi ◴[31 Oct 25 20:53 UTC] No.45776569[source]▶

>>45776483 #

_an answer_ is performance - the necessity of creating copyable/copied messages for inter-actor communication everywhere in the program _can be_ expensive.

that said there are a lot of parts of a lot of programs where a fully inlined and shake optimized async state machine isn't so critical.

it's reasonable to want a mix, to use async which can be heavily compiler optimized for performance sensitive paths, and use higher level abstractions like actors, channels, single threaded tasks, etc for less sensitive areas.

replies(1): >>45776648 #

lll-o-lll ◴[31 Oct 25 21:01 UTC] No.45776648[source]▶

>>45776569 #

I’m not sure this is actually true? Do messages have to be copied?

replies(1): >>45776721 #

1. raggi ◴[31 Oct 25 21:08 UTC] No.45776721[source]▶

>>45776648 #

if you want your actors to be independent computation flows and they're in different coroutines or threads, then you need to arrange that the data source can not modify the data once it arrives at the destination, in order to be safe.

in a single threaded fully cooperative environment you could ensure this by implication of only one coroutine running at a time, removing data races, but retaining logical ones.

if you want to eradicate logical races, or have actual parallel computation, then the source data must be copied into the message, or the content of the message be wrapped in a lock or similar.

in almost all practical scenarios this means the data source copies data into messages.
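A minimal sketch of the two options just described, assuming OS threads and a `std::sync::mpsc` channel stand in for actors and their mailboxes (the function name `copy_vs_lock` is made up for illustration):

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Returns (what the receiver saw via a copied message,
///          what the owner saw after sharing behind a lock).
fn copy_vs_lock() -> (Vec<u32>, Vec<u32>) {
    // Option 1: copy the data into the message.
    let mut source = vec![1, 2, 3];
    let (tx, rx) = mpsc::channel::<Vec<u32>>();
    tx.send(source.clone()).expect("receiver dropped");
    // The sender keeps mutating its own copy; the message is unaffected.
    source.push(4);
    let copied = thread::spawn(move || rx.recv().expect("sender dropped"))
        .join()
        .expect("receiver thread panicked");

    // Option 2: share one allocation and guard it with a lock.
    let shared = Arc::new(Mutex::new(vec![1, 2, 3]));
    let for_actor = Arc::clone(&shared);
    thread::spawn(move || for_actor.lock().expect("lock poisoned").push(4))
        .join()
        .expect("writer thread panicked");
    let locked = shared.lock().expect("lock poisoned").clone();

    (copied, locked)
}

fn main() {
    let (copied, locked) = copy_vs_lock();
    println!("copied message: {:?}", copied);
    println!("shared behind lock: {:?}", locked);
}
```

The copied message is isolated from later changes by the sender, while the locked value is visible to both sides and every access pays for the lock.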

replies(3): >>45776860 #>>45776886 #>>45777223 #

2. gleenn ◴[31 Oct 25 21:24 UTC] No.45776860[source]▶

>>45776721 (TP) #

Isn't that something Rust is particularly good at, controlling the mutation of shared memory?

replies(1): >>45777113 #

3. vlovich123 ◴[31 Oct 25 21:26 UTC] No.45776886[source]▶

>>45776721 (TP) #

In Rust wouldn’t you just Send the data?

4. raggi ◴[31 Oct 25 21:52 UTC] No.45777113[source]▶

>>45776860 #

yes

5. sapiogram ◴[31 Oct 25 22:07 UTC] No.45777223[source]▶

>>45776721 (TP) #

Rust solves this at compile-time with move semantics, with no runtime overhead. This feature is arguably why Rust exists, it's really useful.
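A minimal sketch of that, assuming a `std::sync::mpsc` channel between two threads stands in for actor messaging (`send_without_copy` is a made-up name). Sending the `Vec` moves only its pointer, length, and capacity; the heap buffer stays where it is, and the sender can no longer touch it:

```rust
use std::sync::mpsc;
use std::thread;

/// Sends a heap-allocated buffer to another thread and reports whether the
/// receiver sees the same heap allocation (i.e. the bytes were not copied).
fn send_without_copy() -> bool {
    let payload: Vec<u8> = vec![7; 1024];
    let addr_before = payload.as_ptr() as usize;

    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    let worker = thread::spawn(move || {
        let received = rx.recv().expect("sender dropped");
        (received.as_ptr() as usize, received.len())
    });

    tx.send(payload).expect("receiver dropped");
    // `payload` has been moved into the channel; using it from here on
    // would be rejected at compile time.

    let (addr_after, len) = worker.join().expect("worker panicked");
    addr_before == addr_after && len == 1024
}

fn main() {
    println!("same heap allocation: {}", send_without_copy());
}
```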

replies(2): >>45777844 #>>45777909 #

6. raggi ◴[31 Oct 25 23:28 UTC] No.45777844[source]▶

>>45777223 #

if you can always move the data that's the sweet spot for async, you just pass it down the stack and nothing matters.

all of the complexity comes in when more than one part of the code is interested in the state at the same time, which is what this thread is about.

7. zorgmonkey ◴[31 Oct 25 23:39 UTC] No.45777909[source]▶

>>45777223 #

Rust moves are a memcpy where the source becomes effectively uninitialized after the move (that is to say, it is undefined to access it after the move). The copies are often optimized away by the compiler, but that isn't guaranteed.

This actually caused some issues with Rust in the kernel, because moving large structs could cause you to run out of the small amount of stack space available on kernel threads (they only allocate 8-16KB of stack, compared to a typical 8MB for a userspace thread). The pinned-init crate is how they ended up solving this [1].

[1] https://crates.io/crates/pinned-init
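A rough illustration of why the size matters, using only the standard library (the `Big` struct and `move_cost` helper are hypothetical, and this is not how pinned-init works): a by-value move may copy `size_of::<T>()` bytes, so a large inline struct is expensive to move while a `Box` of it is only a pointer.

```rust
use std::mem::size_of;

/// Hypothetical large payload: 64 KiB stored inline in the struct.
#[allow(dead_code)]
struct Big {
    bytes: [u8; 64 * 1024],
}

/// Upper bound on the bytes a by-value move of `T` may have to copy.
fn move_cost<T>() -> usize {
    size_of::<T>()
}

fn main() {
    // Moving `Big` by value can copy the whole array through the stack.
    println!("Big:      {} bytes", move_cost::<Big>());
    // Moving `Box<Big>` copies only the pointer to the heap allocation.
    println!("Box<Big>: {} bytes", move_cost::<Box<Big>>());
}
```

Note that `Box::new(Big { .. })` may still build the value on the stack before moving it to the heap, which is the in-place initialization problem the comment above refers to.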
