Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)

431 points bcantrill | 1 comments | 31 Oct 25 16:49 UTC

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4


jcalvinowens ◴[01 Nov 25 05:00 UTC] No.45779372[source]▶

>>45774086 (OP) #

I have very little Rust experience... but I'm hung up on this:

> The lock is given to future1

> future1 cannot run (and therefore cannot drop the Mutex) until the task starts running it.

This seems like a contradiction to me. How can future1 acquire the Mutex in the first place, if it cannot run? The word "given" is really odd to me.

Why would do_async_thing() not immediately run the prints, return, and drop the lock after acquiring it? Why does future1 need to be "polled" for that to happen? I get that due to the select! behavior, the result of future1 is not consumed, but I don't understand how that prevents it from releasing the mutex.

It's more typical in my experience that the act of granting the lock to a thread is what makes it runnable, and it runs right then. Having to take some explicit second action to make that happen seems fundamentally broken to me...

EDIT: Rephrased for clarity.

replies(2): >>45779516 #>>45781872 #

oconnor663 ◴[01 Nov 25 05:49 UTC] No.45779516[source]▶

>>45779372 #

> This seems like a contradiction to me. How can future1 acquire the Mutex in the first place, if it cannot run? The word "given" is really odd to me.

`future1` did run for a bit, and it got far enough to acquire the mutex. (As the article mentioned, technically it took a position in a queue that means it will get the mutex, but that's morally the same thing here.) Then it was "paused". I put "paused" in scare quotes because it kind of makes futures sound like processes or threads, which have a "life of their own" until/unless something "interrupts" them, but an important part of this story is that Rust futures aren't really like that. When you get down to the details, they're more like a struct or a class that just sits there being data unless you call certain methods on it (repeatedly). That's what the `.await` keyword does for you, but when you use more interesting constructs like `select!`, you start to get more of the details in your face.

It's hard to be more concrete than that without getting into an overwhelming amount of detail. I wrote a set of blog posts that try to cover it without hand-waving the details away, but they're not short, and they do require some Rust background: https://jacko.io/async_intro.html
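A rough, std-only sketch of the "struct that just sits there" point above. Nothing here comes from the RFD or from tokio: `NoopWake`, `YieldOnce`, and `poll_by_hand` are names made up for illustration, and the waker deliberately does nothing because the future is driven entirely by hand.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical waker that ignores wake-ups; we poll manually instead.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Hypothetical helper: Pending on the first poll, Ready on the second.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

// Drives one future by calling `poll` explicitly and records what ran when.
fn poll_by_hand() -> Vec<&'static str> {
    let log = RefCell::new(Vec::new());
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    {
        // Creating the future runs none of its body.
        let mut fut = Box::pin(async {
            log.borrow_mut().push("step 1");
            YieldOnce(false).await;
            log.borrow_mut().push("step 2");
        });
        log.borrow_mut().push("created");

        // First poll: runs up to the await point, then the future is "paused".
        assert!(fut.as_mut().poll(&mut cx).is_pending());
        log.borrow_mut().push("paused");

        // "step 2" happens only because someone polls again.
        assert!(fut.as_mut().poll(&mut cx).is_ready());
    }
    log.into_inner()
}
```

Calling `poll_by_hand()` yields the log `["created", "step 1", "paused", "step 2"]`. If the second `poll` call were removed, "step 2" would never be recorded, no matter how long the program kept running.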

replies(1): >>45782949 #

jcalvinowens ◴[01 Nov 25 16:26 UTC] No.45782949[source]▶

>>45779516 #

So my understanding was correct: it requires the programmer to deal with scheduling explicitly in userspace.

If I'm writing bare metal code for e.g. a little cortex M0, I can very much see the utility of this abstraction.

But it seems like an absolutely absurd exercise for code running in userspace on a "real" OS like Linux. There should be some simpler intermediate abstraction... this seems like a case of forcing a too-complex interface on users who don't really require it.

replies(2): >>45783296 #>>45783439 #

oconnor663 ◴[01 Nov 25 17:21 UTC] No.45783439[source]▶

>>45782949 #

To be clear, if you restrict yourself to `async`/`.await` syntax, you never see any of this. To await something means to poll it to completion, which is usually what you want. "Joining" two futures lets you poll both of them concurrently until they're both done, which is kind of the point of async as a concept, and this also doesn't really require you to think about scheduling. One place where things get hairy (like in this article) is "selecting" on futures, which polls them all until one of them is done, and then stops polling the rest. (Normally I'd loosely say it "drops the rest on the floor", but the deadlock in this article actually hinges on exactly what gets "dropped" when, in the Rust sense of the `Drop` trait.) This is where scheduling, as you put it, or "cancellation", as Rust folks often put it, starts to become important. And that's why the article concludes "In the end, you should always be extremely careful with tokio::select!" However, `select!` is not the only construct that raises these issues. Speaking of which...
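To make "stops polling the rest" concrete, here is a toy, std-only sketch. It is not `tokio::select!` and not the code from the RFD: the "select" is two hand-written `poll` calls, and `Guard` is a made-up stand-in for a lock guard (a shared flag that is `true` while the guard exists). `NoopWake`, `YieldOnce`, and `select_then_stop_polling` are likewise hypothetical names.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical waker that ignores wake-ups; we poll manually instead.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Hypothetical helper: Pending on the first poll, Ready on the second.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

// Toy lock guard (an assumption for illustration): flag is true while held.
struct Guard(Rc<Cell<bool>>);
impl Guard {
    fn acquire(flag: &Rc<Cell<bool>>) -> Guard {
        flag.set(true);
        Guard(flag.clone())
    }
}
impl Drop for Guard {
    fn drop(&mut self) {
        self.0.set(false);
    }
}

// Returns (guard held after the "select", guard held after dropping the loser).
fn select_then_stop_polling() -> (bool, bool) {
    let held = Rc::new(Cell::new(false));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);

    // Takes the guard, then needs one more poll before it can release it.
    let flag = held.clone();
    let mut holder = Box::pin(async move {
        let _guard = Guard::acquire(&flag);
        YieldOnce(false).await;
    });
    let mut winner = Box::pin(async {});

    // Hand-rolled "select": poll each future once; `winner` finishes first.
    let holder_done = holder.as_mut().poll(&mut cx).is_ready();
    let winner_done = winner.as_mut().poll(&mut cx).is_ready();
    assert!(!holder_done && winner_done);

    // Nobody polls `holder` again, yet it still exists and still owns the guard.
    let held_while_alive = held.get();
    // Dropping the future runs `Guard::drop`, which releases the flag.
    drop(holder);
    let held_after_drop = held.get();
    (held_while_alive, held_after_drop)
}
```

`select_then_stop_polling()` returns `(true, false)`: the losing future keeps its guard for as long as it is alive but unpolled, and releases it only when it is dropped. Anything that waits on that guard in the meantime, without polling or dropping `holder`, would wait forever.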

> But it seems like an absolutely absurd exercise for code running in userspace on a "real" OS like Linux

Clearly you have a point here, which is why these blog posts are making an impact. That said, one counterpoint is, have you ever wished you could kill a thread? The reason there are so many old Raymond Chen "How many times does it have to be said: Never call TerminateThread" blog posts, is that lots of real world applications really desperately want to call TerminateThread, and it's hard to persuade them to stop! The ability to e.g. put a timeout on any async function call is basically this same superpower, without corrupting your whole process (yay), but still with the unavoidable(?) difficulty of thinking about what happens when random functions give up halfway through.
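A small sketch of what "giving up halfway through" looks like, under loud assumptions: `with_poll_budget` is a made-up stand-in for a timeout that counts polls instead of watching a clock, and `transfer` is a hypothetical two-step operation. Real timeout helpers in async runtimes work differently in detail, but the effect on the wrapped future is the same: it is dropped at whichever await point it last reached.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical waker that ignores wake-ups; we poll manually instead.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Hypothetical helper: Pending on the first poll, Ready on the second.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

// Toy "timeout": poll at most `max_polls` times, then drop the future.
fn with_poll_budget<F: Future>(fut: F, max_polls: usize) -> Option<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    for _ in 0..max_polls {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return Some(out);
        }
    }
    None // `fut` is dropped here, wherever it happened to be paused
}

// Hypothetical two-step operation with an await point in the middle.
async fn transfer(ledger: &RefCell<Vec<&'static str>>) {
    ledger.borrow_mut().push("debit");
    YieldOnce(false).await;
    ledger.borrow_mut().push("credit");
}

// Runs `transfer` once with too small a budget and once with enough.
fn cancelled_halfway() -> (Vec<&'static str>, Vec<&'static str>) {
    let cut_short = RefCell::new(Vec::new());
    let timed_out = with_poll_budget(transfer(&cut_short), 1).is_none();

    let complete = RefCell::new(Vec::new());
    let finished = with_poll_budget(transfer(&complete), 2).is_some();

    assert!(timed_out && finished);
    (cut_short.into_inner(), complete.into_inner())
}
```

`cancelled_halfway()` returns `(["debit"], ["debit", "credit"])`. The cancelled call never panics or corrupts memory, but it leaves the debit without its matching credit, which is the kind of half-finished state the caller has to reason about.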
