Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)

421 points bcantrill | 1 comments | 31 Oct 25 16:49 UTC | HN request time: 0.001s | source

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

Show context

singron ◴[31 Oct 25 21:14 UTC] No.45776779[source]▶

>>45774086 (OP) #

This sounds very similar to priority inversion. E.g. if you have Thread T_high running at high priority and thread T_low running at low priority, and T_low holds a lock that T_high wants to acquire, T_high won't get to run until T_low gets scheduled.

The OS can detect this and make T_low "inherit" the priority of T_high. I wonder if there is a similar idea possible with tokio? E.g. if you are awaiting a Mutex held by a future that "can't run", then poll that future instead. I would guess detecting the "can't run" case would require quite a bit of overhead, but maybe it can be done.

I think an especially difficult factor is that you don't even need to use a direct await.

    let future1 = do_async_thing("op1", lock.clone()).boxed();
    tokio::select! {
      _ = &mut future1 => {
        println!("do_stuff: arm1 future finished");
      }
      _ = sleep(Duration::from_millis(500)) => {
        // No .await, but both will futurelock on future1.
        tokio::select! {
          _ = do_async_thing("op2", lock.clone()) => {},
          _ = do_async_thing("op3", lock.clone()) => {},
        };
      }
    };

I.e. so "can't run" detector needs to determine that no other task will run the future, and the future isn't in the current set of things being polled by this task.

replies(5): >>45776870 #>>45777270 #>>45777348 #>>45777553 #>>45781986 #

oconnor663 ◴[31 Oct 25 21:24 UTC] No.45776870[source]▶

>>45776779 #

> I wonder if there is a similar idea possible with tokio? E.g. if you are awaiting a Mutex held by a future that "can't run", then poll that future instead.

Something like this could make sense for Tokio tasks. (I don't know how complicated their task scheduler is; maybe it already does stuff like this?) But it's not possible for futures within a task, as in this post. This goes all the way back to the "futures are inert" design of async Rust: You don't necessarily need to communicate with the runtime at all to create a future or to poll it or to stop polling it. You only need to talk to the runtime at the task level, either to spawn new tasks, or to wake up your own task. Futures are pretty much just plain old structs, and Tokio doesn't know how many futures my async function creates internally, any more than it knows about my integers or strings or hash maps.

replies(2): >>45777185 #>>45777246 #

newpavlov ◴[31 Oct 25 22:10 UTC] No.45777246[source]▶

>>45776870 #

>This goes all the way back to the "futures are inert" design of async Rust

Yeap. And this footgun is yet another addition to the long list of reasons why I consider the Rust async model with its "inert" futures managed in user space a fundamentally flawed un-Rusty design.

replies(1): >>45777690 #

filleduchaos ◴[31 Oct 25 23:04 UTC] No.45777690{3}[source]▶

>>45777246 #

I feel there's a difference between a preference and a flaw. Rust has targets that make anything except inert futures simply unworkable, and in my opinion it's entirely valid for a programming language to prioritise those targets.

replies(1): >>45777971 #

Rusky ◴[31 Oct 25 23:50 UTC] No.45777971{4}[source]▶

>>45777690 #

The requirement is that the futures are not separate heap allocations, not that they are inert.

It's not at all obvious that Rust's is the only possible design that would work here. I strongly suspect it is not.

In fact, early Rust did some experimentation with exactly the sort of stack layout tricks you would need to approach this differently. For example, see Graydon's post here about the original implementation of iterators, as lightweight coroutines: https://old.reddit.com/r/ProgrammingLanguages/comments/141qm...

replies(2): >>45778063 #>>45779540 #

vlovich123 ◴[01 Nov 25 05:57 UTC] No.45779540{5}[source]▶

>>45777971 #

If it’s not inert, how do you use async in the kernel or microcontrollers? A non-inert implementation presumes a single runtime implementation within std+compiler and not usable in environments where you need to implement your own meaning of dispatch.

replies(2): >>45781492 #>>45782254 #

1. Rusky ◴[01 Nov 25 15:10 UTC] No.45782254{6}[source]▶

>>45779540 #

"Not inert" does not at all imply "a single runtime within std+compiler." You've jumped way too far in the opposite direction there.

The problem is that the particular interface Rust chose for controlling dispatch is not granular enough. When you are doing your own dispatch, you only get access to separate tasks, but for individual futures you are at the mercy of combinators like `select!` or `FuturesUnordered` that only have a narrow view of the system.

A better design would continue to avoid heap allocations and allow you to do your own dispatch, but operate in terms of individual suspended leaf futures. Combinators like `join!`/`select!`/etc. would be implemented more like they are in thread-based systems, waiting for sub-tasks to complete, rather than being responsible for driving them.

↑