Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

Show context

singron ◴[31 Oct 25 21:14 UTC] No.45776779[source]▶

>>45774086 (OP) #

This sounds very similar to priority inversion. E.g. if you have Thread T_high running at high priority and thread T_low running at low priority, and T_low holds a lock that T_high wants to acquire, T_high won't get to run until T_low gets scheduled.

The OS can detect this and make T_low "inherit" the priority of T_high. I wonder if there is a similar idea possible with tokio? E.g. if you are awaiting a Mutex held by a future that "can't run", then poll that future instead. I would guess detecting the "can't run" case would require quite a bit of overhead, but maybe it can be done.

I think an especially difficult factor is that you don't even need to use a direct await.

    let future1 = do_async_thing("op1", lock.clone()).boxed();
    tokio::select! {
      _ = &mut future1 => {
        println!("do_stuff: arm1 future finished");
      }
      _ = sleep(Duration::from_millis(500)) => {
        // No .await, but both will futurelock on future1.
        tokio::select! {
          _ = do_async_thing("op2", lock.clone()) => {},
          _ = do_async_thing("op3", lock.clone()) => {},
        };
      }
    };

I.e. so "can't run" detector needs to determine that no other task will run the future, and the future isn't in the current set of things being polled by this task.

replies(5): >>45776870 #>>45777270 #>>45777348 #>>45777553 #>>45781986 #

oconnor663 ◴[31 Oct 25 21:24 UTC] No.45776870[source]▶

>>45776779 #

> I wonder if there is a similar idea possible with tokio? E.g. if you are awaiting a Mutex held by a future that "can't run", then poll that future instead.

Something like this could make sense for Tokio tasks. (I don't know how complicated their task scheduler is; maybe it already does stuff like this?) But it's not possible for futures within a task, as in this post. This goes all the way back to the "futures are inert" design of async Rust: You don't necessarily need to communicate with the runtime at all to create a future or to poll it or to stop polling it. You only need to talk to the runtime at the task level, either to spawn new tasks, or to wake up your own task. Futures are pretty much just plain old structs, and Tokio doesn't know how many futures my async function creates internally, any more than it knows about my integers or strings or hash maps.

replies(2): >>45777185 #>>45777246 #

1. mycoliza ◴[31 Oct 25 22:02 UTC] No.45777185[source]▶

>>45776870 #

Yeah, a coworker coming from Go asked a similar question about why Rust doesn't have something like the Go runtime's deadlock detector. Your comment is quite similar to the explanation I gave him.

Go, unlike Rust, does not really have a notion of intra-task concurrency; goroutines are the fundamental unit of concurrency and parallelism. So, the Go runtime can reason about dependencies between goroutines quite easily, since goroutines are the things which it is responsible for scheduling. The fact that channels are a language construct, rather than a library construct implemented in the language, is necessary for this too. In (async) Rust, on the other hand, tasks are the fundamental unit of parallelism, but not of concurrency; concurrency emerges from the composition of `Future`s, and a single task is a state machine which may execute any number of futures concurrently (but not in parallel), by polling them until they cannot proceed without waiting and then moving on to poll another future until it cannot proceed without waiting. But critically, this is not what the task scheduler sees; it interacts with these tasks as a single top-level `Future`, and is not able to look inside at the nested futures they are composed of.

This specific failure mode can actually only happen when multiple futures are polled concurrently but not in parallel within a single Tokio task. So, there is actually no way for the Tokio scheduler to have insight into this problem. You could imagine a deadlock detector in the Tokio runtime that operates on the task level, but it actually could never detect this problem, because when these operations execute in parallel, it actually cannot occur. In fact, one of the suggestions for how to avoid this issue is to select over spawned tasks rather than futures within the same task.

replies(2): >>45778715 #>>45780306 #

2. mjevans ◴[01 Nov 25 02:10 UTC] No.45778715[source]▶

>>45777185 (TP) #

Thank you. Every time I've tried to approach the concept of Rust's parallelism this is what rubs me the wrong way.

I haven't yet read a way to prove it's correct, or even to reasonably prove a given program's use is not going to block.

With more traditional threads my mental model is that _everything_ always has to be interrupt-able, have some form of engineer chosen timeout for a parallel operation, and address failure of operation in design.

I never see any of that in the toy examples that are presented as educational material. Maybe Rust's async also requires such careful design to be safely utilized.

replies(1): >>45779051 #

3. zenmac ◴[01 Nov 25 03:34 UTC] No.45779051[source]▶

>>45778715 #

Guess Rust is more built for memory safety not concurrency? Erlang maybe? Why can't we just have a language that is memory safe and built for concurrency? Like Ocaml and Erlang combine?

replies(2): >>45779560 #>>45781053 #

4. jitl ◴[01 Nov 25 06:02 UTC] No.45779560{3}[source]▶

>>45779051 #

Are you looking for Gleam? Simple but powerful typed functional language for BEAM and JavaScript. It’s a bit high level compared to Ocaml in terms of needing a thick runtime and being somewhat far from machine code.

Really beautiful language design imo. Does a great job avoiding the typelevel brainfuck problem I have with Haskell.

https://gleam.run/

5. gf000 ◴[01 Nov 25 09:18 UTC] No.45780306[source]▶

>>45777185 (TP) #

I think an important takeaway here that many often ignore is that in language design, not having low-level control over something is sometimes just as important design tradeoff as having it.

From that it also follows that it may not be too fruitful to try to tackle every domain there is with a single language only.

(With that said, I absolutely love sync Rust, and Go is definitely not a good example of an elegantly designed language, I am talking in a more general way here)

6. kibwen ◴[01 Nov 25 12:15 UTC] No.45781053{3}[source]▶

>>45779051 #

Rust is absolutely built for concurrency, even moreso than for memory safety--it just so happens that memory safety is a prerequisite for thread safety. You're going to have a hard time finding any other industrial-strength language that statically prevents data races. If you can use Erlang, then sure, use Erlang. But if you can't use Erlang, and you need concurrency, you're not going to find a better candidate than Rust.

↑