Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)

431 points bcantrill | 1 comments | 31 Oct 25 16:49 UTC | HN request time: 0s | source

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

Show context

jcalvinowens ◴[01 Nov 25 05:00 UTC] No.45779372[source]▶

>>45774086 (OP) #

I have very little Rust experience... but I'm hung up on this:

> The lock is given to future1

> future1 cannot run (and therefore cannot drop the Mutex) until the task starts running it.

This seems like a contradiction to me. How can future1 acquire the Mutex in the first place, if it cannot run? The word "given" is really odd to me.

Why would do_async_thing() not immediately run the prints, return, and drop the lock after acquiring it? Why does future1 need to be "polled" for that to happen? I get that due to the select! behavior, the result of future1 is not consumed, but I don't understand how that prevents it from releasing the mutex.

It's more typical in my experience that the act of granting the lock to a thread is what makes it runnable, and it runs right then. Having to take some explicit second action to make that happen seems fundamentally broken to me...

EDIT: Rephrased for clarity.

replies(2): >>45779516 #>>45781872 #

dap ◴[01 Nov 25 14:24 UTC] No.45781872[source]▶

>>45779372 #

Your confusion is very natural:

> It's more typical in my experience that the act of granting the lock to a thread is what makes it runnable, and it runs right then.

This gets at why this felt like a big deal when we ran into this. This is how it would work with threads. Tasks and futures hook into our intuitive understanding of how things work with threads. (And for tasks, that's probably still a fair mental model, as far as I know.) But futures within a task are different because of the inversion of control: tasks must poll them for them to keep running. The problem here is that the task that's responsible for polling this future has essentially forgotten about it. The analogous thing with threads would seem to be something like if the kernel forgot to enqueue some runnable thread on a run queue.

replies(1): >>45782811 #

jcalvinowens ◴[01 Nov 25 16:11 UTC] No.45782811[source]▶

>>45781872 #

> tasks must poll them for them to keep running.

So async Rust introduces an entire novel class of subtle concurrent programming errors? Ugh, that's awful.

> The analogous thing with threads would seem to be something like if the kernel forgot to enqueue some runnable thread on a run queue.

Yes. But I've never written code in a preemptible protected mode environment like Linux userspace where it is possible to make that mistake. That's nuts to me.

From my POV this seems like a fundamental design flaw in async rust. Like, on a bare metal thing I expect to deal with stuff like this... but code running on a real OS shouldn't have to.

replies(2): >>45783441 #>>45783571 #

1. dap ◴[01 Nov 25 17:21 UTC] No.45783441[source]▶

>>45782811 #

I definitely hear that!

To keep it in perspective, though: we've been operating a pretty good size system that's heavy on async Rust for a few years now and this is the first we've seen this problem. Hitting it requires a bunch of things (programming patterns and runtime behavior) to come together. It's really unfortunate that there aren't guard rails here, but it's not like people are hitting this all over the place.

The thing is that the alternatives all have tradeoffs, too. With threaded systems, there's no distinction in code between stuff that's quick vs. stuff that can block, and that makes it easy to accidentally do time-consuming (blocking) work in contexts that don't expect it (e.g., a lock held). With channels / message passing / actors, having the receiver/actor go off and do something expensive is just as bad as doing something expensive with a lock held. There are environments that take this to the extreme where you can't even really block or do expensive things as an actor, but there the hidden problem is often queueing and backpressure (or lack thereof). There's just no free lunch.

I'd certainly think carefully in choosing between sync vs. async Rust. But we've had a lot fewer issues with both of these than I've had in my past experience working on threaded systems in C and Java and event-oriented systems in C and Node.js.

↑