Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)

421 points bcantrill | 2 comments | 31 Oct 25 16:49 UTC

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

Matthias247 ◴[31 Oct 25 22:03 UTC] No.45777191[source]▶

>>45774086 (OP) #

As far as I remember from building these things with others within the async Rust ecosystem (hey Eliza!), there was a certain tradeoff: if you weren't able to select on references, you couldn't run into this issue. However, you also wouldn't be able to use select! in a while loop and try to acquire the same lock (or read from the same channel) without losing your position in the queue.
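To make the "losing your position in the queue" half of that tradeoff concrete, here is a toy model in plain std Rust. It is only a sketch under stated assumptions: `Waiter`, `Queue`, `position`, and `positions_after_loop` are hypothetical names, and the queue is a stand-in for whatever wait list a fair lock or channel keeps internally, not Tokio's actual data structure. A `Waiter` occupies a slot while it is alive and gives it up when dropped, which is roughly what happens to a pending acquire future that a select! loop drops and recreates on every iteration.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::rc::Rc;

// Hypothetical wait list of a fair lock or channel (front is served first).
type Queue = Rc<RefCell<VecDeque<u32>>>;

/// Stand-in for a pending "acquire" future: it holds a queue slot while alive.
struct Waiter {
    id: u32,
    queue: Queue,
}

impl Waiter {
    fn enqueue(id: u32, queue: &Queue) -> Waiter {
        queue.borrow_mut().push_back(id);
        Waiter { id, queue: queue.clone() }
    }
}

impl Drop for Waiter {
    // Dropping the future cancels the wait and frees the slot.
    fn drop(&mut self) {
        let id = self.id;
        self.queue.borrow_mut().retain(|&w| w != id);
    }
}

fn position(queue: &Queue, id: u32) -> Option<usize> {
    queue.borrow().iter().position(|&w| w == id)
}

/// Returns our queue position after three rivals arrive, when our waiter is
/// (kept alive across iterations, dropped and recreated every iteration).
fn positions_after_loop() -> (usize, usize) {
    // Case 1: select on a reference, so the same waiter survives every iteration.
    let q1 = Queue::default();
    let _ours = Waiter::enqueue(1, &q1);
    let _rivals: Vec<Waiter> = (10..13).map(|id| Waiter::enqueue(id, &q1)).collect();
    let kept = position(&q1, 1).unwrap();

    // Case 2: select by value, so the losing waiter is dropped and a new one
    // joins at the back on the next iteration.
    let q2 = Queue::default();
    let mut ours = Waiter::enqueue(1, &q2);
    let mut rivals = Vec::new();
    for id in 10..13 {
        rivals.push(Waiter::enqueue(id, &q2));
        drop(ours);
        ours = Waiter::enqueue(1, &q2);
    }
    let recreated = position(&q2, 1).unwrap();
    (kept, recreated)
}
```

Calling `positions_after_loop()` from a `main` returns `(0, 3)` in this model: the kept waiter stays at the front, while the recreated one ends up behind every rival that arrived in the meantime.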

I fully agree that this and the cancellation issues discussed before can lead to surprising issues even for seasoned Rust experts. But I’m not sure what really can be improved under the main operating model of async Rust (every future can be dropped).

But compared to working with callbacks the amount of surprising things is still rather low :)

replies(2): >>45777316 #>>45777340 #

mycoliza ◴[31 Oct 25 22:24 UTC] No.45777340[source]▶

>>45777191 #

Indeed, you are correct (and hi Matthias!). After we got to the bottom of this deadlock, my coworkers and I had one of our characteristic "how could we have prevented this?" conversations, and reached the somewhat sad conclusion that actually, there was basically nothing we could easily blame for this. All the Tokio primitives involved were working precisely as they were supposed to. The only thing that would have prevented this without completely re-designing Rust's async from the ground up would be to ban the use of `&mut future`s in `select!`...but that eliminates a lot of correct code, too. Not being able to do that would make it pretty hard to express a lot of things that many applications might reasonably want to express, as you described. I discussed this a bit in this comment[1] as well.

On the other hand, the coworker who had written the code where we found the bug wasn't to blame, either. It wasn't a case of sloppy programming; he had done everything correctly and put the pieces together the way you were supposed to. All the pieces worked as they were supposed to, and his code seemed to be using them correctly, but the interaction of these pieces resulted in a deadlock that it would have been very difficult for him to anticipate.
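For readers who want to see the shape of that interaction, here is a minimal std-only model of it. This is a sketch, not the Omicron code and not Tokio's API: `TicketLock`, `Acquire`, `NoopWake`, and `futurelock_demo` are hypothetical names, the lock is a toy that serves tickets strictly in order, and the "task" is simulated by polling futures by hand. The assumption being modeled is a fair lock that hands ownership to the next queued waiter, combined with a task that has stopped polling that waiter without dropping it.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; the demo polls by hand.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical toy fair lock: tickets are served strictly in order.
#[derive(Default)]
struct TicketLock {
    next_ticket: Cell<u64>,
    now_serving: Cell<u64>,
}

/// Joins the queue on first poll and becomes ready when its ticket is served.
struct Acquire {
    lock: Rc<TicketLock>,
    ticket: Option<u64>,
}

impl Future for Acquire {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        let lock = self.lock.clone();
        let ticket = *self.ticket.get_or_insert_with(|| {
            let t = lock.next_ticket.get();
            lock.next_ticket.set(t + 1);
            t
        });
        if lock.now_serving.get() == ticket { Poll::Ready(()) } else { Poll::Pending }
    }
}

/// Returns (future2 was stuck while future1 went unpolled,
///          future2 completed once future1 was polled and released).
fn futurelock_demo() -> (bool, bool) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let lock = Rc::new(TicketLock::default());
    let acquire = |l: &Rc<TicketLock>| Acquire { lock: l.clone(), ticket: None };
    let release = |l: &TicketLock| l.now_serving.set(l.now_serving.get() + 1);

    // Some other task takes ticket 0 and holds the lock.
    let mut holder = acquire(&lock);
    assert!(Pin::new(&mut holder).poll(&mut cx).is_ready());

    // future1 is polled once by reference (like `&mut fut` in select!): ticket 1.
    let mut future1 = acquire(&lock);
    assert!(Pin::new(&mut future1).poll(&mut cx).is_pending());

    // Another branch wins, and its body awaits the same lock: ticket 2.
    let mut future2 = acquire(&lock);
    assert!(Pin::new(&mut future2).poll(&mut cx).is_pending());

    // The holder releases; the lock is now reserved for ticket 1 (future1).
    release(&lock);

    // The task only polls future2 from here on. future1 is alive but unpolled.
    let stuck = (0..1000).all(|_| Pin::new(&mut future2).poll(&mut cx).is_pending());

    // Only polling future1 (and releasing again) lets future2 proceed.
    let first_ready = Pin::new(&mut future1).poll(&mut cx).is_ready();
    release(&lock);
    let done = first_ready && Pin::new(&mut future2).poll(&mut cx).is_ready();
    (stuck, done)
}
```

In this model `futurelock_demo()` returns `(true, true)`: no individual piece misbehaves, yet the second waiter can never make progress while the first one sits unpolled at the head of the queue.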

So, our conclusion was, wow, this just kind of sucks. Not an indictment of async Rust as a whole, but an unfortunate emergent behavior arising from an interaction of individually well-designed pieces. Just something you gotta watch out for, I guess. And that's pretty sad to have to admit.

[1] https://news.ycombinator.com/item?id=45776868

replies(2): >>45781082 #>>45781406 #

kibwen ◴[01 Nov 25 12:21 UTC] No.45781082[source]▶

>>45777340 #

> All the Tokio primitives involved were working precisely as they were supposed to. The only thing that would have prevented this without completely re-designing Rust's async from the ground up would be to ban the use of `&mut future`s in `select!`...but that eliminates a lot of correct code, too.

But it still suggests that `tokio::select` is too powerful. You don't need to get rid of `tokio::select`, you just need to consider creating a less powerful mechanism that doesn't risk exhibiting this problem. Then you could use that less powerful mechanism in the places where you don't need the full power of `tokio::select`, thereby reducing the possible places where this bug could arise. You don't need to get rid of the fully powerful mechanism, you just need to make it optional.

replies(1): >>45781622 #

tux3 ◴[01 Nov 25 13:48 UTC] No.45781622[source]▶

>>45781082 #

I feel like select!() is a good case study because the common future timeout use-case maps pretty closely to a select!(), so there is only so much room to weaken it.

The ways I can think of for making select!() safer all involve runtime checks and allocations (possibly this is just a failure of my imagination!). But if that's the case, I would find it bothersome if our basic async building blocks like select/timeout in practice turn out to require more expensive runtime checks or allocations to be safe.

We have a point in the async design space where we pay a complexity price, but in exchange we get really neat zero-cost futures. But I feel like we only get our money's worth if we can actually statically prove that correct use won't deadlock, without the expensive runtime checks! Otherwise, can we afford to spend all this complexity budget?

The implementation of select!() does feel way too powerful in a way (it's a whole mini scheduler that creates implicit future dependencies hidden from the rest of the executor, and then sometimes this deadlocks!). But the need is pretty foundational, it shows up everywhere as a building block.

replies(1): >>45781785 #

kibwen ◴[01 Nov 25 14:13 UTC] No.45781785[source]▶

>>45781622 (TP) #

It feels to me like there's plenty of design space to explore. Sure, it's possible to view "selection" as a basic building block, but even that is insufficiently precise IMO. There's a reason that JavaScript provides all of Promise.any and Promise.all and Promise.allSettled and Promise.race. Selection isn't just a single building block, it's an entire family of building blocks with distinct semantics.
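As a rough illustration of two members of that family, here is a std-only Rust sketch of a "race" combinator (first result wins, the loser is dropped) next to a "join" combinator (wait for both). The names `race`, `join`, `after`, `block_on`, and `NoopWake` are hypothetical helpers written for this example, not Tokio or `futures` APIs, and the comparison to `Promise.race` and `Promise.all` is only an analogy, since promises are not cancelled the way dropped Rust futures are.

```rust
use std::future::{poll_fn, Future};
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; `block_on` below simply polls in a loop.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Busy-polling executor, just enough for a demo.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

/// Resolves to `value` after returning Pending `polls` times.
async fn after(polls: u32, value: &'static str) -> &'static str {
    let mut left = polls;
    poll_fn(move |_cx| {
        if left == 0 {
            Poll::Ready(())
        } else {
            left -= 1;
            Poll::Pending
        }
    })
    .await;
    value
}

/// "race": the first future to finish wins; the loser is dropped (cancelled).
async fn race<T>(a: impl Future<Output = T>, b: impl Future<Output = T>) -> T {
    let mut a = pin!(a);
    let mut b = pin!(b);
    poll_fn(|cx| {
        if let Poll::Ready(v) = a.as_mut().poll(cx) {
            return Poll::Ready(v);
        }
        b.as_mut().poll(cx)
    })
    .await
}

/// "join": keep polling both until each has finished; nothing is cancelled.
async fn join<A, B>(a: impl Future<Output = A>, b: impl Future<Output = B>) -> (A, B) {
    let mut a = pin!(a);
    let mut b = pin!(b);
    let (mut ra, mut rb) = (None, None);
    poll_fn(|cx| {
        if ra.is_none() {
            if let Poll::Ready(v) = a.as_mut().poll(cx) {
                ra = Some(v);
            }
        }
        if rb.is_none() {
            if let Poll::Ready(v) = b.as_mut().poll(cx) {
                rb = Some(v);
            }
        }
        if ra.is_some() && rb.is_some() { Poll::Ready(()) } else { Poll::Pending }
    })
    .await;
    (ra.unwrap(), rb.unwrap())
}
```

With these helpers, `block_on(race(after(3, "slow"), after(1, "fast")))` yields `"fast"`, while `block_on(join(after(3, "slow"), after(1, "fast")))` yields `("slow", "fast")`. Neither of these takes futures by reference, so neither can leave a started-but-unpolled future alive after it returns, which is the kind of weaker building block the comment is gesturing at.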
