Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)

421 points bcantrill | 1 comment | 31 Oct 25 16:49 UTC

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

Matthias247 [31 Oct 25 22:03 UTC] No.45777191

>>45774086 (OP)

As far as I remember from building these things with others in the async Rust ecosystem (hey Eliza!), there was a certain tradeoff: if you weren't able to select on references, you couldn't run into this issue. However, you also wouldn't be able to use select! in a while loop and try to acquire the same lock (or read from the same channel) without losing your position in the queue.
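A std-only sketch of that tradeoff, assuming a hypothetical FIFO "take a number" queue in place of a fair async mutex or channel (`FairQueue` and `Acquire` are illustrative names, not Tokio's or any crate's API). A future that stays alive and is polled by `&mut` across loop iterations keeps its ticket; one rebuilt on every iteration gives its ticket up when dropped and rejoins at the back. The hand-rolled no-op waker only lets the futures be stepped by hand without an executor.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Hypothetical toy FIFO "take a number" queue, standing in for a fair lock or channel.
#[derive(Clone, Default)]
struct FairQueue(Rc<RefCell<State>>);

#[derive(Default)]
struct State { next_ticket: u64, waiters: VecDeque<u64> }

impl FairQueue {
    // Join the back of the line and return our ticket.
    fn enqueue(&self) -> u64 {
        let mut st = self.0.borrow_mut();
        let t = st.next_ticket;
        st.next_ticket += 1;
        st.waiters.push_back(t);
        t
    }
    // Leave the line, either because we were served or because we gave up.
    fn leave(&self, ticket: u64) {
        self.0.borrow_mut().waiters.retain(|&t| t != ticket);
    }
}

// Takes a ticket on first poll; ready once that ticket reaches the front.
struct Acquire {
    q: FairQueue,
    ticket: Option<u64>,
}

impl Future for Acquire {
    type Output = u64;
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u64> {
        let this = self.get_mut();
        if this.ticket.is_none() {
            this.ticket = Some(this.q.enqueue());
        }
        let ticket = this.ticket.unwrap();
        let front = this.q.0.borrow().waiters.front().copied();
        if front == Some(ticket) {
            this.q.leave(ticket);
            this.ticket = None; // served: nothing left to give up on drop
            Poll::Ready(ticket)
        } else {
            Poll::Pending
        }
    }
}

// Dropping an unfinished Acquire gives up its place in line.
impl Drop for Acquire {
    fn drop(&mut self) {
        if let Some(t) = self.ticket { self.q.leave(t); }
    }
}

fn noop_waker() -> Waker {
    unsafe fn noop_clone(_: *const ()) -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    unsafe fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(noop_clone, noop, noop, noop);
    // SAFETY: no vtable function touches the (null) data pointer.
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Polls once with a do-nothing waker, which is enough to step futures by hand.
fn poll_once<F: Future + Unpin>(fut: &mut F) -> Poll<F::Output> {
    let waker = noop_waker();
    Pin::new(fut).poll(&mut Context::from_waker(&waker))
}

// Like selecting on `&mut acq` in a loop: one future survives across iterations.
fn reuse_one_future() -> Poll<u64> {
    let q = FairQueue::default();
    let ahead = q.enqueue(); // ticket 0: someone is already in line
    let mut acq = Acquire { q: q.clone(), ticket: None };
    let _ = poll_once(&mut acq); // iteration 1: we take ticket 1 and wait
    q.enqueue(); // a later arrival takes ticket 2
    q.leave(ahead); // ticket 0 is served
    poll_once(&mut acq) // iteration 2: still ticket 1, now at the front
}

// Like building a fresh future each iteration: the old one is dropped.
fn recreate_each_iteration() -> Poll<u64> {
    let q = FairQueue::default();
    let ahead = q.enqueue();
    let _ = poll_once(&mut Acquire { q: q.clone(), ticket: None }); // ticket 1, then dropped
    q.enqueue(); // ticket 2
    q.leave(ahead);
    let mut retry = Acquire { q: q.clone(), ticket: None };
    poll_once(&mut retry) // ticket 3, stuck behind ticket 2
}
```

Here `reuse_one_future()` returns `Poll::Ready(1)`, while `recreate_each_iteration()` returns `Poll::Pending` because the rebuilt future holds ticket 3, behind a waiter that arrived later. Keeping the future alive is what preserves the position, and it is also what lets that future sit in line without being polled.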

I fully agree that this and the cancellation issues discussed before can surprise even seasoned Rust experts. But I'm not sure what can really be improved under the main operating model of async Rust (every future can be dropped).

But compared to working with callbacks, the number of surprises is still rather low :)

octoberfranklin [31 Oct 25 22:20 UTC] No.45777316

>>45777191

> However, you also wouldn't be able to use select! in a while loop and try to acquire the same lock (or read from the same channel) without losing your position in the queue.

No, just have select!() on a bunch of owned Futures return the futures that weren't selected instead of dropping them. Then you don't lose state. Yes, this is awkward, but it's the only logically coherent way. There is probably some macro voodoo that makes it ergonomic. But even this doesn't fix the root cause because dropping an owned Future isn't guaranteed to cancel it cleanly.
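A minimal sketch of that idea for two futures, assuming both are `Unpin` and using hypothetical names (`select_owned`, `SelectOwned`, `Either`). The combinator owns both futures and, when one finishes, hands back the unfinished one instead of dropping it. The `futures` crate's `futures::future::select` has the same shape, resolving to `Either<(A::Output, B), (B::Output, A)>`.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Hypothetical illustration types, not a real crate's API.
enum Either<L, R> {
    Left(L),
    Right(R),
}

// Owns both futures; the one that finishes first comes back with the other, unfinished one.
struct SelectOwned<A, B> {
    a: Option<A>,
    b: Option<B>,
}

fn select_owned<A, B>(a: A, b: B) -> SelectOwned<A, B> {
    SelectOwned { a: Some(a), b: Some(b) }
}

impl<A: Future + Unpin, B: Future + Unpin> Future for SelectOwned<A, B> {
    // Left: `a` won and `b` is handed back. Right: `b` won and `a` is handed back.
    type Output = Either<(A::Output, B), (B::Output, A)>;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let this = self.get_mut();
        let a = this.a.as_mut().expect("polled after completion");
        if let Poll::Ready(out) = Pin::new(a).poll(cx) {
            this.a = None;
            let b = this.b.take().expect("polled after completion");
            return Poll::Ready(Either::Left((out, b)));
        }
        let b = this.b.as_mut().expect("polled after completion");
        if let Poll::Ready(out) = Pin::new(b).poll(cx) {
            this.b = None;
            let a = this.a.take().expect("polled after completion");
            return Poll::Ready(Either::Right((out, a)));
        }
        Poll::Pending
    }
}

fn noop_waker() -> Waker {
    unsafe fn noop_clone(_: *const ()) -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    unsafe fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(noop_clone, noop, noop, noop);
    // SAFETY: no vtable function touches the (null) data pointer.
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Polls once with a do-nothing waker, which is enough to step futures by hand.
fn poll_once<F: Future + Unpin>(fut: &mut F) -> Poll<F::Output> {
    let waker = noop_waker();
    Pin::new(fut).poll(&mut Context::from_waker(&waker))
}

// `a` wins; `b` comes back unpolled, so its state survives and it can still run.
fn rerun_loser() -> Option<(i32, i32)> {
    let mut sel = select_owned(std::future::ready(1), std::future::ready(2));
    match poll_once(&mut sel) {
        Poll::Ready(Either::Left((first, mut loser))) => match poll_once(&mut loser) {
            Poll::Ready(second) => Some((first, second)),
            Poll::Pending => None,
        },
        _ => None,
    }
}
```

Here `rerun_loser()` returns `Some((1, 2))`: the first future wins and the second comes back unpolled, so it can still be driven to completion. Getting the loser back preserves its state; it does not, on its own, stop that future from holding a lock while nobody polls it.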

For the real root cause: https://news.ycombinator.com/item?id=45777234

mycoliza [31 Oct 25 22:26 UTC] No.45777364

>>45777316

> No, just have select!() on a bunch of owned Futures return the futures that weren't selected instead of dropping them. Then you don't lose state.

How does that prevent this kind of deadlock? If the owned future has acquired a mutex, and you return that future from the select so that it might be polled again, and the user assigns it to a variable, then the future that has acquired the mutex but has not completed is still not dropped. This is basically the same as polling an `&mut future`, but with more steps.
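A std-only sketch of that point, assuming a hypothetical single-threaded `ToyLock` (a shared flag, not a real mutex) and a hypothetical `HoldsLock` future that takes the lock on its first poll and then waits on something else. After one poll the future is still alive, whether it was polled through `&mut` or handed back by an owned select, so the lock stays taken until the future is dropped. In real code, if nothing polls that future again while the surrounding task waits for the same lock, the task is stuck.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Hypothetical toy lock: a shared flag, single-threaded, and it never wakes anyone.
#[derive(Clone, Default)]
struct ToyLock(Rc<Cell<bool>>);

// Holding a ToyGuard means holding the lock; dropping it releases the lock.
struct ToyGuard(Rc<Cell<bool>>);

impl Drop for ToyGuard {
    fn drop(&mut self) {
        self.0.set(false);
    }
}

impl ToyLock {
    fn try_lock(&self) -> Option<ToyGuard> {
        if self.0.get() {
            None
        } else {
            self.0.set(true);
            Some(ToyGuard(Rc::clone(&self.0)))
        }
    }
}

// Takes the lock on its first poll, then waits on some other event (here: forever).
struct HoldsLock {
    lock: ToyLock,
    guard: Option<ToyGuard>,
}

impl Future for HoldsLock {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        let this = self.get_mut();
        if this.guard.is_none() {
            this.guard = this.lock.try_lock();
        }
        Poll::Pending
    }
}

fn noop_waker() -> Waker {
    unsafe fn noop_clone(_: *const ()) -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    unsafe fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(noop_clone, noop, noop, noop);
    // SAFETY: no vtable function touches the (null) data pointer.
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Polls once with a do-nothing waker, which is enough to step futures by hand.
fn poll_once<F: Future + Unpin>(fut: &mut F) -> Poll<F::Output> {
    let waker = noop_waker();
    Pin::new(fut).poll(&mut Context::from_waker(&waker))
}

// Another branch won; `fut` was polled via `&mut` (or handed back) and is still alive.
fn can_lock_while_loser_alive() -> bool {
    let lock = ToyLock::default();
    let mut fut = HoldsLock { lock: lock.clone(), guard: None };
    let _ = poll_once(&mut fut);
    // Code after the select now needs the lock, but `fut` still owns the guard.
    let acquired = lock.try_lock().is_some();
    drop(fut);
    acquired
}

// Same, but the losing future is dropped before the lock is needed.
fn can_lock_after_loser_dropped() -> bool {
    let lock = ToyLock::default();
    let mut fut = HoldsLock { lock: lock.clone(), guard: None };
    let _ = poll_once(&mut fut);
    drop(fut); // dropping the future drops its guard and frees the lock
    lock.try_lock().is_some()
}
```

Here `can_lock_while_loser_alive()` returns `false` and `can_lock_after_loser_dropped()` returns `true`.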

octoberfranklin [31 Oct 25 22:37 UTC] No.45777467

>>45777364

> How does that prevent this kind of deadlock?

Like I said, it doesn't:

> even this doesn't fix the root cause because dropping an owned Future isn't guaranteed to cancel it cleanly.

It fixes this:

> However, you also wouldn't be able to use select! in a while loop and try to acquire the same lock (or read from the same channel) without losing your position in the queue.

If you want to fix the root cause, see https://news.ycombinator.com/item?id=45777234
