Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)

421 points bcantrill | 1 comments | 31 Oct 25 16:49 UTC

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

quietbritishjim ◴[31 Oct 25 23:46 UTC] No.45777951[source]▶

>>45774086 (OP) #

Wow, it is simply outrageous that Rust doesn't just allow all active tasks to make progress. It creates a whole class of incomprehensible bugs, like this one, for no reason. Can any Rust experts explain why it's done this way? It seems like an unforced error.

In Python, I often use the Trio library, which offers "structured concurrency": tasks are (only) spawned into lexical scopes, and they are all completed (waited for) before that scope is left. That includes waiting for any cancelled tasks (which are allowed to do useful async work, including waiting for any of their own task scopes to complete).

Could Rust do something like that? It's far easier to reason about than traditional async programs, which seems up Rust's street. As a bonus it seems to solve this problem, since a Rust equivalent would presumably have all tasks implicitly polled by their owning scope.

replies(2): >>45777974 #>>45777999 #

duped ◴[31 Oct 25 23:51 UTC] No.45777974[source]▶

>>45777951 #

So there's a distinction between a task and a future. A future doesn't do anything until it's polled, and since there's nothing special about async runtimes (it's just user level code), it's always possible to create futures and never poll them, or stop polling them.
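To make the "doesn't do anything until it's polled" point concrete, here is a minimal sketch using only the standard library. The names `NoopWake` and `lazy_future_demo` are made up for illustration, and the future is polled by hand rather than by a real runtime.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A waker that does nothing; enough for polling a future by hand.
/// (Hypothetical helper, not part of any runtime.)
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Returns whether the future's body had run (before the first poll, after it).
fn lazy_future_demo() -> (bool, bool) {
    let ran = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&ran);

    // Creating the future runs none of its body.
    let fut = async move {
        flag.store(true, Ordering::SeqCst);
    };
    let before = ran.load(Ordering::SeqCst);

    // Only an explicit poll drives the body forward.
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    let result = fut.as_mut().poll(&mut cx);
    assert!(matches!(result, Poll::Ready(())));

    let after = ran.load(Ordering::SeqCst);
    (before, after)
}
```

Calling `lazy_future_demo()` should show the flag unset before the poll and set after it. If nobody ever calls `poll`, the body never runs at all, which is the property the rest of this thread turns on.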

A task is a different construct and usually tied to the runtime. If you look at the suggestions in the RFD they call out using a task explicitly instead of polling a future in place.

There's some debate to be had over what constitutes "cancellation." The article and most colloquial definitions I've heard define it as a future being dropped before being polled to completion. Which is very clean - if you want to cancel a future, just drop it. Since Rust strongly encourages RAII, cleanup can go in drop implementations.

A much tougher definition of cancellation is "the future is never polled again" which is what the article hits on. The future isn't dropped but its poll is also unreachable, hence the deadlock.
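The two definitions can be contrasted in a small std-only sketch. Everything here is an assumption made for illustration: `ToyLock`, `Acquire`, `Guard`, and `YieldOnce` are hand-rolled stand-ins, not tokio's types and not the code from the RFD. One future takes the lock and suspends; while it is alive but unpolled, nothing else can take the lock, and dropping it releases the guard.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::{pin, Pin};
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy async lock: a shared "is locked" flag (illustration only).
#[derive(Clone)]
struct ToyLock(Rc<Cell<bool>>);

/// RAII guard: releases the lock when dropped.
struct Guard(Rc<Cell<bool>>);
impl Drop for Guard {
    fn drop(&mut self) {
        self.0.set(false);
    }
}

/// Future that resolves to a guard once the lock is free.
struct Acquire(Rc<Cell<bool>>);
impl Future for Acquire {
    type Output = Guard;
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Guard> {
        if self.0.get() {
            Poll::Pending
        } else {
            self.0.set(true);
            Poll::Ready(Guard(Rc::clone(&self.0)))
        }
    }
}

impl ToyLock {
    fn lock(&self) -> Acquire {
        Acquire(Rc::clone(&self.0))
    }
}

/// Pending on the first poll, ready on the second.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

/// Returns (second waiter stuck while first is unpolled, unblocked after drop).
fn futurelock_demo() -> (bool, bool) {
    let lock = ToyLock(Rc::new(Cell::new(false)));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);

    // First future: takes the lock, then suspends while holding the guard.
    let l1 = lock.clone();
    let mut first = Box::pin(async move {
        let _guard = l1.lock().await;
        YieldOnce(false).await;
    });
    assert!(first.as_mut().poll(&mut cx).is_pending());

    // From here on `first` is alive but never polled again.
    let mut second = pin!(lock.lock());
    let stuck = second.as_mut().poll(&mut cx).is_pending();

    // Cancellation in the "drop" sense runs the guard's destructor.
    drop(first);
    let unblocked = second.as_mut().poll(&mut cx).is_ready();
    (stuck, unblocked)
}
```

As long as `first` is neither polled nor dropped, every poll of `second` returns `Pending`. That is the shape of the deadlock described above, reduced to a single thread with no runtime.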

replies(2): >>45778028 #>>45778044 #

quietbritishjim ◴[01 Nov 25 00:02 UTC] No.45778044[source]▶

>>45777974 #

Interesting, thanks. So is it fair to say that if tokio::select!() only accepted tasks (or implicitly turned any futures it receives into tasks, like Python's asyncio.gather() does) then it wouldn't have this problem? Or, even if the async runtime is careful, is it still possible to create and fail to poll a raw Future by accident?

replies(2): >>45778247 #>>45781799 #

1. duped ◴[01 Nov 25 14:15 UTC] No.45781799{3}[source]▶

>>45778044 #

It's always possible to create a future that is never polled, and this is a feature of Rust's zero-cost abstraction for async/await. If tokio::select! required tasks, it would be a lot less useful.

This problem would have been avoided by taking the future by value instead of by reference.
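A rough std-only sketch of why by-value versus by-reference matters. The helper `poll_once_and_return` is hypothetical and only imitates one aspect of a select-like construct: it polls its argument once and then returns, as happens when some other branch wins. It is not how tokio::select! is implemented. `Held` stands in for a resource such as a lock guard.

```rust
use std::cell::Cell;
use std::future::{self, Future};
use std::pin::pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Stand-in for a held resource such as a lock guard (illustration only).
struct Held(Rc<Cell<bool>>);
impl Drop for Held {
    fn drop(&mut self) {
        self.0.set(false);
    }
}

/// A future that takes the resource and then waits forever.
fn holder(flag: Rc<Cell<bool>>) -> impl Future<Output = ()> {
    async move {
        flag.set(true);
        let _held = Held(flag);
        future::pending::<()>().await;
    }
}

/// Polls `fut` once, then returns. Whatever was passed in is dropped here:
/// the future itself if passed by value, only the reference if passed as `&mut`.
fn poll_once_and_return<F: Future>(fut: F) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    let _ = fut.as_mut().poll(&mut cx);
}

/// Returns (resource still held after by-value call, after by-reference call).
fn by_value_vs_by_reference() -> (bool, bool) {
    // By value: the future is dropped inside the helper, releasing the resource.
    let a = Rc::new(Cell::new(false));
    poll_once_and_return(holder(Rc::clone(&a)));
    let held_after_by_value = a.get();

    // By reference: the future outlives the helper and keeps the resource.
    let b = Rc::new(Cell::new(false));
    let mut fut = Box::pin(holder(Rc::clone(&b)));
    poll_once_and_return(&mut fut);
    let held_after_by_ref = b.get();

    (held_after_by_value, held_after_by_ref)
}
```

In the by-value case the resource is released as soon as the helper returns. In the by-reference case the caller still owns a suspended future that holds the resource, and it is up to the caller to keep polling it or drop it.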
