←back to thread

Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)
427 points bcantrill | 9 comments | | HN request time: 1.116s | source | bottom

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

Show context
quietbritishjim ◴[] No.45777951[source]
Wow, it is simply outrageous that Rust doesn't just allow all active tasks to make progress. It creates a whole class of incomprehensible bugs, like this one, for no reason. Can any Rust experts explain why it's done this way? It seems like an unforced error.

In Python, I often use the Trio library, which offers "structured, concurrency": tasks are (only) spawned into lexical scopes, and they are all completed (waited for) before that scope is left. That includes waiting for any cancelled tasks (which are allowed to do useful async work, including waiting for any of their own task scopes to complete).

Could Rust do something like that? It's far easier to reason about than traditional async programs, which seems up Rust's street. As a bonus it seems to solve this problem, since a Rust equivalent would presumably have all tasks implicitly polled by their owning scope.

replies(2): >>45777974 #>>45777999 #
1. duped ◴[] No.45777974[source]
So there's a distinction between a task and a future. A future doesn't do anything until it's polled, and since there's nothing special about async runtimes (it's just user level code), it's always possible to create futures and never poll them, or stop polling them.

A task is a different construct and usually tied to the runtime. If you look at the suggestions in the RFD they call out using a task explicitly instead of polling a future in place.

There's some debate to be had over what constitutes "cancellation." The article and most colloquial definitions I've heard define it as a future being dropped before being polled to completion. Which is very clean - if you want to cancel a future, just drop it. Since Rust strongly encourages RAII, cleanup can go in drop implementations.

A much tougher definition of cancellation is "the future is never polled again" which is what the article hits on. The future isn't dropped but its poll is also unreachable, hence the deadlock.

replies(2): >>45778028 #>>45778044 #
2. wrs ◴[] No.45778028[source]
I wish "cancellation" wasn't used for both of those. It seems to obfuscate understanding quite a bit. We should call them "dropped" and "abandoned" or something.
replies(1): >>45778065 #
3. quietbritishjim ◴[] No.45778044[source]
Interesting, thanks. So is it fair to say that if tokio::select!() only accepted tasks (or implicitly turned any futures it receives into tasks, like Python's asyncio.gather() does) then it wouldn't have this problem? Or, even if the async runtime is careful, is it still possible to create and fail to poll a raw Future by accident?
replies(2): >>45778247 #>>45781799 #
4. duped ◴[] No.45778065[source]
I don't think anyone really calls the latter "cancellation" in practice. I'm just pointing out that "is never polled again" is the tricky state with futures.
5. NobodyNada ◴[] No.45778247[source]
> if tokio::select!() only accepted tasks (or implicitly turned any futures it receives into tasks, like Python's asyncio.gather() does) then it wouldn't have this problem?

Yes, this is correct. However, many of the use cases for select rely on the fact that it doesn't run all the tasks to completion. I've written many a select! statement to implements timeouts or other forms of intentionally preempting a task. Sometimes I want to cancel the task and sometimes I want to resume it after dealing with the condition that caused the preemption -- so the behavior in the article is very much an intentional feature.

> even if the async runtime is careful, is it still possible to create and fail to poll a raw Future by accident?

This is also the case. There's nothing magic about a future; it's just an ordinary object with a poll function. Any code can create a future and do whatever it likes with it; including polling it a few times and then stopping.

Despite being included as part of Tokio, select! does not interact with the runtime or need any kind of runtime support at all. It's an ordinary function that creates a future which waits for the first of several "child" futures to complete; similar functions are also provided in other prominent ecosystem crates besides Tokio and can be implemented in user code as well.

replies(1): >>45783027 #
6. duped ◴[] No.45781799[source]
It's always possible to create a future that is never polled, and this is a feature of Rusts zero cost abstraction for async/await. If tokio::select required tasks it would be a lot less useful.

This problem would have been avoided by taking the future by value instead of by reference.

7. quietbritishjim ◴[] No.45783027{3}[source]
> However, many of the use cases for select rely on the fact that it doesn't run all the tasks to completion.

That seems like a different requirement than "all arguments are tasks". If I understand it right (and quite possibly I don't), making them all tasks means that they are all polled and therefore continue progressing until they are dropped. It doesn't mean that select! would have to run them all the way to completion.

replies(2): >>45783389 #>>45784893 #
8. wrs ◴[] No.45783389{4}[source]
Doing select on tasks doesn’t really make sense semantically in the first place. Tasks are already getting polled by the executor. The purpose of select is to run some set of futures the executor doesn’t know about, until the first one of them completes. If you wanted to wait for one of a set of tasks to do something, you don’t need any additional polling, you’d just use something like a signal or a channel to communicate with them.
9. NobodyNada ◴[] No.45784893{4}[source]
I was sloppy with my wording, I should have said "it doesn't run all the futures to completion".

> making them all tasks means that they are all polled and therefore continue progressing until they are dropped. It doesn't mean that select! would have to run them all the way to completion.

This is exactly correct, but oftentimes the reason you're using select is because you don't want to run the futures all the way to completion. In my experience, the most common use cases for select are:

- An event handler loop that receives input from multiple channels. You could replace this with multiple tasks, one reading from each channel; but this could potentially mess with your design for queueing / backpressure -- often it's important for the loop to pause reading from the channels while processing the event.

- An operation that's run with a timeout, or a shutdown event from a controlling task. In this case I want the future to be dropped when the task is cancelled.

The example in the original post was the second case: an operation with a timeout. They wanted the operation to be cancelled when the timeout expired, but because the select statement borrowed the future, it only suspended the future instead of cancelling it. This is a very common code pattern when calling select! in a loop, when you want a future to be resumed instead of restarted on the next loop iteration -- it's very intentional that select! allows you to use either way, because you often want either behavior.