←back to thread

Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)
421 points bcantrill | 10 comments | | HN request time: 0.745s | source | bottom

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

1. dvt ◴[] No.45776550[source]
> &mut future1 is dropped, but this is just a reference and so has no effect. Importantly, the future itself (future1) is not dropped.

There's a lot of talk about Rust's await implementation, but I don't really think that's the issue here. After all, Rust doesn't guarantee convergence. Tokio, on the other hand (being a library that handles multi-threading), should (at least when using its own constructs, e.g. the `select!` macro).

So, since the crux of the problem is the `tokio::select!` macro, it seems like a pretty clear tokio bug. Side note, I never looked at it before, but the macro[1] is absolutely hideous.

[1] https://docs.rs/tokio/1.34.0/src/tokio/macros/select.rs.html

replies(4): >>45776605 #>>45776791 #>>45776868 #>>45777449 #
2. raggi ◴[] No.45776605[source]
i forget if this part unwinds to the exact same place, but some of this kind of design constraint in tokio stems from the much earlier language capabilities and is prohibitive to adjust without breaking the user ecosystem.

one of the key advertised selling points in some of the other runtimes was specifically around behavior of tasks on drop of their join handles for example, for reasons closely related to this post.

3. oconnor663 ◴[] No.45776791[source]
There's nothing `select!` could do here to force `future1` to drop, because it doesn't receive ownership of `future1`. If we wanted to force this, we'd have to forbid `select!` from polling futures by reference, but that's a pretty fundamental capability that we often rely on to `select!` in a loop for example. The blanket `impl<F> Future for &mut F where F: Future ...` isn't a Tokio thing either; that's in the standard library.
replies(1): >>45781430 #
4. mycoliza ◴[] No.45776868[source]
What could `tokio::select!` do differently here to prevent bugs like this?

In the case of `select!`, it is a direct consequence of the ability to poll a `&mut` reference to a future in a `select!` arm, where the future is not dropped should another future win the "race" of the select. This is not really a choice Tokio made when designing `select!`, but is instead due to the existence of implementations of `Future` for `&mut T: Future + Unpin`[1] and `Pin<T: Future>`[2] in the standard library.

Tokio's `select!` macro cannot easily stop the user from doing this, and, furthermore, the fact that you can do this is useful --- there are many legitimate reasons you might want to continue polling a future if another branch of the select completes first. It's desirable to be able to express the idea that we want to continually poll drive one asynchronous operation to completion while periodically checking if some other thing has happened and taking action based on that, and then continue driving forward the ongoing operation. That was precisely what the code in which we found the bug was doing, and it is a pretty reasonable thing to want to do; a version of the `select!` macro which disallows that would limit its usefulness. The issue arises specifically from the fact that the `&mut future` has been polled to a state in which it has acquired, but not released, a shared lock or lock-like resource, and then another arm of the `select!` completes first and the body of that branch runs async code that also awaits that shared resource.

If you can think of an API change which Tokio could make that would solve this problem, I'd love to hear it. But, having spent some time trying to think of one myself, I'm not sure how it would be done without limiting the ability to express code that one might reasonably want to be able to write, and without making fundamental changes to the design of Rust async as a whole.

[1] https://doc.rust-lang.org/stable/std/future/trait.Future.htm... [2]: https://doc.rust-lang.org/stable/std/future/trait.Future.htm...

replies(3): >>45777894 #>>45778241 #>>45778288 #
5. dap ◴[] No.45777449[source]
(author here)

Although the design of the `tokio::select!` macro creates ways to run into this behavior, I don't believe the problem is specific to `tokio`. Why wouldn't the example from the post using Streams happen with any other executor?

replies(1): >>45780559 #
6. amluto ◴[] No.45777894[source]
I can imagine an alternate universe in which you cannot do:

1. Create future A.

2. Poll future A at least once but not provably poll it to completion and also not drop it. This includes selecting it.

3. Pause yourself by awaiting anything that does not involve continuing to poll A.

I’m struggling a bit to imagine the scenario in which it makes sense to pause a coroutine that you depend on in the middle like this. But I also don’t immediately see a way to change a language like Rust to reliably prevent doing this without massively breaking changes. See my other comment :)

7. grogers ◴[] No.45778241[source]
I'm not familiar with tokio, but I am familiar with folly coro in C++ which is similiar-ish. You cannot co_await a folly::coro::Task by reference, you must move it. It seems like that prevents this bug. So maybe select! is the low level API and a higher level (i.e. safer) abstraction can be built on top?
8. rtpg ◴[] No.45778288[source]
A meta-idea I have: look at all usages of `select!` with `&mut future`s in the code, and see if there are maybe 4 or 5 patterns that emerge. With that it might be possible to say "instead of `select!` use `poll_continuing!` or `poll_first_up!` or `poll_some_other_common_pattern!`".

It feels like a lot of the way Rust untangles these tricky problems is by identifying slightly more contextful abstractions, though at the cost of needing more scratch space in the mind for various methods

9. dvt ◴[] No.45780559[source]
First of all, great write-up! Had a blast reading it :) I think there's a difference between a language giving you a footgun and a library giving you a footgun. Libraries, by definition, are supposed to be as user-friendly as possible.

For example, I can just do `loop { }` which the language is perfectly okay with letting me do anywhere in my code (and essentially hanging execution). But if I'm using a library and I'm calling `innocuous()` and there's a `loop { }` buried somewhere in there, that is (in my opinion) the library's responsibility.

N.B. I don't know enough about tokio's internals to suggest any changes and don't want to pretend like I'm an expert, but I do think this caveat should be clearly documented and a "safe" version of `select!` (which wouldn't work with references) should be provided.

10. kibwen ◴[] No.45781430[source]
Surely not every use of `select!` needs this ability. If you can design a more restrictive interface that makes correctness easier to determine, then you should use that interface where you can, and reserve `select!` for only those cases where you can't.