
Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)
427 points | bcantrill | 2 comments

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

1. imtringued No.45781104
Based on the description:

>This RFD describes futurelock: a type of deadlock where a resource owned by Future A is required for another Future B to proceed, while the Task responsible for both Futures is no longer polling A. Futurelock is a particularly subtle risk in writing asynchronous Rust.

I was honestly wondering how you could possibly cause this in any sane code base. How can an async task hold a lock and keep it held? It sounds illogical, because critical sections are meant to be short and never interrupted by anything. You're also never allowed to panic, which means you have to write no-panic Rust code inside a critical section. Critical sections are very similar to unsafe blocks, but with the caveat that they cannot cause a complete takeover of your application.

So how exactly did they bring about the impossible? They put an await call inside the critical section. The part of the code base that is not allowed to be subject to arbitrary delays. Massive facepalm.

When you invoke await inside a critical section, you're essentially saying "I hereby accept that this critical section will last an indeterminate amount of time, I am fully aware of what the code I'm calling is doing, and I am willing to accept the possibility that the release of the lock may never come, even if my own code is one hundred percent correct, since the await call may contain an explicit or implicit deadlock."

2. dap No.45781971
> So how exactly did they bring about the impossible? They put an await call inside the critical section. The part of the code base that is not allowed to be subject to arbitrary delays. Massive facepalm.

I'm not sure where you got the impression that the example code was where we found the problem. That's a minimal reproducer trying to explain the problem from first principles because most people look at that code and think "that shouldn't deadlock". It uses a Mutex because people are familiar with Mutexes and `sleep` just to control the interleaving of execution. The RFD shows the problem in other examples without Mutexes. Here's a reproducer that futurelocks even though nobody uses `await` with the lock held: https://play.rust-lang.org/?version=stable&mode=debug&editio...

> I was honestly wondering how you could possibly cause this in any sane code base.

The actual issue is linked at the very top of the RFD. In our case, we had a bounded mpsc channel used to send messages to an actor running in a separate task. That actor was working fine. But the channel did become briefly saturated (i.e., at capacity) at a point where someone tried to send on it via a `tokio::select!` similar to the one in the example.
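
To make the mechanism concrete, here is a minimal std-only sketch. It is not the Omicron code, the RFD's reproducer, or Tokio's implementation. It assumes a resource that is handed out in strict FIFO order, and every name in it (`TicketLock`, `Acquire`, `Guard`, `poll_once`, `demo`) is hypothetical. Futures are polled by hand, so the point where the task stops polling one of them is explicit: Future A queues for the resource when it is polled once (as a `select!` branch would be), then the task moves on and only polls Future B, which queues behind A.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical FIFO lock: tickets are served strictly in order.
/// Toy model only; it does not handle a queued future being dropped.
#[derive(Default)]
struct TicketLock {
    next_ticket: Cell<u64>,
    now_serving: Cell<u64>,
}

/// Future returned by `lock()`; it takes its ticket on first poll.
struct Acquire {
    lock: Rc<TicketLock>,
    ticket: Option<u64>,
}

/// A held lock; dropping it hands the lock to the next ticket in line.
struct Guard {
    lock: Rc<TicketLock>,
}

impl Drop for Guard {
    fn drop(&mut self) {
        self.lock.now_serving.set(self.lock.now_serving.get() + 1);
    }
}

fn lock(l: &Rc<TicketLock>) -> Acquire {
    Acquire { lock: Rc::clone(l), ticket: None }
}

impl Future for Acquire {
    type Output = Guard;
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Guard> {
        let lock = Rc::clone(&self.lock);
        // Join the queue the first time we are polled.
        let ticket = *self.ticket.get_or_insert_with(|| {
            let t = lock.next_ticket.get();
            lock.next_ticket.set(t + 1);
            t
        });
        if lock.now_serving.get() == ticket {
            Poll::Ready(Guard { lock })
        } else {
            Poll::Pending
        }
    }
}

/// Waker that does nothing; enough for polling by hand.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn poll_once<F: Future + Unpin>(f: &mut F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    Pin::new(f).poll(&mut cx)
}

/// Returns (B stuck while A is unpolled, B proceeds once A is polled).
fn demo() -> (bool, bool) {
    let l = Rc::new(TicketLock::default());
    // Some other party takes ticket 0 and holds the lock for now.
    let mut holder = lock(&l);
    let guard = match poll_once(&mut holder) {
        Poll::Ready(g) => g,
        Poll::Pending => unreachable!("lock starts free"),
    };
    // Future A is polled once and queues with ticket 1.
    let mut a = lock(&l);
    assert!(poll_once(&mut a).is_pending());
    // The task moves on and now only polls Future B, which gets ticket 2.
    let mut b = lock(&l);
    assert!(poll_once(&mut b).is_pending());
    // The holder releases: the lock is handed to ticket 1 (A), which nobody polls.
    drop(guard);
    let stuck = (0..1000).all(|_| poll_once(&mut b).is_pending());
    // Polling A lets it take and release the lock; only then can B proceed.
    drop(poll_once(&mut a));
    let unstuck = poll_once(&mut b).is_ready();
    (stuck, unstuck)
}
```

Calling `demo()` returns `(true, true)`: B never becomes ready, however often it is polled, while A sits unpolled at the head of the queue, and B succeeds as soon as A is polled again. Nothing in the sketch awaits while holding the lock; the stall comes entirely from A owning its place in line while the task has stopped polling it.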