←back to thread

Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)
421 points bcantrill | 9 comments | | HN request time: 1.24s | source | bottom

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

1. Matthias247 ◴[] No.45777191[source]
As far as I remember from building these things with others within the async rust ecosystem (hey Eliza!) was that there was a certain tradeoff: if you wouldn’t be able to select on references, you couldn’t run into this issue. However you also wouldn’t be able run use select! in a while loop and try to acquire the same lock (or read from the same channel) without losing your position in the queue.

I fully agree that this and the cancellation issues discussed before can lead to surprising issues even to seasoned Rust experts. But I’m not sure what really can be improved under the main operating model of async rust (every future can be dropped).

But compared to working with callbacks the amount of surprising things is still rather low :)

replies(2): >>45777316 #>>45777340 #
2. octoberfranklin ◴[] No.45777316[source]
> However you also wouldn’t be able run use select! in a while loop and try to acquire the same lock (or read from the same channel) without losing your position in the queue.

No, just have select!() on a bunch of owned Futures return the futures that weren't selected instead of dropping them. Then you don't lose state. Yes, this is awkward, but it's the only logically coherent way. There is probably some macro voodoo that makes it ergonomic. But even this doesn't fix the root cause because dropping an owned Future isn't guaranteed to cancel it cleanly.

For the real root cause: https://news.ycombinator.com/item?id=45777234

replies(1): >>45777364 #
3. mycoliza ◴[] No.45777340[source]
Indeed, you are correct (and hi Matthias!). After we got to the bottom of this deadlock, my coworkers and I had one of our characteristic "how could we have prevented this?" conversations, and reached the somewhat sad conclusion that actually, there was basically nothing we could easily blame for this. All the Tokio primitives involved were working precisely as they were supposed to. The only thing that would have prevented this without completely re-designing Rust's async from the ground up would be to ban the use of `&mut future`s in `select!`...but that eliminates a lot of correct code, too. Not being able to do that would make it pretty hard to express a lot of things that many applications might reasonably want to express, as you described. I discussed this a bit in this comment[1] as well.

On the other hand, it also wasn't our coworker who had written the code where we found the bug who was to blame, either. It wasn't a case of sloppy programming; he had done everything correctly and put the pieces together the way you were supposed to. All the pieces worked as they were supposed to, and his code seemed to be using them correctly, but the interaction of these pieces resulted in a deadlock that it would have been very difficult for him to anticipate.

So, our conclusion was, wow, this just kind of sucks. Not an indictment of async Rust as a whole, but an unfortunate emergent behavior arising from an interaction of individually well-designed pieces. Just something you gotta watch out for, I guess. And that's pretty sad to have to admit.

[1] https://news.ycombinator.com/item?id=45776868

replies(2): >>45781082 #>>45781406 #
4. mycoliza ◴[] No.45777364[source]
> No, just have select!() on a bunch of owned Futures return the futures that weren't selected instead of dropping them. Then you don't lose state.

How does that prevent this kind of deadlock? If the owned future has acquired a mutex, and you return that future from the select so that it might be polled again, and the user assigns it to a variable, then the future that has acquired the mutex but has not completed is still not dropped. This is basically the same as polling an `&mut future`, but with more steps.

replies(1): >>45777467 #
5. octoberfranklin ◴[] No.45777467{3}[source]
> How does that prevent this kind of deadlock?

Like I said, it doesn't:

> even this doesn't fix the root cause because dropping an owned Future isn't guaranteed to cancel it cleanly.

It fixes this:

> However you also wouldn’t be able run use select! in a while loop and try to acquire the same lock (or read from the same channel) without losing your position in the queue.

If you want to fix the root cause, see https://news.ycombinator.com/item?id=45777234

6. kibwen ◴[] No.45781082[source]
> All the Tokio primitives involved were working precisely as they were supposed to. The only thing that would have prevented this without completely re-designing Rust's async from the ground up would be to ban the use of `&mut future`s in `select!`...but that eliminates a lot of correct code, too.

But it still suggests that `tokio::select` is too powerful. You don't need to get rid of `tokio::select`, you just need to consider creating a less powerful mechanism that doesn't risk exhibiting this problem. Then you could use that less powerful mechanism in the places where you don't need the full power of `tokio::select`, thereby reducing the possible places where this bug could arise. You don't need to get rid of the fully powerful mechanism, you just need to make it optional.

replies(1): >>45781622 #
7. imtringued ◴[] No.45781406[source]
You must guarantee forward progress inside your critical sections and that means your critical sections are guaranteed to finish. How hard is that to understand? From my perspective this situation was basically guaranteed to happen.

There is no real difference between a deadlock caused by a single thread acquiring the same non reentrant lock twice and a single thread with two virtual threads where the the first thread calls the code of the second thread inside the critical section. They are the same type of deadlock caused by the same fundamental problem.

>Remember too that the Mutex could be buried beneath several layers of function calls in different modules or packages. It could require looking across many layers of the stack at once to be able to see the problem.

That is a fundamental property of mutexes. Whenever you have a critical section, you must be 100% aware of every single line of code inside that critical section.

>There’s no one abstraction, construct, or programming pattern we can point to here and say "never do this". Still, we can provide some guidelines.

The programming pattern you're looking for is guaranteeing forward progress inside critical sections. Only synchronous code is allowed to be executed inside a critical section. The critical section must be as small as possible. It must never be interrupted, ever.

Sounds like a pain in the ass, right? That's right, locks are a pain in the ass.

8. tux3 ◴[] No.45781622{3}[source]
I feel like select!() is a good case study because the common future timeout use-case maps pretty closely to a select!(), so there is only so much room to weaken it.

The ways I can think of for making select!() safer all involve runtime checks and allocations (possibly this is just a failure of my imagination!). But if that's the case, I would find it bothersome if our basic async building blocks like select/timeout in practice turn out to require more expensive runtime checks or allocations to be safe.

We have a point in the async design space where we pay a complexity price, but in exchange we get really neat zero-cost futures. But I feel like we only get our money's worth if we can actually statically prove that correct use won't deadlock, without the expensive runtime checks! Otherwise, can we afford to spend all this complexity budget?

The implementation of select!() does feel way too powerful in a way (it's a whole mini scheduler that creates implicit future dependencies hidden from the rest of the executor, and then sometimes this deadlocks!). But the need is pretty foundational, it shows up everywhere as a building block.

replies(1): >>45781785 #
9. kibwen ◴[] No.45781785{4}[source]
It feels to me like there's plenty of design space to explore. Sure, it's possible to view "selection" as a basic building block, but even that is insufficiently precise IMO. There's a reason that Javascript provides all of Promise.any and Promise.all and Promise.allSettled and Promise.race. Selection isn't just a single building block, it's an entire family of building blocks with distinct semantics.