
Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)
427 points | bcantrill | 1 comment

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

Matthias247 ◴[] No.45777191[source]
As far as I remember from building these things with others within the async Rust ecosystem (hey Eliza!), there was a certain tradeoff: if you couldn't select on references, you couldn't run into this issue. However, you also wouldn't be able to use select! in a while loop and try to acquire the same lock (or read from the same channel) without losing your position in the queue.
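To make that tradeoff concrete, here is a rough sketch (hypothetical names; tokio's fair `Mutex` assumed) of the pattern that depends on selecting over a `&mut` future:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;

// Polling the *same* acquisition future across iterations (via `&mut`)
// keeps this task's place in the Mutex's FIFO queue. If we instead wrote
// `guard = lock.lock()` inside the loop, every iteration would create a
// fresh future that re-joins the back of the queue.
async fn acquire_while_ticking(lock: Arc<Mutex<u64>>) {
    let mut interval = tokio::time::interval(Duration::from_millis(10));
    let acquire = lock.lock();
    tokio::pin!(acquire);

    let mut guard = loop {
        tokio::select! {
            guard = &mut acquire => break guard,
            _ = interval.tick() => {
                // Periodic work; `acquire` keeps its position in the queue.
            }
        }
    };
    *guard += 1;
}
```

Banning `&mut` futures in select! would rule out exactly this kind of code.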

I fully agree that this and the cancellation issues discussed before can lead to surprising problems, even for seasoned Rust experts. But I'm not sure what can really be improved under the main operating model of async Rust (every future can be dropped).

But compared to working with callbacks, the number of surprises is still rather low :)

replies(2): >>45777316 #>>45777340 #
mycoliza ◴[] No.45777340[source]
Indeed, you are correct (and hi Matthias!). After we got to the bottom of this deadlock, my coworkers and I had one of our characteristic "how could we have prevented this?" conversations, and reached the somewhat sad conclusion that actually, there was basically nothing we could easily blame for this. All the Tokio primitives involved were working precisely as they were supposed to. The only thing that would have prevented this without completely re-designing Rust's async from the ground up would be to ban the use of `&mut future`s in `select!`...but that eliminates a lot of correct code, too. Not being able to do that would make it pretty hard to express a lot of things that many applications might reasonably want to express, as you described. I discussed this a bit in this comment[1] as well.
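For readers who haven't read the RFD yet, a heavily simplified sketch of the shape of the problem (hypothetical names, not the actual Omicron code) looks something like this:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;

async fn increment(lock: Arc<Mutex<u64>>) {
    *lock.lock().await += 1;
}

#[tokio::main]
async fn main() {
    let lock = Arc::new(Mutex::new(0));

    // A background task holds the lock briefly, so `fut1` below has to
    // join the Mutex's wait queue instead of acquiring it immediately.
    let bg = tokio::spawn({
        let lock = Arc::clone(&lock);
        async move {
            let _guard = lock.lock().await;
            tokio::time::sleep(Duration::from_millis(100)).await;
        }
    });
    tokio::time::sleep(Duration::from_millis(10)).await;

    // The first poll of `fut1` puts it in the Mutex's FIFO queue.
    let fut1 = increment(Arc::clone(&lock));
    tokio::pin!(fut1);

    tokio::select! {
        _ = &mut fut1 => {}
        _ = tokio::time::sleep(Duration::from_millis(20)) => {
            // This branch wins. `fut1` is no longer being polled, but it
            // still sits at the head of the Mutex's queue. When the
            // background task unlocks, the Mutex hands ownership to `fut1`,
            // which will never run again because this task is now blocked
            // here, waiting on the same Mutex behind it. Futurelock.
            let _guard = lock.lock().await;
        }
    }

    bg.await.unwrap(); // never reached: the program hangs above
}
```

Every piece of this is individually reasonable; it's only the combination that wedges.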

On the other hand, our coworker who had written the code where we found the bug wasn't to blame, either. It wasn't a case of sloppy programming; he had done everything correctly and put the pieces together the way you're supposed to. All the pieces worked as intended, and his code used them correctly, but the interaction of those pieces resulted in a deadlock that would have been very difficult for him to anticipate.

So, our conclusion was, wow, this just kind of sucks. Not an indictment of async Rust as a whole, but an unfortunate emergent behavior arising from an interaction of individually well-designed pieces. Just something you gotta watch out for, I guess. And that's pretty sad to have to admit.

[1] https://news.ycombinator.com/item?id=45776868

replies(2): >>45781082 #>>45781406 #
1. imtringued ◴[] No.45781406[source]
You must guarantee forward progress inside your critical sections, which means your critical sections must be guaranteed to finish. How hard is that to understand? From my perspective, this situation was basically guaranteed to happen.

There is no real difference between a deadlock caused by a single thread acquiring the same non-reentrant lock twice and a single thread running two virtual threads, where the first virtual thread calls into the second one's code inside the critical section. They are the same type of deadlock, caused by the same fundamental problem.
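In std terms, the single-thread version of that deadlock is just this (a sketch; std leaves the exact behavior unspecified, documenting it as deadlock or panic):

```rust
use std::sync::Mutex;

fn main() {
    let lock = Mutex::new(0u32);

    let _first = lock.lock().unwrap();
    // Same thread, same non-reentrant lock: this second acquisition can
    // never succeed. Futurelock has the same shape, except the two
    // "acquirers" are two futures driven by one task rather than two
    // call sites on one thread.
    let _second = lock.lock().unwrap();
}
```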

>Remember too that the Mutex could be buried beneath several layers of function calls in different modules or packages. It could require looking across many layers of the stack at once to be able to see the problem.

That is a fundamental property of mutexes. Whenever you have a critical section, you must be 100% aware of every single line of code inside that critical section.

>There’s no one abstraction, construct, or programming pattern we can point to here and say "never do this". Still, we can provide some guidelines.

The programming pattern you're looking for is guaranteeing forward progress inside critical sections. Only synchronous code is allowed to be executed inside a critical section. The critical section must be as small as possible. It must never be interrupted, ever.
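A sketch of what that discipline looks like in practice (hypothetical names; a synchronous `std::sync::Mutex` with the guard dropped before any await point):

```rust
use std::sync::Mutex;

struct Counter {
    inner: Mutex<u64>,
}

impl Counter {
    async fn bump_and_report(&self) {
        // The critical section: synchronous, small, and with no await
        // points, so it cannot be suspended while holding the lock.
        let new_value = {
            let mut guard = self.inner.lock().unwrap();
            *guard += 1;
            *guard
        }; // guard dropped here, before any await

        // Anything asynchronous happens outside the critical section.
        report(new_value).await;
    }
}

async fn report(value: u64) {
    println!("counter is now {value}");
}
```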

Sounds like a pain in the ass, right? That's right, locks are a pain in the ass.