←back to thread

Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)
421 points bcantrill | 1 comments | | HN request time: 0.252s | source

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

1. ajross ◴[] No.45781548[source]
Trying to get my head around this. It seems like the "rootest" cause here is a paradigm clash between lock fairness and async futures.

A fair lock[1] is designed to wake up the longest-waiting task, since it got to the queue first and might otherwise be starved if the algorithm doesn't guarantee it gets to the head of the queue.

BUT CRITICALLY: a future isn't a task. It's not a thread, it's not guaranteed to "run". It's just a flag that gets set somewhere. But it can consume that wakeup nonetheless. So it's possible to "wake up"[2] a future that isn't actually being polled, and won't be, until something else that is waiting on the resource that just tried to wake it up.

I don't see that these concepts are ever going to work together. You can't have locks generating wakeup events that aren't consumed. If you're going to use them with async, you need to do something like a broadcast to guarantee that every waiter sees an event.

Stated differently: the lock is signalling an edge-triggered interrupt, but rust async demands level sensititivity.

[1] In one sense of fair. There are others, like "switch now" vs. "defer context switch", but that's not relevant here.

[2] Which doesn't actually wake anything up, thus the bug.