
Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)
421 points by bcantrill | 7 comments

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

1. mleonhard ◴[] No.45779198[source]
> // Start a background task that takes the lock and holds it for a few seconds.

Holding a lock while waiting for IO can destroy a system's performance. With async Rust, we can prevent this by making the MutexGuard !Send, so it cannot be held across an await. Because it is !Send, it cannot be stored in the Future [2] and must be dropped before the next await point, freeing the lock. This also prevents Futurelock deadlock.

This is how I wrote safina::sync::Mutex [0]. I did try to make it Send, like Tokio's MutexGuard, but stopped when I realized that it would become very complicated or require unsafe.
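
For illustration, here's a minimal sketch of how the compiler enforces this when the future has to be Send. It uses std::sync::Mutex (whose guard is also !Send) rather than Safina's, but the mechanism is the same:

    use std::sync::{Arc, Mutex};
    use std::time::Duration;

    // std::sync::MutexGuard is !Send, so a future that keeps it alive
    // across an .await cannot be Send either.
    async fn hold_across_await(lock: Arc<Mutex<u64>>) {
        let guard = lock.lock().unwrap();
        tokio::time::sleep(Duration::from_secs(1)).await; // guard still alive here
        drop(guard);
    }

    #[tokio::main]
    async fn main() {
        let lock = Arc::new(Mutex::new(0));
        // Uncommenting this fails to compile on the multi-threaded runtime:
        // "future cannot be sent between threads safely"
        // tokio::spawn(hold_across_await(lock.clone()));
        hold_across_await(lock).await; // fine: this future is never required to be Send
    }

The guard can still be used freely between awaits; it's the Send bound on spawned futures [2] that does the enforcing.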

> You could imagine an unfair Mutex that always woke up all waiters and let them race to grab the lock again. That would not suffer from risk of futurelock, but it would have the thundering herd problem plus all the liveness issues associated with unfair synchronization primitives.

Thundering herd is when a crowd of woken waiters overloads a server or resource all at once. This simple Mutex has O(n^2) runtime: each release wakes every waiting task, adding them all back to the scheduler queue, so n tasks acquiring the lock once each schedule roughly n + (n-1) + ... + 1 wakeups. In practice, scheduling a task is very fast (~600ns). As long as polling the lock-mutex future is fast and you have <500 waiting tasks, the O(n^2) runtime is fine.
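
For concreteness, a toy sketch (my own, not Safina's async Mutex) of the wake-everyone-and-race lock described in the quote, built on tokio::sync::Notify:

    use tokio::sync::Notify;

    // Toy unfair lock: unlock() wakes every waiter and they all race to
    // re-acquire. There is no wait queue, so no future can be stranded
    // holding a permit it was never polled to receive (no futurelock),
    // but n waiters cost O(n^2) wakeups in total.
    struct RacyLock {
        locked: std::sync::Mutex<bool>,
        notify: Notify,
    }

    impl RacyLock {
        fn new() -> Self {
            Self { locked: std::sync::Mutex::new(false), notify: Notify::new() }
        }

        async fn lock(&self) {
            loop {
                // Register for notification *before* checking the flag, so an
                // unlock that happens in between cannot be missed.
                let notified = self.notify.notified();
                tokio::pin!(notified);
                notified.as_mut().enable();
                {
                    let mut locked = self.locked.lock().unwrap();
                    if !*locked {
                        *locked = true;
                        return;
                    }
                }
                notified.await; // sleep until the next unlock wakes everyone
            }
        }

        fn unlock(&self) {
            *self.locked.lock().unwrap() = false;
            self.notify.notify_waiters(); // the thundering herd races again
        }
    }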

Performance is hard to predict. I wrote Safina using the simplest possible implementations and assumed they would be slow. Then I wrote some micro-benchmarks and found that some parts (like the async Mutex) actually outperform Tokio's complicated versions [1]. I spent days coding optimizations that did not improve performance (work stealing) or even reduced performance (thread affinity). Now I'm hesitant to believe assumptions and predictions about performance, even if they are based on profiling data.

[0] https://docs.rs/safina/latest/safina/sync/struct.MutexGuard....

[1] https://docs.rs/safina/latest/safina/index.html#benchmark

[2] Multi-threaded async executors require futures to be Send.

replies(3): >>45780445 #>>45781201 #>>45783200 #
2. dvratil ◴[] No.45780445[source]
I would guess this is just to make the explanation of the bug easier.

In the real world, futurelock could occur even with very short lock hold times; it just wouldn't be so deterministic. A minimal reproducer that you have to run a thousand times before it maybe futurelocks doesn't really make for a good example :)

replies(1): >>45781488 #
3. yxhuvud ◴[] No.45781201[source]
Work stealing is more a technique to function better when the architecture is pessimal (think mixing slow and fast tasks in one queue) than something that makes things go faster in general. It also tends to shuffle complexity around a bit, in ways that are sometimes nice.

Same thing with task preemption, though that one has less organisational impact.

In general, getting something to perform well enough on specific tasks is a lot easier than performing well enough on tasks in general. At the same time, most tasks have kinda specific needs when you start looking at them.

4. imtringued ◴[] No.45781488[source]
> In the real world, futurelock could occur even with very short lock hold times; it just wouldn't be so deterministic.

You have to explain the problem properly then. The problem here has nothing to do with duration whatsoever, so don't bring that up. The problem is that if you acquire a lock, you're inside a critical section. Critical sections demand the same discipline as unsafe Rust: just as you're not allowed to panic inside unsafe code, you're not allowed to panic inside a critical section. It's simply not allowed.

You're also not allowed to interrupt the critical section with anything that does not have a hard guarantee of finishing. That rules out await inside the critical section; you're not allowed to await, it's simply not allowed. The only thing you're allowed to do is execute an instruction that guarantees N-1 instructions remain to be executed, where N is finite, or the logical equivalent: wait on an external process with a known, finite bound on how long it will take.

After that process has finished, you release the lock. Then you return to the scheduler and execute the next future. The next future cannot be blocked because the lock has already been released. It's simply impossible.

You now have to explain how the impossible happened. After all, by using the lock you've declared that you took all possible precautions to avoid interrupting the critical section. If you did not, then you deserve any bugs coming your way. That's just how locks are.

replies(1): >>45781828 #
5. dap ◴[] No.45781828{3}[source]
I think you misunderstand the problem. The only purpose of the sleep in this example is to control interleaving of execution to ensure the problem happens. Here's a version where the background task (the initial lock holder) only runs a bounded number of instructions with the lock held, just as you suggest:

https://play.rust-lang.org/?version=stable&mode=debug&editio...

It still futurelocks.

> After that process has finished, you release the lock. Then you return to the scheduler and execute the next future. The next future cannot be blocked because the lock has already been released. It's simply impossible.

This is true with threads and with tasks that only ever poll futures sequentially. It is not true in the various cases mentioned in this RFD (notably `tokio::select!`, but also others). Intuitively: when you have one task polling on multiple futures concurrently, you're essentially adding another layer to the scheduler (kernel thread scheduler, tokio task scheduler, now some task is acting as its own future scheduler). The problem is it's surprisingly easy to (1) not realize that and (2) accidentally have that "scheduler" not poll the next runnable future and then get stuck, just like if the kernel scheduler didn't wake up a runnable thread.
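
A minimal sketch of that shape (my own boiled-down version, not the playground example above), assuming tokio's fair Mutex hands the lock to the first queued waiter as described in the RFD:

    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::Mutex;
    use tokio::time::sleep;

    async fn take_lock(name: &str, lock: Arc<Mutex<()>>) {
        let _guard = lock.lock().await;
        println!("{name} got the lock");
    }

    #[tokio::main]
    async fn main() {
        let lock = Arc::new(Mutex::new(()));

        // A background task briefly holds the lock, so the futures below
        // start out queued behind it in the mutex's wait list.
        let guard = lock.clone().lock_owned().await;
        tokio::spawn(async move {
            sleep(Duration::from_millis(50)).await;
            drop(guard);
        });

        // fut1 is polled inside select! and joins the mutex wait queue...
        let mut fut1 = Box::pin(take_lock("fut1", lock.clone()));
        tokio::select! {
            _ = &mut fut1 => {}
            _ = sleep(Duration::from_millis(1)) => {
                // ...but this branch wins, and the task now awaits a second
                // future that needs the same lock. fut1 is never polled again,
                // so when the mutex hands it the lock, nothing releases it,
                // and this await never completes: futurelock.
                take_lock("fut2", lock.clone()).await;
            }
        }
    }

Await fut1 and fut2 sequentially instead of juggling them inside one task and the hang disappears, which is exactly the point about tasks that only ever poll futures sequentially.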

6. ufmace ◴[] No.45783200[source]
Considering this issue also made me think: maybe the real footgun here is the async mutex. A better "rule" to avoid this issue might be something like: don't use the tokio async mutex by default just because it's there and you're in an async function; instead, default to a sync mutex that errors when held across awaits, and think very hard about what you're really doing before you switch to the async one.
replies(1): >>45785979 #
7. ufmace ◴[] No.45785979[source]
Actually, I think I might be a little misguided here - I was conflating two distinctions: a mutex with an awaitable lock method versus a blocking one, and a mutex whose LockGuard is Send and can therefore be held across other await points.

To clarify, I do still think it's probably wise to prefer a mutex whose LockGuard is not Send. If you're in an async context, though, it seems clearly preferable to use a mutex that lets you await on lock instead of possibly blocking. Looks like that's what Safina gives you.

It does bring to mind the question, though: does it really make sense to call all of these things Mutexes? Most Mutexes, including the one in std, seem relatively simplistic, with no provision for exactly what happens when multiple threads/tasks are waiting to acquire the lock. It's as if they're designed for the case where it's rare for multiple threads to actually need the thing at once, but we have to guard against it just to be certain. The case where the resource is in high demand by a bunch of threads, where we expect a lot of threads to spend a lot of time waiting for the lock and it actually matters which requesters get it in what order, seems different enough that it maybe ought to have a different name, with more flexibility and choice in the algorithm used to control the lock order.