←back to thread

Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)
427 points bcantrill | 2 comments | | HN request time: 0.433s | source

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

Show context
mleonhard ◴[] No.45779198[source]
> // Start a background task that takes the lock and holds it for a few seconds.

Holding a lock while waiting for IO can destroy a system's performance. With async Rust, we can prevent this by making the MutexGuard !Send, so it cannot be held across an await. Specifically, because it is !Send, it cannot be stored in the Future [2], so it must be dropped immediately, freeing the lock. This also prevents Futurelock deadlock.

This is how I wrote safina::sync::Mutex [0]. I did try to make it Send, like Tokio's MutexGuard, but stopped when I realized that it would become very complicated or require unsafe.

> You could imagine an unfair Mutex that always woke up all waiters and let them race to grab the lock again. That would not suffer from risk of futurelock, but it would have the thundering herd problem plus all the liveness issues associated with unfair synchronization primitives.

Thundering herd is when clients overload servers. This simple Mutex has O(n^2) runtime: every task must acquire and release the mutex, which adds all waiting tasks to the scheduler queue. In practice, scheduling a task is very fast (~600ns). As long as polling the lock-mutex-future is fast and you have <500 waiting tasks, then the O(n^2) runtime is fine.

Performance is hard to predict. I wrote Safina using the simplest possible implementations and assumed they would be slow. Then I wrote some micro-benchmarks and found that some parts (like the async Mutex) actually outperform Tokio's complicated versions [1]. I spent days coding optimizations that did not improve performance (work stealing) or even reduced performance (thread affinity). Now I'm hesitant to believe assumptions and predictions about performance, even if they are based on profiling data.

[0] https://docs.rs/safina/latest/safina/sync/struct.MutexGuard....

[1] https://docs.rs/safina/latest/safina/index.html#benchmark

[2] Multi-threaded async executors require futures to be Send.

replies(3): >>45780445 #>>45781201 #>>45783200 #
dvratil ◴[] No.45780445[source]
I would guess this is just to make the explanation of the bug easier.

In real world, the futurelock could occur even with very short locks, it just wouldn't be so deterministic. Having a minimal reproducer that you have to run a thousand times and it will maybe futurelock doesn't really make for a good example :)

replies(1): >>45781488 #
1. imtringued ◴[] No.45781488[source]
>In real world, the futurelock could occur even with very short locks, it just wouldn't be so deterministic.

You have to explain the problem properly then. The problem here has nothing to do with duration whatsoever so don't bring that up. The problem here is that if you acquire a lock, you're inside a critical section. Critical sections have a programming paradigm that is equivalent to writing unsafe Rust. You're not allowed to panic inside unsafe Rust or inside critical sections. It's simply not allowed.

You're also not allowed to interrupt the critical section by something that does not have a hard guarantee that it will finish. This rules out await inside the critical section. You're not allowed to do await. It's simply not allowed. The only thing you're allowed to do is execute an instruction that guarantees that N-1 instructions are left to be executed, where N is a finite number. Alternatively you do the logical equivalent. You have a process that has a known finite bound on how long it will take to execute and you are waiting for that external process.

After that process has finished, you release the lock. Then you return to the scheduler and execute the next future. The next future cannot be blocked because the lock has already been released. It's simply impossible.

You now have to explain how the impossible happened. After all, by using the lock you've declared that you took all possible precautions to avoid interrupting the critical section. If you did not, then you deserve any bugs coming your way. That's just how locks are.

replies(1): >>45781828 #
2. dap ◴[] No.45781828[source]
I think you misunderstand the problem. The only purpose of the sleep in this example is to control interleaving of execution to ensure the problem happens. Here's a version where the background task (the initial lock holder) only runs a bounded number of instructions with the lock held, just as you suggest:

https://play.rust-lang.org/?version=stable&mode=debug&editio...

It still futurelocks.

> After that process has finished, you release the lock. Then you return to the scheduler and execute the next future. The next future cannot be blocked because the lock has already been released. It's simply impossible.

This is true with threads and with tasks that only ever poll futures sequentially. It is not true in the various cases mentioned in this RFD (notably `tokio::select!`, but also others). Intuitively: when you have one task polling on multiple futures concurrently, you're essentially adding another layer to the scheduler (kernel thread scheduler, tokio task scheduler, now some task is acting as its own future scheduler). The problem is it's surprisingly easy to (1) not realize that and (2) accidentally have that "scheduler" not poll the next runnable future and then get stuck, just like if the kernel scheduler didn't wake up a runnable thread.