
Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)
421 points by bcantrill | 1 comment

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, a program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation, and the conditions required to hit it can be mitigated relatively easily. Still, this is a pretty deep issue -- and one that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

lowbloodsugar No.45780047
I know this is going to sound trite, but “don’t do that”. It’s no different than deciding to poll the win32 event queue inside a method you executed in response to polling the event queue. Nested shit is always going to cause a bug. I guess each new generation just has to learn.
dap No.45781912
Don't do ... what, exactly? The RFD answers this more precisely and provides suggestions for alternatives. But it's not that simple, because the things that can cause this are all common patterns individually; it's only their confluence (which can be spread across layers of the program) that introduces the problem. In our case it wasn't a Mutex but an mpsc channel (which was working correctly! it just got very briefly saturated), and it was 3-4 modules lower in the stack than the code with the `tokio::select!` that induced this.
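A minimal, self-contained sketch of the shape described above (a reconstruction, not Oxide's code or the RFD's own example): it substitutes a `tokio::sync::Mutex` for the mpsc channel for brevity, but the ingredients are the same -- a future polled by reference inside `tokio::select!`, a losing branch, and a later await that depends on the resource the now-unpolled future has been handed.

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;
use tokio::time::sleep;

async fn use_lock(label: &str, lock: Arc<Mutex<()>>) {
    let _guard = lock.lock().await;
    println!("{label}: got the lock");
    sleep(Duration::from_millis(10)).await;
}

#[tokio::main]
async fn main() {
    let lock = Arc::new(Mutex::new(()));

    // A background task holds the lock briefly, so the futures below
    // start out queued behind it (standing in for the briefly
    // saturated channel in dap's case).
    let bg = {
        let lock = Arc::clone(&lock);
        tokio::spawn(async move {
            let _guard = lock.lock().await;
            sleep(Duration::from_millis(50)).await;
        })
    };
    sleep(Duration::from_millis(5)).await;

    // Polled by reference: when the other branch wins, fut1 is NOT
    // dropped -- it simply stops being polled, while keeping its
    // place in the lock's fair (FIFO) wait queue.
    let mut fut1 = Box::pin(use_lock("fut1", Arc::clone(&lock)));

    tokio::select! {
        _ = &mut fut1 => println!("fut1 won"),
        _ = sleep(Duration::from_millis(20)) => {
            // When the background task releases the lock, it is handed
            // to fut1's waiter -- but nothing ever polls fut1 again, so
            // the lock is never taken and never released: this await
            // hangs forever.
            println!("timeout won; now waiting on the same lock");
            let _guard = lock.lock().await; // futurelock
            println!("unreachable");
        }
    }

    let _ = bg.await; // never reached -- main is futurelocked above
}
```

Run with tokio's `macros`, `rt-multi-thread`, `sync`, and `time` features enabled, this prints the timeout message and then hangs, even though nothing panics and every component behaves correctly in isolation -- which is what makes the bug so hard to spot when the pieces are spread across module boundaries.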