Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)

431 points bcantrill | 1 comments | 31 Oct 25 16:49 UTC | HN request time: 0.211s | source

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4

Show context

hitekker ◴[31 Oct 25 22:25 UTC] No.45777349[source]▶

>>45774086 (OP) #

Skimming through, this document feels thorough and transparent. Clearly, a hard lesson learned. The footnotes, in particular, caught my eye https://rfd.shared.oxide.computer/rfd/397#_external_referenc...

> Why does this situation suck? It’s clear that many of us haven’t been aware of cancellation safety and it seems likely there are many cancellation issues all over Omicron. It’s awfully stressful to find out while we’re working so hard to ship a product ASAP that we have some unknown number of arbitrarily bad bugs that we cannot easily even find. It’s also frustrating that this feels just like the memory safety issues in C that we adopted Rust to get away from: there’s some dynamic property that the programmer is responsible for guaranteeing, the compiler is unable to provide any help with it, the failure mode for getting it wrong is often undebuggable (by construction, the program has not done something it should have, so it’s not like there’s a log message or residual state you could see in a debugger or console), and the failure mode for getting it wrong can be arbitrarily damaging (crashes, hangs, data corruption, you name it). Add on that this behavior is apparently mostly undocumented outside of one macro in one (popular) crate in the async/await ecosystem and yeah, this is frustrating. This feels antithetical to what many of us understood to be a core principle of Rust, that we avoid such insidious runtime behavior by forcing the programmer to demonstrate at compile-time that the code is well-formed

replies(2): >>45778263 #>>45779161 #

rtpg ◴[01 Nov 25 00:42 UTC] No.45778263[source]▶

>>45777349 #

I guess one big question here is whether there's a higher layer abstraction that is available to wrap around patterns to avoid this.

It does feel like there's still generally possibilities of deadlocks in Rust concurrency right? I understand the feeling here that it feels like ... uhh... RAII-style _something_ should be preventing this, because it feels like statically we should be able to identify this issue in this simple case.

I still have a hard time understanding how much of this is incidental and how much of this is just downstream of the Rust/Tokio model not having enough to work on here.

replies(3): >>45778773 #>>45780272 #>>45780792 #

embedding-shape ◴[01 Nov 25 02:25 UTC] No.45778773[source]▶

>>45778263 #

> I guess one big question here is whether there's a higher layer abstraction that is available to wrap around patterns to avoid this.

Something like Actors, on top of Tokio, would be one way: https://ryhl.io/blog/actors-with-tokio/

replies(2): >>45779284 #>>45780277 #

smallstepforman ◴[01 Nov 25 04:36 UTC] No.45779284[source]▶

>>45778773 #

I love Actors and have used them professionally for over 6 years (C++). However to solve real world problems I have had to introduce “locks” to the Actor framework to support various scenarios. With my home-grown actor library, this was trivial to add, however for some 3rd party actor libraries, ideology is dominant and the devs refuse to add such a purity-breaking feature to their actor framework, and hence I cannot use their library for real-world code.

replies(2): >>45779753 #>>45779887 #

logicchains ◴[01 Nov 25 07:41 UTC] No.45779887[source]▶

>>45779284 #

What scenario requires locks that can't be solved by just having a single actor that owns the resource and controls access?

replies(2): >>45779985 #>>45784138 #

1. smallstepforman ◴[01 Nov 25 18:38 UTC] No.45784138[source]▶

>>45779887 #

Any scenario where you have to atomically update 2 actors. To use a simple analogy for illustrative purposes, transferring money between 2 accounts, you need to lock both actors before incrementing/decrementing. Because in the real world, the accounts can change from other pending parallel transactions and edits. Handshakes are very error prone. Lock the actor, do the critical transaction, unlock.

In a rationale world, this works. In a prejudiced world, devs fight against locks in actor models.

Hence why I had to roll my own …

↑