
Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)
421 points | bcantrill | 27 comments

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4
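
For readers who want the shape of the bug before opening the RFD, here is a minimal sketch of a futurelock. It follows the pattern the RFD describes, but the names, timings, and structure below are illustrative rather than the RFD's own reproduction: a `select!` abandons a future that is next in line for Tokio's fair mutex, and then the same task waits on that mutex itself.

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;
use tokio::time::sleep;

// Hypothetical helper: takes the (fair, FIFO) tokio mutex and does some work.
async fn do_stuff(lock: &Mutex<u64>) {
    let mut guard = lock.lock().await; // queues behind any current holder
    *guard += 1;
}

#[tokio::main]
async fn main() {
    let lock = Arc::new(Mutex::new(0));

    // Hold the lock briefly elsewhere so fut1's first poll leaves it parked
    // in the mutex's wait queue instead of acquiring immediately.
    let background = {
        let lock = Arc::clone(&lock);
        tokio::spawn(async move {
            let _guard = lock.lock().await;
            sleep(Duration::from_millis(50)).await;
        })
    };
    sleep(Duration::from_millis(5)).await; // let the background task grab the lock first

    let mut fut1 = Box::pin(do_stuff(&lock));

    tokio::select! {
        _ = &mut fut1 => {}
        _ = sleep(Duration::from_millis(10)) => {
            // This branch wins, and fut1 is never polled again; but it is still
            // alive and still first in the mutex's queue. When the background
            // task releases the lock, it is handed to fut1, which nothing is
            // driving, so this await never completes: futurelock.
            do_stuff(&lock).await;
        }
    }

    let _ = background.await;
}
```

Run as written, this hangs at the inner `do_stuff` call; the RFD walks through why, and through the mitigations.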

1. hitekker ◴[] No.45777349[source]
Skimming through, this document feels thorough and transparent. Clearly, a hard lesson learned. The footnotes, in particular, caught my eye https://rfd.shared.oxide.computer/rfd/397#_external_referenc...

> Why does this situation suck? It’s clear that many of us haven’t been aware of cancellation safety and it seems likely there are many cancellation issues all over Omicron. It’s awfully stressful to find out while we’re working so hard to ship a product ASAP that we have some unknown number of arbitrarily bad bugs that we cannot easily even find. It’s also frustrating that this feels just like the memory safety issues in C that we adopted Rust to get away from: there’s some dynamic property that the programmer is responsible for guaranteeing, the compiler is unable to provide any help with it, the failure mode for getting it wrong is often undebuggable (by construction, the program has not done something it should have, so it’s not like there’s a log message or residual state you could see in a debugger or console), and the failure mode for getting it wrong can be arbitrarily damaging (crashes, hangs, data corruption, you name it). Add on that this behavior is apparently mostly undocumented outside of one macro in one (popular) crate in the async/await ecosystem and yeah, this is frustrating. This feels antithetical to what many of us understood to be a core principle of Rust, that we avoid such insidious runtime behavior by forcing the programmer to demonstrate at compile-time that the code is well-formed

replies(2): >>45778263 #>>45779161 #
2. rtpg ◴[] No.45778263[source]
I guess one big question here is whether there's a higher layer abstraction that is available to wrap around patterns to avoid this.

It does feel like there are still, in general, possibilities of deadlocks in Rust concurrency, right? I understand the feeling here that ... uhh ... some RAII-style _something_ should be preventing this, because it feels like we should be able to identify this issue statically in this simple case.

I still have a hard time understanding how much of this is incidental and how much of this is just downstream of the Rust/Tokio model not having enough to work with here.

replies(3): >>45778773 #>>45780272 #>>45780792 #
3. embedding-shape ◴[] No.45778773[source]
> I guess one big question here is whether there's a higher layer abstraction that is available to wrap around patterns to avoid this.

Something like Actors, on top of Tokio, would be one way: https://ryhl.io/blog/actors-with-tokio/
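
A minimal sketch of that pattern (illustrative names, not code from the linked post): one task owns the state, everyone else talks to it over a channel, and no lock is ever held across an `.await`.

```rust
use tokio::sync::{mpsc, oneshot};

// Messages the actor understands; queries carry a oneshot sender for the reply.
enum Msg {
    Incr,
    Get { reply: oneshot::Sender<u64> },
}

// The actor task: sole owner of the counter, so no locking is needed at all.
async fn counter_actor(mut rx: mpsc::Receiver<Msg>) {
    let mut value = 0u64;
    while let Some(msg) = rx.recv().await {
        match msg {
            Msg::Incr => value += 1,
            Msg::Get { reply } => {
                let _ = reply.send(value);
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(16);
    tokio::spawn(counter_actor(rx));

    tx.send(Msg::Incr).await.unwrap();
    let (reply_tx, reply_rx) = oneshot::channel();
    tx.send(Msg::Get { reply: reply_tx }).await.unwrap();
    println!("counter = {}", reply_rx.await.unwrap());
}
```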

replies(2): >>45779284 #>>45780277 #
4. csande17 ◴[] No.45779161[source]
In case anyone else was confused: the link/quote in this comment are from the previous "async cancellation issue" write-up, which describes a situation where you "drop" a future: the code in the async function stops running, and all the destructors on its local variables are executed.

The new write-up from OP describes how you can "forget" a future (or just hold onto it longer than you meant to), in which case the code in the async function stops running but the destructors are NOT executed.

Both of these behaviors are allowed by Rust's fairly narrow definition of "safety" (which allows memory leaks, deadlocks, infinite loops, and, obviously, logic bugs), but I can see why you'd be disappointed if you bought into the broader philosophy of Rust making it easier to write correct software. Even the Rust team themselves aren't immune -- see the "leakpocalypse" from before 1.0.
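
A small sketch of the difference (the `Noisy` type, the timings, and the structure here are mine, purely for illustration): cancelling by drop runs the destructors of the future's live locals, while leaking the future does not.

```rust
use std::mem;
use std::time::Duration;
use tokio::time::sleep;

struct Noisy(&'static str);
impl Drop for Noisy {
    fn drop(&mut self) {
        println!("dropped: {}", self.0);
    }
}

async fn holds_local(label: &'static str) {
    let _g = Noisy(label);
    sleep(Duration::from_secs(3600)).await; // parks here after the first poll
}

#[tokio::main]
async fn main() {
    // Case 1: cancel by dropping. The future is polled once (so _g exists),
    // then dropped when select! finishes: "dropped: dropped future" is printed.
    let fut = holds_local("dropped future");
    tokio::select! {
        _ = fut => {}
        _ = sleep(Duration::from_millis(10)) => {} // this branch wins
    }

    // Case 2: leak it instead. The body never resumes and the destructor of _g
    // never runs; nothing is printed for "forgotten future".
    let mut fut = Box::pin(holds_local("forgotten future"));
    tokio::select! {
        _ = &mut fut => {}
        _ = sleep(Duration::from_millis(10)) => {}
    }
    mem::forget(fut);
}
```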

replies(3): >>45780439 #>>45780713 #>>45780756 #
5. smallstepforman ◴[] No.45779284{3}[source]
I love Actors and have used them professionally for over 6 years (C++). However, to solve real-world problems I have had to introduce “locks” to the Actor framework to support various scenarios. With my home-grown actor library this was trivial to add, but for some 3rd-party actor libraries, ideology is dominant and the devs refuse to add such a purity-breaking feature to their actor framework, hence I cannot use their library for real-world code.
replies(2): >>45779753 #>>45779887 #
6. eklavya ◴[] No.45779753{4}[source]
That sounds interesting, what kind of actor use cases would require adding locks to actors?
7. logicchains ◴[] No.45779887{4}[source]
What scenario requires locks that can't be solved by just having a single actor that owns the resource and controls access?
replies(2): >>45779985 #>>45784138 #
8. rtpg ◴[] No.45779985{5}[source]
I would imagine that... "soft realtime" might be overstating it, but in performance-sensitive scenarios the actual cost of having some coordination code in that space might start mattering.

Maybe actor abstractions end up compiling away fairly nicely in Rust though!

9. gf000 ◴[] No.45780272[source]
> It does feel like there's still generally possibilities of deadlocks in Rust concurrency right?

I mean, is there any generic computation model where you can't have deadlocks? Even with stuff like actors you can trivially have cycles and now your blocking primitive is just different (not CPU-level), and we call it a livelock, but it's fundamentally the same.
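
To make that concrete, a toy sketch (illustrative, not from any particular actor framework): two Tokio tasks acting as actors, each of which asks the other before it can answer, form exactly the same wait-for cycle with no mutex in sight.

```rust
use tokio::sync::{mpsc, oneshot};

struct Ask {
    reply: oneshot::Sender<u64>,
}

// Each actor answers a request only after first asking its peer.
async fn actor(name: &'static str, mut inbox: mpsc::Receiver<Ask>, peer: mpsc::Sender<Ask>) {
    while let Some(ask) = inbox.recv().await {
        let (tx, rx) = oneshot::channel();
        peer.send(Ask { reply: tx }).await.unwrap();
        // While we wait here we are NOT reading our own inbox...
        let peer_answer = rx.await.unwrap();
        let _ = ask.reply.send(peer_answer + 1);
        println!("{name} answered");
    }
}

#[tokio::main]
async fn main() {
    let (to_a, a_inbox) = mpsc::channel(8);
    let (to_b, b_inbox) = mpsc::channel(8);
    tokio::spawn(actor("a", a_inbox, to_b));
    tokio::spawn(actor("b", b_inbox, to_a.clone()));

    let (tx, rx) = oneshot::channel();
    to_a.send(Ask { reply: tx }).await.unwrap();
    // a asks b, b asks a, and neither is reading its inbox any more:
    // this await never completes.
    let _ = rx.await;
}
```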

10. gf000 ◴[] No.45780277{3}[source]
Then you just replace deadlocks with livelocks, the fundamental problem AFAIK can't be avoided.
11. formerly_proven ◴[] No.45780439[source]
Async Rust continues to strike me as half-baked and too complex. If you're developing an application (as opposed to some high-performance utility, e.g. a data-plane component), just use threads; they're plenty cheap and not even half as messy.
replies(3): >>45781030 #>>45783781 #>>45784006 #
12. nialv7 ◴[] No.45780713[source]
Yeah, Rust mostly just eliminates memory safety and data race problems, which is an enormous improvement compared to what we had previously. Unfortunately, right now if you really want to write software that's guaranteed to be correct, there's no alternative to formal verification.
replies(3): >>45780785 #>>45783966 #>>45785503 #
13. zozbot234 ◴[] No.45780756[source]
> The new write-up from OP is that you can "forget" a future (or just hold onto it longer than you meant to), in which case the code in the async function stops running but the destructors are NOT executed.

If you're relying for global correctness on some future being continuously polled, you should just be spawning async tasks instead. Then the runtime takes care of the polling for you; you can't just neglect it -- unless the whole thread is blocked, which really shouldn't happen. "Futures" are intentionally a lower-level abstraction than "async runtime tasks".
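
In code, the distinction looks roughly like this (illustrative names, not from the RFD): a bare future only makes progress while something polls it, whereas a spawned task is driven by the runtime no matter what its creator does next.

```rust
use std::time::Duration;
use tokio::time::sleep;

// Something the rest of the system depends on happening continuously,
// e.g. renewing a lease or sending heartbeats.
async fn keep_lease_alive() {
    loop {
        println!("renewed");
        sleep(Duration::from_millis(100)).await;
    }
}

#[tokio::main]
async fn main() {
    // As a bare future: nothing happens until (and unless) someone awaits it.
    let _unpolled = keep_lease_alive();

    // As a task: the runtime keeps polling it regardless of the caller.
    let handle = tokio::spawn(keep_lease_alive());

    sleep(Duration::from_millis(350)).await; // the spawned task renews a few times
    handle.abort(); // and cancelling it is now an explicit, visible operation
}
```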

replies(1): >>45781456 #
14. IshKebab ◴[] No.45780785{3}[source]
Minor nit: formal verification doesn't guarantee correctness.
15. IshKebab ◴[] No.45780792[source]
The Fuchsia guys use the trait system to enforce a global mutex locking order, which can statically prevent deadlocks where two threads each hold a mutex that the other is waiting for.

Doesn't help in this case, but it does suggest that we might be able to do better.
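
The general idea can be sketched independently of Fuchsia (this is a toy illustration, not Fuchsia's actual lock-ordering API): every lock is tagged with a level, acquiring one requires a token recording the highest level already held, and an out-of-order acquisition simply fails to compile.

```rust
use std::marker::PhantomData;
use std::sync::{Mutex, MutexGuard};

/// "A lock at level `Self` may be acquired while holding a lock at level `L`."
trait LockAfter<L> {}

// Zero-sized levels forming the global order: Unlocked -> LevelA -> LevelB.
struct Unlocked;
struct LevelA;
struct LevelB;
impl LockAfter<Unlocked> for LevelA {}
impl LockAfter<Unlocked> for LevelB {}
impl LockAfter<LevelA> for LevelB {} // A-then-B is allowed; B-then-A is not.

/// Proof of the highest lock level currently held by this thread.
struct LockToken<Held>(PhantomData<Held>);

/// A mutex tagged with its place in the order.
struct OrderedMutex<Lvl, T> {
    inner: Mutex<T>,
    _level: PhantomData<Lvl>,
}

impl<Lvl, T> OrderedMutex<Lvl, T> {
    fn new(value: T) -> Self {
        Self { inner: Mutex::new(value), _level: PhantomData }
    }

    /// Acquire the lock, consuming proof that only lower levels are held.
    fn lock<'a, Held>(
        &'a self,
        _token: &'a mut LockToken<Held>,
    ) -> (MutexGuard<'a, T>, LockToken<Lvl>)
    where
        Lvl: LockAfter<Held>,
    {
        (self.inner.lock().unwrap(), LockToken(PhantomData))
    }
}

fn main() {
    let a: OrderedMutex<LevelA, u32> = OrderedMutex::new(1);
    let b: OrderedMutex<LevelB, u32> = OrderedMutex::new(2);

    let mut unlocked = LockToken::<Unlocked>(PhantomData);
    let (guard_a, mut held_a) = a.lock(&mut unlocked); // ok: LevelA after Unlocked
    let (guard_b, _held_b) = b.lock(&mut held_a);      // ok: LevelB after LevelA
    println!("{} {}", *guard_a, *guard_b);

    // Swapping the order does not compile, because LevelA: LockAfter<LevelB>
    // is not implemented:
    // let (guard_b, mut held_b) = b.lock(&mut unlocked);
    // let (guard_a, _) = a.lock(&mut held_b);
}
```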

replies(1): >>45782021 #
16. kibwen ◴[] No.45781030{3}[source]
Async Rust is as complex as it needs to be given its constraints. But I wholeheartedly agree with you that people need to treat threads (especially scoped ones) as the default concurrency primitive. My intuition is that experience with other languages has led people astray; in most languages threads are a nightmare and/or async is the default or only way to achieve concurrency, but threads in Rust are absolutely divine by comparison. Async should only be used when you have a good reason that threads don't suffice.
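
For anyone who has only used `std::thread::spawn`, scoped threads (stable since Rust 1.63) are a big part of why plain threads feel so pleasant here: they can borrow directly from the enclosing stack frame, and the scope guarantees they are joined before that frame goes away. A tiny example:

```rust
use std::thread;

fn main() {
    let words = vec!["scoped", "threads", "borrow", "freely"];
    let mut lengths = vec![0usize; words.len()];

    // Each spawned thread borrows from `words` and writes into its own slot of
    // `lengths`; no Arc, no Mutex, no 'static bounds required.
    thread::scope(|s| {
        for (word, len) in words.iter().zip(lengths.iter_mut()) {
            s.spawn(move || {
                *len = word.len();
            });
        }
    });

    println!("{lengths:?}");
}
```
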
replies(2): >>45781321 #>>45782204 #
17. redman25 ◴[] No.45781321{4}[source]
It's a good idea in concept, but tons of popular libraries use async, which makes it difficult to avoid. If you want to do anything with a web server or send requests, the popular libraries are most likely async.
replies(1): >>45781538 #
18. ◴[] No.45781456{3}[source]
19. galangalalgol ◴[] No.45781538{5}[source]
Yeah, the non-async NATS client got deprecated, for instance. It really is a shame, because very few projects will ever scale large enough to need async, and apart from things like this, there are costs in portability and supply-chain attack surface when you bring in Tokio.
20. thenewwazoo ◴[] No.45782021{3}[source]
Any chance you could dig up a link to that code? I’m curious to learn more
replies(1): >>45782132 #
21. IshKebab ◴[] No.45782132{4}[source]
https://lwn.net/Articles/995814/
22. mwcampbell ◴[] No.45782204{4}[source]
In the spirit of "every non-trivial program will expand until ...", I think preemptively choosing async for anything much more complex than a throwaway script might be justified. In this case, the relevant thing isn't performance or expected number of concurrent users/connections, but whether the program is likely to become or include a non-trivial state machine. My primary influence on this topic is this post from @sunshowers: https://sunshowers.io/posts/nextest-and-tokio/
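
To illustrate what "a non-trivial state machine" tends to mean in practice, here is a toy sketch (mine, not from the linked post; assumes Tokio with its `time` and `signal` features): one loop reacting to whichever of several event sources fires first, which is exactly the thing that gets awkward with blocking threads.

```rust
use std::time::Duration;
use tokio::{select, signal, sync::mpsc, time};

#[tokio::main]
async fn main() {
    // A queue of work items arriving from elsewhere in the program.
    let (tx, mut jobs) = mpsc::channel::<String>(8);
    tokio::spawn(async move {
        let _ = tx.send("build".into()).await;
        let _ = tx.send("test".into()).await;
    });

    let mut ticker = time::interval(Duration::from_secs(1));
    loop {
        select! {
            _ = signal::ctrl_c() => {
                println!("shutting down");
                break;
            }
            _ = ticker.tick() => {
                println!("heartbeat");
            }
            job = jobs.recv() => match job {
                Some(job) => println!("running {job}"),
                None => break, // all senders gone; nothing left to do
            },
        }
    }
}
```
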
23. xmodem ◴[] No.45783781{3}[source]
I work on an application that has various components split between sync and async rust. For certain tasks, async actually makes things a lot simpler.
24. pjmlp ◴[] No.45783966{3}[source]
Only if the data structures aren't exposed outside of the program; otherwise Rust cannot guarantee safety from data race problems caused by OS IPC mechanisms like memory-mapped data, shared memory segments, or DMA buffers accessed by external events.
25. pjmlp ◴[] No.45784006{3}[source]
The main issue was shipping it without proper runtime support, and even nowadays async/await is synonymous with Tokio.

Look at .NET: it took almost a decade to sort out async/await across all the platform and language layers, and even today there are a few gotchas.

https://github.com/gerardo-lijs/Asynchronous-Programming

Rust still has a similar path to travel, with async traits, better Pin ergonomics, async lambdas, async loops... (yes, I know some of them have been dealt with).

26. smallstepforman ◴[] No.45784138{5}[source]
Any scenario where you have to atomically update 2 actors. To use a simple analogy for illustrative purposes: when transferring money between 2 accounts, you need to lock both actors before incrementing/decrementing, because in the real world the accounts can change under you from other pending parallel transactions and edits. Handshakes are very error-prone. Lock the actor, do the critical transaction, unlock.

In a rational world, this works. In a prejudiced world, devs fight against locks in actor models.

Hence why I had to roll my own …
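
For contrast, the alternative raised upthread can be sketched like this (illustrative types, not from either commenter's code): a single actor owns every account, so a transfer is one message and atomic by construction, at the cost of funnelling all account traffic through that one actor.

```rust
use std::collections::HashMap;
use tokio::sync::{mpsc, oneshot};

enum BankMsg {
    Transfer {
        from: u32,
        to: u32,
        amount: i64,
        reply: oneshot::Sender<Result<(), &'static str>>,
    },
}

// One actor owns all balances, so a transfer touches both accounts in a
// single message and no cross-actor locking is needed.
async fn bank_actor(mut rx: mpsc::Receiver<BankMsg>) {
    let mut balances: HashMap<u32, i64> = HashMap::from([(1, 100), (2, 50)]);
    while let Some(BankMsg::Transfer { from, to, amount, reply }) = rx.recv().await {
        let result = if balances.get(&from).copied().unwrap_or(0) < amount {
            Err("insufficient funds")
        } else {
            *balances.entry(from).or_insert(0) -= amount;
            *balances.entry(to).or_insert(0) += amount;
            Ok(())
        };
        let _ = reply.send(result);
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(16);
    tokio::spawn(bank_actor(rx));

    let (reply_tx, reply_rx) = oneshot::channel();
    tx.send(BankMsg::Transfer { from: 1, to: 2, amount: 30, reply: reply_tx })
        .await
        .unwrap();
    println!("transfer: {:?}", reply_rx.await.unwrap());
}
```

Whether that serialization point is acceptable is exactly the kind of real-world constraint being argued about here.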

27. dap ◴[] No.45785503{3}[source]
I would say it can go further than that: Rust enables you to construct many APIs in a way that can’t be misused. It’s not at all unique in this way, but compared with C or Go or the like, you can encode so many more constraints in types.