
Futurelock: A subtle risk in async Rust

(rfd.shared.oxide.computer)
431 points | bcantrill

This RFD describes our distillation of a really gnarly issue that we hit in the Oxide control plane.[0] Not unlike our discovery of the async cancellation issue[1][2][3], this is larger than the issue itself -- and worse, the program that hits futurelock is correct from the programmer's point of view. Fortunately, the surface area here is smaller than that of async cancellation and the conditions required to hit it can be relatively easily mitigated. Still, this is a pretty deep issue -- and something that took some very seasoned Rust hands quite a while to find.

[0] https://github.com/oxidecomputer/omicron/issues/9259

[1] https://rfd.shared.oxide.computer/rfd/397

[2] https://rfd.shared.oxide.computer/rfd/400

[3] https://www.youtube.com/watch?v=zrv5Cy1R7r4
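
For a feel of the shape of the problem before diving into the RFD, here is a minimal, illustrative sketch (not the RFD's actual code) of how tokio::select! plus a shared tokio::sync::Mutex can futurelock -- a future that holds the lock simply stops being polled while the same task waits on that lock elsewhere:

    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::Mutex;

    // A future that takes the lock and holds it across an .await point.
    async fn hold_lock(lock: Arc<Mutex<()>>) {
        let _guard = lock.lock().await;
        tokio::time::sleep(Duration::from_secs(1)).await;
        // _guard is dropped here -- but only if this future keeps being polled.
    }

    #[tokio::main]
    async fn main() {
        let lock = Arc::new(Mutex::new(()));

        // Create the future once, outside the select!, and poll it by reference.
        let mut fut1 = Box::pin(hold_lock(Arc::clone(&lock)));

        tokio::select! {
            _ = &mut fut1 => {
                println!("fut1 finished first");
            }
            // This branch wins the race: it becomes ready while fut1 is parked
            // in its sleep, still holding the mutex.
            _ = tokio::time::sleep(Duration::from_millis(10)) => {
                // The task now awaits the same mutex. fut1 is no longer being
                // polled (the select! has already resolved), so it can never
                // finish its sleep and drop the guard. This await never
                // completes: futurelock.
                let _guard = lock.lock().await;
                println!("unreachable");
            }
        }
    }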

jacquesm | No.45776483
If any Rust designers are lurking about here: what made you decide to go for the async design pattern instead of the actor pattern, which - to me at least - seems so much cleaner and so much harder to get wrong?

Ever since I started using Erlang it felt like I had finally found 'the right way'; before that I did a lot of work with sockets and asynchronous worker threads. Even though that approach usually worked as advertised, it had a large number of really nasty pitfalls which the actor model seemed to - effortlessly - sidestep.

So I'm seriously wondering what the motivation was. I get why JS uses async: there isn't any other way there, and by the time they added async it was too late to change the fundamentals of the language to such a degree. But Rust was a clean slate.

replies(5): >>45776498 #>>45776569 #>>45776637 #>>45776798 #>>45777596 #
mdasen | No.45777596
I'd recommend watching this video: https://www.infoq.com/presentations/rust-2019/; and reading this: https://tokio.rs/blog/2020-04-preemption

I'm not the right person to write a tl;dr, but here goes.

For actors, you're basically talking about green threads. Rust had a hard constraint that calls to C not have overhead, so green threads were out: C expects a real stack, so you'd have to switch from your green-thread stack to a full-sized one, call the C function, then switch back. I think Erlang also does some magic where it moves things to a separate thread pool so that a C FFI call can block without blocking the rest of your Erlang actors.
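
In tokio terms (my analogy, not something from the talk or the blog post), the escape hatch for a blocking C call is to push it onto a separate thread pool so the async workers keep running; a minimal sketch, with a plain Rust function standing in for the FFI call:

    use std::time::Duration;

    // Stand-in for a blocking call made over FFI (imagine an extern "C"
    // function that grinds for a while or blocks on a syscall).
    fn blocking_ffi_call(x: u64) -> u64 {
        std::thread::sleep(Duration::from_millis(200));
        x * 2
    }

    #[tokio::main]
    async fn main() {
        // spawn_blocking moves the call onto a dedicated blocking thread pool,
        // roughly the same idea as Erlang shunting blocking NIF work off to
        // its dirty schedulers.
        let answer = tokio::task::spawn_blocking(|| blocking_ffi_call(21))
            .await
            .expect("blocking task panicked");
        assert_eq!(answer, 42);
    }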

Generally, async/await has lower overhead because it gets compiled down to a state machine that an event loop polls. Languages like Go and Erlang are great, but Rust is a systems programming language looking for zero-cost abstractions rather than just "it's fast."
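
To make "compiled down to a state machine" concrete, here's a rough, hand-written version of what an async fn becomes (illustrative only; the compiler's real output is more involved, but the point is that the "stack frame" is just an enum the executor polls):

    use std::future::Future;
    use std::pin::Pin;
    use std::task::{Context, Poll};

    // Roughly the state machine generated for:
    //     async fn add_one(x: u32) -> u32 { x + 1 }
    enum AddOne {
        Start(u32),
        Done,
    }

    impl Future for AddOne {
        type Output = u32;

        fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u32> {
            // AddOne has no self-references, so it is Unpin and get_mut is fine.
            let this = self.get_mut();
            match std::mem::replace(this, AddOne::Done) {
                AddOne::Start(x) => Poll::Ready(x + 1),
                AddOne::Done => panic!("polled after completion"),
            }
        }
    }

    #[tokio::main]
    async fn main() {
        // The executor (the "event loop") just polls the state machine.
        assert_eq!(AddOne::Start(41).await, 42);
    }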

To some extent, you can trade overhead for ease. Garbage collectors are easy, but they come with overhead compared to Rust's borrow-checker approach or manual malloc/free.

To an extent it's about tradeoffs and what you're trying to make. Erlang and Go were trying to build something different where different tradeoffs made sense.

EDIT: I'd also note that before Go introduced preemption, it too had "pitfalls". If a goroutine didn't trigger a stack reallocation (like a function call that would make it grow the stack) or do something that would yield (like blocking IO), it could starve other goroutines. Now Go does preemption checks so that the scheduler can interrupt hot loops. I think Erlang works somewhat similarly to Rust's cooperative scheduling in that its actors have a certain budget, every function call decrements that budget, and when they run out of budget they have to yield back to the scheduler.
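
Rust's cooperative scheduling has the same pitfall if a task never reaches an .await; a contrived sketch (mine, not from the tokio post) on a single-threaded runtime:

    use std::time::Duration;

    #[tokio::main(flavor = "current_thread")]
    async fn main() {
        // A task that wants to tick every 100ms.
        tokio::spawn(async {
            loop {
                tokio::time::sleep(Duration::from_millis(100)).await;
                println!("tick");
            }
        });

        // A hot loop with no .await in it: on a single-threaded runtime it
        // never yields, so the ticking task above is starved and nothing is
        // ever printed.
        let hog = tokio::spawn(async {
            loop {
                std::hint::spin_loop(); // stand-in for CPU-bound work
                // Adding `tokio::task::yield_now().await;` here (or moving the
                // work to spawn_blocking) hands control back to the scheduler,
                // much like the manual yields pre-preemption Go relied on.
            }
        });

        hog.await.unwrap();
    }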

replies(1): >>45779665 #
jacquesm | No.45779665
Indeed, in Erlang the budget is counted in 'reductions'. Technically Erlang uses the BEAM as a virtual CPU with some nifty extra features, which lets you pretend you are pre-empting a process when in fact it is the bytecode interpreter doing the work and no interrupts are involved. Erlang would not be able to do this if the Erlang code were translated straight to machine instructions.

But Go does compile down to machine code, which is why, until it implemented pre-emption, it needed those yield points or hooks.

Come to think of it: it is strange that such quota management isn't built into the CPU itself. It seems like a very logical thing to do. Instead we rely on hardware interrupts for pre-emption, and those are pretty fickle. It also means there is a fixed, system-wide granularity for scheduling.

replies(1): >>45781121 #
yxhuvud | No.45781121
Fickle? Pray tell, when the OS switches your thread for another thread, in what way does that fickleness show?
replies(1): >>45787612 #
jacquesm | No.45787612
I take it you've never actually interfaced directly with hardware?

Interrupts are, at the most basic level, an electrical signal to the CPU telling it to push the current instruction pointer (and possibly some other registers) onto the stack and load a new address into the instruction pointer. That means you don't actually know when they will happen, and they are transparent to the point that two instructions you put right after one another may be detoured through an unknown amount of work in some other place.

Any kind of side effect from that detour (time spent, changes made to the state of the machine) has the potential to screw up the previously deterministic path that you were on.

To make matters worse, there are interrupts that can interrupt the detour in turn. There are ways to tell the CPU 'not now', and there are ways in which those can be overridden. If you are lucky you can uniquely identify the device that caused the interrupt, but this isn't always the case, and given the sensitivity of the inputs involved it isn't rare at all for an interrupt to fire spuriously, without any real cause. If that happens and the ISR is not written with that possibility in mind, you may end up with a system in an undefined state.

Interrupts are a very practical mechanism. But they're also a nightmare to deal with in the otherwise orderly affairs of computing, and troubleshooting interrupt-related issues can eat up days, weeks or even months if you are really unlucky.