517 points bkolobara | 41 comments
1. Spivak ◴[] No.45041771[source]
How do you encode the locking issue in the type system? It seems magical. Can you just never hold any locks when calling await? Is it smart enough to know that this scheduler might move work between threads?
replies(5): >>45041806 #>>45041833 #>>45041852 #>>45041891 #>>45041898 #
2. jvanderbot ◴[] No.45041806[source]
Yes, Rust has ways of verifying single access to locks through the borrow checker and lifetimes.
replies(1): >>45041860 #
3. vlovich123 ◴[] No.45041833[source]
Presumably the author is using tokio, which requires the constructed future (e.g. the async function) to be Send (either because of the rules of Rust or because it is annotated as Send), since tokio is a work-stealing runtime and any thread might end up executing a given future (or even start executing it and then, during a pause, move it to another thread for completion). std::sync::MutexGuard intentionally isn't Send because there are platforms that require the acquiring thread to be the one that unlocks the mutex.

One caveat though - using a normal std Mutex within an async environment is an antipattern and should not be done - you can cause all sorts of issues & I believe even deadlock your entire code. You should be using tokio sync primitives (e.g. tokio Mutex), which can yield to the reactor when they need to block. Otherwise the thread that's running the future blocks waiting for that mutex and that reactor never does anything else, which isn't how tokio is designed.

So the compiler is warning about one problem, but you also have to know to be careful not to call blocking functions in an async function.
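
To make this concrete, here is a minimal sketch of the failing shape (not the article's exact code; the function and counter are made up for illustration):

    use std::sync::{Arc, Mutex};

    // Sketch: holding a std MutexGuard across an await point.
    async fn bump(counter: Arc<Mutex<u64>>) {
        let mut guard = counter.lock().unwrap(); // std MutexGuard is !Send
        *guard += 1;
        tokio::task::yield_now().await;          // guard is still alive here
        *guard += 1;
    }
    // tokio::spawn(bump(counter)) is rejected: the generated future has to
    // store the !Send guard across the await, so the future itself is !Send,
    // and tokio::spawn requires the future to be Send.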

replies(6): >>45041892 #>>45041938 #>>45041964 #>>45042014 #>>45042145 #>>45042479 #
4. synalx ◴[] No.45041852[source]
More the latter. The lock guard is not `Send`, so holding it across the await point makes the `impl Future` returned by the async function also not `Send`. Therefore it can't be passed to a scheduler which does work stealing, since that scheduler is typed to require futures to be `Send`.
5. Spivak ◴[] No.45041860[source]
Well yes, but this seems more complicated. If the code were executed on a single thread it would be correct. The compiler somehow knows that await might move the work to another thread, in which case the code would still be correct if it weren't for the pesky undefined behavior. At all points it seems like it would be single access, and yet it catches this.
replies(3): >>45042126 #>>45042240 #>>45042247 #
6. maplant ◴[] No.45041892[source]
> using a normal std Mutex within an async environment is an antipattern and should not be done

This is simply not true, and the tokio documentation says as much:

"Contrary to popular belief, it is ok and often preferred to use the ordinary Mutex from the standard library in asynchronous code."

https://docs.rs/tokio/latest/tokio/sync/struct.Mutex.html#wh...

replies(2): >>45042193 #>>45044049 #
7. bkolobara ◴[] No.45041891[source]
Yes, if you use a scheduler that doesn't move work between threads, it will not require the task to be Send, and the example code would compile.

Rust can use that type information and lifetimes to figure out when it's safe and when not.

8. mrtracy ◴[] No.45041898[source]
Rust uses the traits “Send” and “Sync” to encode this information; there is a lot of special tooling in the compiler around these.

A type is “Send” if it can be moved from one thread to another, and it is “Sync” if it can be simultaneously accessed from multiple threads.

These traits are automatically applied whenever the compiler knows it is safe to do so. In cases where automatic application is not possible, the developer can explicitly declare a type to have these traits, but doing so is unsafe (requires the ‘unsafe’ keyword and everything that entails).

You can read more in the Rustonomicon, if you are interested: https://doc.rust-lang.org/nomicon/send-and-sync.html
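
A small sketch of what that looks like in practice (the types here are invented for illustration):

    // Auto-derived: every field is Send + Sync, so the struct is too.
    struct Stats { hits: u64 }

    // A raw pointer field opts a type out of the automatic impls...
    struct Handle { ptr: *mut u8 }

    // ...and the developer can opt back in, taking responsibility
    // for the safety claim themselves.
    unsafe impl Send for Handle {}
    unsafe impl Sync for Handle {}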

9. dilawar ◴[] No.45041938[source]
> One caveat though - using a normal std Mutex within an async environment is an antipattern and should not be done - you can cause all sorts of issues & I believe even deadlock your entire code.

True. I used a std Mutex with tokio and after a few days my API would not respond unless I restarted the container. I was under the impression that if it compiles, it's gonna just work (fearless concurrency), which is usually the case.

replies(2): >>45042113 #>>45042735 #
10. Spivak ◴[] No.45041964[source]
Thank you! It seems so simple in hindsight to have a type that means "can be moved to another thread safely." But the fact that using a Mutex in your function changes the "type" is really novel. It becoming Send once the lock is released before await is just fantastic.
replies(2): >>45042232 #>>45042345 #
11. rcxdude ◴[] No.45042014[source]
It can be an issue, but only if you have long enough contention periods for the mutex to matter. A Mutex which is only held for short time periods can work perfectly fine in an async context. And it's only going to cause deadlocks if it would cause deadlocks in a threaded context, since if Mutexes are only ever held between yield points then it'll only ever be running tasks that are contending for them.
replies(1): >>45044110 #
12. ◴[] No.45042113{3}[source]
13. 0x1ceb00da ◴[] No.45042126{3}[source]
An async function call returns a future which is an object holding the state of the async computation. Multithreaded runtimes like tokio require this future to be `Send`, which means it can be moved from one thread to another. The future generated by the compiler is not Send if there is a local variable crossing an await boundary that is not `Send`.
14. sunshowers ◴[] No.45042145[source]
Using a Tokio mutex is even more of an antipattern :) come to my RustConf talk about async cancellation next week to find out why!
replies(1): >>45044123 #
15. ◴[] No.45042193{3}[source]
16. VWWHFSfQ ◴[] No.45042232{3}[source]
Rust's marker traits (Send, Sync, etc.) + RAII is really what makes it so magical to me.
17. ViewTrick1002 ◴[] No.45042240{3}[source]
That comes from where the tokio task is spawned: spawn annotates the future with a Send bound.

https://docs.rs/tokio/latest/tokio/task/fn.spawn.html

If you want to run everything on the same thread, then LocalSet enables that. See how its spawn_local function does not include the Send bound.

https://docs.rs/tokio/latest/tokio/task/struct.LocalSet.html
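
Roughly, a sketch of the single-threaded version (assuming a current-thread runtime; the task body is made up):

    use std::rc::Rc;
    use tokio::task::LocalSet;

    #[tokio::main(flavor = "current_thread")]
    async fn main() {
        let local = LocalSet::new();
        local
            .run_until(async {
                // spawn_local has no Send bound, so holding a !Send value
                // (here an Rc) across an await point is accepted.
                tokio::task::spawn_local(async {
                    let rc = Rc::new(1);
                    tokio::task::yield_now().await;
                    println!("{rc}");
                })
                .await
                .unwrap();
            })
            .await;
    }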

18. veber-alex ◴[] No.45042247{3}[source]
The compiler doesn't know anything about threads.

The compiler knows the Future doesn't implement the Send trait because MutexGuard is not Send and it crosses await points.

Then, tokio the async runtime requires that futures that it runs are Send because it can move them to another thread.

This is how Rust safety works. The internals of std, tokio and other low level libraries are unsafe but they expose interfaces that are impossible to misuse.

19. NobodyNada ◴[] No.45042345{3}[source]
The way that this fully works together is:

- The return type of Mutex::lock() is a MutexGuard, which is a smart pointer type that 1) implements Deref so it can be dereferenced to access the underlying data, 2) implements Drop to unlock the mutex when the guard goes out of scope, and 3) implements !Send so the compiler knows it is unsafe to send between threads: https://doc.rust-lang.org/std/sync/struct.MutexGuard.html

- Rust's implementation of async/await works by transforming an async function into a state machine object implementing the Future trait. The compiler generates an enum that stores the current state of the state machine and all the local variables that need to live across yield points, with a poll function that (synchronously) advances the coroutine to the next yield point: https://doc.rust-lang.org/std/future/trait.Future.html

- In Rust, a composite type like a struct or enum automatically implements Send if all of its members implement Send.

- An async runtime that can move tasks between threads requires task futures to implement Send.

So, in the example here: because the author held a lock across an await point, the compiler must store the MutexGuard smart pointer as a field of the Future state machine object. Since MutexGuard is !Send, the future also is !Send, which means it cannot be used with an async runtime that moves tasks between threads.

If the author releases the lock (i.e. drops the lock guard) before awaiting, then the guard does not live across yield points and thus does not need to be persisted as part of the state machine object -- it will be created and destroyed entirely within the span of one call to Future::poll(). Thus, the future object can be Send, meaning the task can be migrated between threads.
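
As a sketch of that fix (hypothetical names, not the article's exact code):

    use std::sync::{Arc, Mutex};

    async fn bump(counter: Arc<Mutex<u64>>) {
        {
            let mut guard = counter.lock().unwrap();
            *guard += 1;
        } // guard dropped here, within a single call to poll()

        tokio::task::yield_now().await; // no !Send value lives across this await

        *counter.lock().unwrap() += 1;  // re-acquire after the await if needed
    }
    // The generated future no longer stores a MutexGuard as a field, so it is
    // Send and can be spawned onto a work-stealing runtime.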

20. benmmurphy ◴[] No.45042479[source]
The original poster was 'lucky' that he was using a work-stealing engine. If the engine was not moving tasks between threads then I think the compiler would have been happy, and he could have had the fun of debugging what happens when the same thread tries to lock the same mutex twice. The Rust compiler won't save you from this kind of bug.

> The exact behavior on locking a mutex in the thread which already holds the lock is left unspecified. However, this function will not return on the second call (it might panic or deadlock, for example).
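
A contrived sketch of that failure mode on a single-threaded runtime (assuming tokio's current-thread flavor; scheduling order makes this illustrative rather than a guaranteed repro):

    use std::sync::{Arc, Mutex};
    use tokio::task::{spawn_local, yield_now, LocalSet};

    #[tokio::main(flavor = "current_thread")]
    async fn main() {
        let m = Arc::new(Mutex::new(0u32));
        let m2 = m.clone();
        LocalSet::new()
            .run_until(async move {
                spawn_local(async move {
                    let _guard = m2.lock().unwrap(); // task A takes the lock...
                    yield_now().await;               // ...and yields while holding it
                });
                yield_now().await;                   // give task A a chance to run
                // The same OS thread now locks a mutex it already holds:
                // unspecified behavior per the std docs (panic or deadlock).
                let _second = m.lock().unwrap();
            })
            .await;
    }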

21. maplant ◴[] No.45042735{3}[source]
It sounds like you have a deadlock somewhere in your code and it's unlikely that the choice of Mutex really fixed that
replies(1): >>45043991 #
22. vlovich123 ◴[] No.45043991{4}[source]
Or you do something super expensive while holding the mutex.

Or your server is heavily contended enough that all worker threads are blocked on this mutex and no reactor can make forward progress.

23. vlovich123 ◴[] No.45044049{3}[source]
I think that’s a dangerous recommendation: it ignores the consequence of doing this within a single-threaded reactor (your app will hang permanently if the lock is contended), or assumes you won’t have N threads trying to acquire this lock where N is the size of your thread pool, which is entirely possible on a heavily accessed server if the core state that needs to be accessed by every request is locked with said mutex.
replies(1): >>45044232 #
24. vlovich123 ◴[] No.45044110{3}[source]
Imagine if every thread tries to acquire this mutex on every HTTP request. And then you also lock it in a spawn_blocking. On a heavily contended lock, all the reactor threads will end up stuck waiting for the spawn_blocking to finish, and the service will appear hung since even requests that don’t need that lock can’t get processed. And if spawn_blocking is doing synchronous network I/O or reading a file from a hung NFS mount, you’re completely hosed.
replies(1): >>45044763 #
25. vlovich123 ◴[] No.45044123{3}[source]
Most people on this forum are not attending RustConf. It might be helpful to post at least the abstract of your idea.
replies(1): >>45044490 #
26. maplant ◴[] No.45044232{4}[source]
If you're using a single threaded reactor you don't need a mutex at all; a Rc<RefCell<_>> will do just fine. And if you want other tasks to yield until you're done holding the borrow, the solution is simple: don't await until you're done. https://docs.rs/tokio/latest/tokio/task/struct.LocalSet.html...

There are absolutely situations where tokio's mutex and rwlock are useful, but the vast majority of the time you shouldn't need them.
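
For instance, a sketch of the Rc<RefCell<_>> version (the state and task here are made up):

    use std::cell::RefCell;
    use std::rc::Rc;
    use tokio::task::LocalSet;

    #[tokio::main(flavor = "current_thread")]
    async fn main() {
        let state = Rc::new(RefCell::new(Vec::<u32>::new()));
        let local = LocalSet::new();

        let s = state.clone();
        local
            .run_until(async move {
                tokio::task::spawn_local(async move {
                    // Borrow, mutate, and release without awaiting in between.
                    s.borrow_mut().push(1);
                })
                .await
                .unwrap();
            })
            .await;

        println!("{:?}", state.borrow());
    }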

replies(1): >>45045733 #
27. sunshowers ◴[] No.45044490{4}[source]
The big thing is that futures are passive, so any future can be cancelled at any await point by dropping it or not polling it any more. So if you have a situation like this, which in my experience is a very common way to use mutexes:

  let mut guard = mutex.lock().await;
  // guard.data is Option<T>, Some to begin with
  let data = guard.data.take(); // guard.data is now None

  let new_data = process_data(data).await;
  guard.data = Some(new_data); // guard.data is Some again

Then you could cancel the future at the await point in between while the lock is held, and as a result guard.data will not be restored to Some.
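
For example, one concrete way that cancellation can happen (a sketch; update_data is a made-up wrapper around the code above):

  // If the timeout elapses while process_data(..).await is still pending,
  // the update_data future is dropped at that await point. The tokio
  // MutexGuard is released on drop, but the data behind the mutex is left
  // as None for whoever locks it next.
  let result = tokio::time::timeout(
      std::time::Duration::from_millis(100),
      update_data(&mutex),
  )
  .await;
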
replies(2): >>45045706 #>>45045804 #
28. rcxdude ◴[] No.45044763{4}[source]
You would be in a similarly bad situation if such a lock was async as well. You want to keep locks as short as possible, regardless, and a hung task with a lock is generally going to propagate issues throughout the system until it's dealt with.
replies(1): >>45045753 #
29. neandrake ◴[] No.45045706{5}[source]
Same would be true for any resource that needs to be cleaned up, right? Referring to stop-polling-the-future as canceling is probably not good nomenclature. Typically canceling some work requires cleanup, if only to be graceful, let alone properly releasing resources.
replies(1): >>45045790 #
30. vlovich123 ◴[] No.45045733{5}[source]
As you know, Rc<RefCell> won't work for a data structure that's being accessed by other threads (e.g. spawn_blocking or explicitly spawned threads). Just because you are using the single-threaded reactor does not mean you don't have other threads accessing the data.
replies(1): >>45045859 #
31. vlovich123 ◴[] No.45045753{5}[source]
If you have an async lock that can be held across await points, your motivation to use spawn_blocking decreases. Additionally, even though the long-running lock would be a problem, APIs that aren’t taking that lock continue to work vs the entire HTTP service being hung, which is a huge difference (e.g. health monitoring would time out, making the service seem unresponsive, vs 1 highly contended task just waiting on a lock).
replies(1): >>45047260 #
32. sunshowers ◴[] No.45045790{6}[source]
Yes, this is true of any resource. But Tokio mutexes, being shared mutable state, are inherently likely to run into bugs in production.

In the Rust community, cancellation is pretty well-established nomenclature for this.

Hopefully the video of my talk will be up soon after RustConf, and I'll make a text version of it as well for people that prefer reading to watching.

replies(1): >>45045836 #
33. vlovich123 ◴[] No.45045804{5}[source]
I'm not sure this introduces any new failure though:

    let data = mutex.lock().unwrap().take();
    let new_data = process_data(data).await;
    *mutex.lock().unwrap() = Some(new_data);

Here you are using a traditional lock, and a cancellation at process_data results in the lock holding the undesired state you're worried about. It's a general footgun of cancellation and asynchronous tasks that at every await boundary your data has to be in some kind of valid, internally consistent state, because the await may never return. To fix this more robustly you'd need the async drop language feature.
replies(1): >>45045842 #
34. neandrake ◴[] No.45045836{7}[source]
Thank you, I look forward to watching your presentation.
35. sunshowers ◴[] No.45045842{6}[source]
True! This is the case with std mutexes as well. But holding a std MutexGuard across an await point makes the future not Send, and therefore not typically spawnable on a Tokio runtime [1]. This isn't really an intended design decision, though, as far as I can tell -- just one that worked out to avoid this footgun.

Tokio MutexGuards are Send, unfortunately, so they are really prone to cancellation bugs.

(There's a related discussion about panic-based cancellations and mutex poisoning, which std's mutex has but Tokio's doesn't.)

[1] spawn_local does exist, though I guess most people don't use it.

replies(1): >>45045943 #
36. maplant ◴[] No.45045859{6}[source]
It seems like you've found a very specific situation where tokio's Mutex is helpful, but it is simply not common to structure an async Rust application with a single-threaded reactor and copious uses of spawn_blocking. The advice to go for the std Mutex first is generally good.
37. vlovich123 ◴[] No.45045943{7}[source]
Your argument then is that the requirement to acquire the lock multiple times makes it more likely you'll think about cancellation & keeping it in a valid interim state? Otherwise I'm not sure how MutexGuards being Send really makes it any more or less prone to cancellation bugs.
replies(2): >>45046029 #>>45049526 #
38. sunshowers ◴[] No.45046029{8}[source]
Right, in general a lot of uses of mutexes are to temporarily violate invariants that are otherwise upheld while the mutex is released, something that can usually be reasoned out locally. Reasoning about cancellation at an await point is inherently non-local and so is much harder to do. (And Rust is all about scaling up local reasoning to global correctness, so async cancellation feels like a knife in the back to many practitioners.)

The generally recommended alternative is message passing/channels/"actor model" where there's a single owner of data which ensures cancellation doesn't occur -- or, at least that if cancellation happens the corresponding invalid state is torn down as well. But that has its own pitfalls, such as starvation.

This is all very unsatisfying, unfortunately.

39. rcxdude ◴[] No.45047260{6}[source]
>e.g. health monitoring would time out making the service seem unresponsive vs 1 highly contended task just waiting on a lock

If anything that's a disadvantage. You want your health monitoring to be the canary, not something that keeps on trucking even if the system is no longer doing useful work. (See the classic safety critical software fail of 'I need a watchdog... I'll just feed it regularly in an isolated task')

replies(1): >>45053738 #
40. sunshowers ◴[] No.45049526{8}[source]
I realize now that you were proposing unlocking and relocking the mutex several times. Generally, if someone unlocks a mutex in the middle, they're already thinking about races with other callers (even if the code isn't cancelled in between).

You can definitely argue that developers should think about await points the same way they think about letting go of the mutex entirely, in case cancellation happens. Are mutexes conducive to that kind of thinking? Practically, I've found this to be very easy to get wrong.

41. vlovich123 ◴[] No.45053738{7}[source]
I disagree and hope I can convince you. You're assuming that any request has an equal probability of causing this problem so it's kind of OK if the server deadlocks under load so that the failing health check triggers a restart by whatever is watching it. However, let's say you have three API endpoints:

    /healthz
    /very_common_operation
    /may_deadlock_server

Normally, /may_deadlock_server doesn't get enough traffic to cause problems (let's say it's 10 RPS and 1000 RPS is /very_common_operation and the server operates fine). However, a sudden influx of requests to /may_deadlock_server may cause your service to deadlock (and not a lot, let's say on the order of a few hundred requests). Do you still want the server to lock up completely and forever and wait for a healthz timeout to reboot the service? What if healthz still remains fine but the entire service goes from 10ms response times for requests to 200ms, just enough to cause problems but not enough to make healthz actually unavailable? And all this just because /may_deadlock_server saw a spike in traffic. Also, the failing healthz check just restarts your service, but it won't mitigate the traffic spike if it's sustained. Now consider also that /may_deadlock_server is a trivial gadget for an attacker to DoS your site.

Or do you want the web server responding healthily and relying on metrics and alerts to let you know that /may_deadlock_server is taking a long time to handle requests / impacting performance? Your health monitoring is an absolute last step for automatically mitigating an issue, but it'll only help if the bug is some state stuck in a transient condition - if it'll restart into the same conditions leading to the starvation then you're just going to be in an infinite reboot loop, which is worse.

Healthz is not an alternative to metrics and alerting - it's a last stopgap measure to try to automatically get out of a bad situation. But it can also cause a worse problem if the situation is outside of the state of the service - so generally you want the service to remain available if a reboot wouldn't fix the problem.