
517 points bkolobara | 3 comments
Spivak ◴[] No.45041771[source]
How do you encode the locking issue in the type system? It seems magical. Can you just never hold any locks when calling await? Is it smart enough to know that this scheduler might move work between threads?
replies(5): >>45041806 #>>45041833 #>>45041852 #>>45041891 #>>45041898 #
vlovich123 ◴[] No.45041833[source]
Presumably the author is using tokio, which requires the constructed future (i.e. what an async function returns) to be Send (whether via Rust's auto-derivation rules or an explicit Send bound), since tokio is a work-stealing runtime and any thread might end up executing a given future (or even start executing it and then, at a suspension point, move it to another thread for completion). std::sync::MutexGuard intentionally isn't Send because some platforms require that the thread which acquired the mutex be the one to unlock it.
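
A minimal sketch (mine, not from the article) of the kind of code this rule rejects, assuming the tokio crate and a hypothetical do_io helper; the guard lives across the .await, so the future is no longer Send and tokio::spawn refuses it:

    use std::sync::Mutex;

    async fn do_io() {} // hypothetical placeholder for real async work

    fn spawn_it(shared: &'static Mutex<u32>) {
        tokio::spawn(async move {
            let guard = shared.lock().unwrap();
            do_io().await; // compile error: the future is not `Send` because `guard` (a MutexGuard) is held across this await
            println!("{}", *guard);
        });
    }

Dropping the guard in a scope that ends before the .await (or switching to an async mutex, as below) makes the future Send again.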

One caveat though: using a normal std Mutex within an async environment is an antipattern and should not be done. You can cause all sorts of issues and, I believe, even deadlock your entire program. You should be using tokio's sync primitives (e.g. tokio::sync::Mutex), which yield to the runtime when they need to wait. Otherwise the thread that's running the future blocks waiting for that mutex and that reactor does nothing else in the meantime, which isn't how tokio is designed to be used.
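
A minimal sketch, assuming the tokio crate and a hypothetical some_async_io helper, of the async alternative: tokio::sync::Mutex suspends the waiting task instead of blocking the thread, and its guard may be held across an .await:

    use std::sync::Arc;
    use tokio::sync::Mutex;

    async fn append_line(log: Arc<Mutex<Vec<String>>>, line: String) {
        let mut entries = log.lock().await; // suspends this task rather than blocking the worker thread
        entries.push(line);
        some_async_io().await; // holding the guard across an await is fine here
    }

    async fn some_async_io() {} // hypothetical placeholder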

So the compiler warns about one problem, but you also have to know not to call blocking functions inside an async function.
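
For blocking calls that can't be avoided, the usual escape hatch is tokio's blocking thread pool. A minimal sketch, with a hypothetical parse_report standing in for the blocking work:

    use tokio::task;

    fn parse_report(bytes: &[u8]) -> usize {
        bytes.len() // stand-in for expensive synchronous work or blocking I/O
    }

    async fn handle(bytes: Vec<u8>) -> usize {
        // Run the blocking call on tokio's dedicated blocking pool so the
        // async worker threads stay free to poll other futures.
        task::spawn_blocking(move || parse_report(&bytes))
            .await
            .expect("blocking task panicked")
    }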

replies(6): >>45041892 #>>45041938 #>>45041964 #>>45042014 #>>45042145 #>>45042479 #
rcxdude ◴[] No.45042014[source]
It can be an issue, but only if the mutex is contended for long enough to matter. A Mutex that is only held for short periods can work perfectly fine in an async context. And it will only cause deadlocks if it would also cause deadlocks in a plain threaded context, since if the mutex is only ever held between yield points, only currently running tasks can be contending for it.
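
A minimal sketch of that pattern (my own example): the std Mutex is only held for a short, await-free critical section, and the guard is dropped before any suspension point:

    use std::sync::{Arc, Mutex};

    async fn bump_counter(counter: Arc<Mutex<u64>>) {
        let current = {
            let mut n = counter.lock().unwrap();
            *n += 1;
            *n
        }; // guard dropped here, before any .await
        report(current).await; // hypothetical async follow-up work
    }

    async fn report(_n: u64) {} // placeholder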
replies(1): >>45044110 #
vlovich123 ◴[] No.45044110[source]
Imagine that every handler acquires this mutex on every HTTP request, and you also lock it inside a spawn_blocking. On a heavily contended lock, all the reactor threads end up stuck waiting for the spawn_blocking to finish, and the service appears hung, since even requests that don't need that lock can't get processed. And if the spawn_blocking is doing synchronous network I/O or reading a file from a hung NFS mount, you're completely hosed.
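
A sketch (hypothetical names, condensed from the scenario above) of how that starvation plays out: the spawn_blocking task holds a std Mutex while doing slow synchronous work, and every handler that touches the same lock then blocks its reactor thread inside lock():

    use std::sync::{Arc, Mutex};
    use std::time::Duration;

    fn slow_sync_call() {
        std::thread::sleep(Duration::from_secs(30)); // e.g. sync network I/O or a hung NFS read
    }

    async fn handle_request(state: Arc<Mutex<u64>>) -> u64 {
        *state.lock().unwrap() // blocks this reactor thread for as long as the job below holds the lock
    }

    fn start_background_job(state: Arc<Mutex<u64>>) {
        tokio::task::spawn_blocking(move || {
            let _guard = state.lock().unwrap();
            slow_sync_call(); // lock held for the entire call
        });
    }

With enough concurrent requests, every worker thread ends up parked inside handle_request and nothing else gets polled.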
replies(1): >>45044763 #
rcxdude ◴[] No.45044763{3}[source]
You would be in a similarly bad situation if such a lock were async as well. You want to hold locks for as short a time as possible regardless, and a hung task holding a lock is generally going to propagate problems throughout the system until it's dealt with.
replies(1): >>45045753 #
1. vlovich123 ◴[] No.45045753{4}[source]
If you have an async lock that can be held across await points, your motivation to reach for spawn_blocking decreases. Additionally, even though the long-running lock would still be a problem, endpoints that don't take that lock keep working, versus the entire HTTP service hanging, which is a huge difference (e.g. health monitoring would time out, making the service seem unresponsive, vs one highly contended endpoint just waiting on a lock).
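
A minimal sketch of that contrast (hypothetical handlers): with an async lock the contended task is merely suspended, so handlers that never touch the lock keep getting polled:

    use std::sync::Arc;
    use tokio::sync::Mutex;

    async fn contended_handler(state: Arc<Mutex<u64>>) -> u64 {
        *state.lock().await // waits without tying up a worker thread
    }

    async fn healthz() -> &'static str {
        "ok" // never takes the lock, so it stays responsive even under contention
    }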
replies(1): >>45047260 #
2. rcxdude ◴[] No.45047260[source]
> e.g. health monitoring would time out, making the service seem unresponsive, vs one highly contended endpoint just waiting on a lock

If anything, that's a disadvantage. You want your health monitoring to be the canary, not something that keeps on trucking even when the system is no longer doing useful work. (See the classic safety-critical software failure of "I need a watchdog... I'll just feed it regularly from an isolated task".)

replies(1): >>45053738 #
3. vlovich123 ◴[] No.45053738[source]
I disagree, and hope I can convince you. You're assuming every request has an equal probability of causing this problem, so it's acceptable for the server to deadlock under load and let the failing health check trigger a restart by whatever is watching it. However, let's say you have three API endpoints:

    /healthz
    /very_common_operation
    /may_deadlock_server
Normally, /may_deadlock_server doesn't get enough traffic to cause problems (say it's 10 RPS while /very_common_operation is 1000 RPS and the server runs fine). However, a sudden influx of requests to /may_deadlock_server can deadlock your service, and not even a big influx, say on the order of a few hundred requests. Do you still want the server to lock up completely and forever, waiting for a healthz timeout to reboot the service? What if healthz still looks fine but response times for ordinary requests go from 10ms to 200ms, just enough to cause problems but not enough to make healthz actually unavailable? And all this just because /may_deadlock_server saw a spike in traffic. Also, the failing healthz check only restarts your service; it won't mitigate the traffic spike if it's sustained. Now consider that /may_deadlock_server is also a trivial gadget for an attacker to DoS your site.

Or do you want the web server to keep responding, and rely on metrics and alerts to tell you that /may_deadlock_server is taking a long time to handle requests and impacting performance? Your health monitoring is an absolute last step for automatically mitigating an issue, and it only helps if the bug is some state stuck in a transient condition; if the service restarts straight back into the same conditions that caused the starvation, you're just in an infinite reboot loop, which is worse.

Healthz is not an alternative to metrics and alerting; it's a last stopgap measure to try to get out of a bad situation automatically. But it can also make things worse when the cause lies outside the service's own state, so generally you want the service to remain available if a reboot wouldn't fix the problem.