How to do distributed locking (2016)

(martin.kleppmann.com)

244 points yusufaytas | 1 comments | 20 Oct 24 10:38 UTC


dataflow ◴[20 Oct 24 17:09 UTC] No.41896851[source]▶

> The lock has a timeout (i.e. it is a lease), which is always a good idea (otherwise a crashed client could end up holding a lock forever and never releasing it). However, if the GC pause lasts longer than the lease expiry period, and the client doesn’t realise that it has expired, it may go ahead and make some unsafe change.
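To make the quoted failure mode concrete, here is a minimal simulation (not code from the article). `FakeClock`, `LeaseServer`, and the client names are all hypothetical; the GC pause is modeled by advancing a fake clock between the client's lease check and its write.

```python
class FakeClock:
    """Manually advanced clock so the 'pause' is deterministic."""
    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class LeaseServer:
    """Toy lock service: a single lock, granted as a lease with a TTL."""
    def __init__(self, clock, ttl):
        self.clock = clock
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, client):
        # Grant the lease if it is free or the previous lease has expired.
        if self.holder is None or self.clock.now >= self.expires_at:
            self.holder = client
            self.expires_at = self.clock.now + self.ttl
            return True
        return False

    def is_held_by(self, client):
        return self.holder == client and self.clock.now < self.expires_at


clock = FakeClock()
server = LeaseServer(clock, ttl=10.0)
writes = []

assert server.acquire("client-1")
checked_ok = server.is_held_by("client-1")  # client 1 checks: lease is valid
clock.advance(15.0)                         # simulated pause longer than the TTL
assert server.acquire("client-2")           # lease expired, so client 2 gets it
if checked_ok:                              # client 1 resumes with a stale check
    writes.append("client-1")               # unsafe write: lease is no longer held
writes.append("client-2")
```

Both clients end up writing even though only one of them holds the lease at that point, which is the situation the quoted paragraph describes.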

Hold on, this sounds absurd to me:

First, if your client crashes, then you don't need a timed lease on the lock to detect this in the first place. The lock would get released by the OS or supervisor, whether there are any timeouts or not. If both of those crash too, then the connection would eventually break, and the network system should then detect that (via network resets or timeouts, lack of heartbeats, etc.) and then invalidate all your connections before releasing any locks.

Second, if the problem becomes that your client is buggy and thus holds the lock too long without crashing, then shouldn't some kind of supervisor detect that and then kill the client (e.g., by the OS terminating the process) before releasing the lock for everybody else?

Third, if you are going to have locks with timeouts to deal with corner cases you can't handle like the above, shouldn't they notify the actual program somehow (e.g., by throwing an exception, raising a signal, terminating it, etc.) instead of letting it happily continue execution? And shouldn't those cases wait for some kind of verification that the program was notified before releasing the lock?
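A rough sketch of what "notify the actual program" could look like, assuming a client-side lease object that raises once its local deadline has passed. `LocalLease`, `LeaseExpired`, and `do_work` are hypothetical names, not part of any real lock library.

```python
import time


class LeaseExpired(Exception):
    """Raised when the holder tries to act after its local deadline."""


class LocalLease:
    def __init__(self, ttl, clock=time.monotonic):
        # A monotonic clock avoids jumps from wall-clock adjustments.
        self._clock = clock
        self._deadline = clock() + ttl

    def check(self):
        # Interrupt ordinary control flow instead of letting work continue.
        if self._clock() >= self._deadline:
            raise LeaseExpired("lease deadline passed; stop working")


def do_work(lease, steps, log):
    for step in steps:
        lease.check()     # checked before every step
        log.append(step)  # a pause between check() and this line is NOT caught
```

The last comment marks the limit of this approach: a check at the application level only covers pauses that happen before the check, not after it.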

The whole notion that timeouts should somehow permit the program execution to continue ordinary control flow sounds like the root cause of the problem, and nobody is even batting an eye at it? Is there an obvious reason why this makes sense? I feel I must be missing something here... what am I missing?

replies(2): >>41897032 #>>41897034 #

neonbrain ◴[20 Oct 24 17:33 UTC] No.41897032[source]▶

>>41896851 #

The assumption that your server will always receive a RST or FIN from your client is incorrect. In some cases these packets are dropped, and your server is left holding an open connection while the client on the remote machine is already dead. P.S. BTW, it's not me who downvoted you.

replies(1): >>41897097 #

dataflow ◴[20 Oct 24 17:43 UTC] No.41897097[source]▶

>>41897032 #

I made no such assumption that this will always happen, though. That's why the comment was so much longer than just "isn't TCP RST enough?"... I listed a ton of ways to deal with this that didn't involve letting the program continue happily on its path.

replies(1): >>41898479 #

neonbrain ◴[20 Oct 24 21:08 UTC] No.41898479[source]▶

>>41897097 #

Sorry, I didn't see your message. What I mean is that if you are not getting a RST/FIN or any other indication that your communication channel has closed, you are left with only the mechanism of timeouts to recognize a partitioned/dead/slow worker client. Basically, you've mentioned them yourself ("timeouts, lack of heartbeats, etc." in your post are all forms of timeouts). So you can piggyback on these timeouts or use a smaller timeout configured in the lease, whatever suits your purpose, I guess. This is what I believe Kleppmann is referring to here. He's just being generic in his description.
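As a minimal sketch of "lack of heartbeats is a form of timeout": the server records when it last heard from each client and treats silence longer than a threshold as failure. `HeartbeatMonitor` and its methods are hypothetical names; the clock is injectable so the behavior can be exercised without real waiting.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per client and reports the silent ones."""
    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}

    def heartbeat(self, client_id):
        # Called whenever any message arrives from the client.
        self.last_seen[client_id] = self.clock()

    def expired_clients(self):
        # No RST/FIN is needed: silence past the timeout is the only signal.
        now = self.clock()
        return sorted(
            client for client, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

Note that the monitor cannot tell a dead client from a partitioned or merely slow one; all three look like silence.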

replies(1): >>41898562 #

dataflow ◴[20 Oct 24 21:19 UTC] No.41898562[source]▶

>>41898479 #

> What I mean is that if you are not getting a RST/FIN or any other indication that your communication channel has closed, you are left with only the mechanism of timeouts to recognize a partitioned/dead/slow worker client.

Timeouts were a red herring in my comment. My problem wasn't with the mere existence of timeouts in corner cases, it was the fact that the worker is assumed to keep working merrily on, despite the timeouts. That's what I don't understand the justification for. If the worker is dead, then it's a non-issue, and the lease can be broken. If the system is alive, the host can discover (via RST, heartbeats, or other timeouts) that the storage system is unreachable, and thus prevent the program from continuing execution -- and at that point the storage service can still break the lease (via a timeout), but it would actually come with a timing-based guarantee that the program will no longer continue execution.
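One way to picture the timing-based guarantee, as a sketch under stated assumptions: the client computes a local stop deadline that falls before the server can expire the lease. This assumes the relative clock-rate error between the two machines is bounded by `max_drift`, and that the host really can halt the program at the deadline. `client_stop_deadline` and `may_continue` are hypothetical helpers.

```python
def client_stop_deadline(request_sent_at, ttl, max_drift, margin=0.0):
    """Latest local monotonic time at which the client may still act.

    request_sent_at: local monotonic time read BEFORE sending the acquire
        request, so the server cannot have started the lease any earlier.
    ttl: lease duration granted by the server, in seconds.
    max_drift: ASSUMED bound on relative clock-rate error (0.01 means 1%).
    margin: extra local seconds reserved for stopping in-flight work.
    """
    # Shrink the TTL by the drift bound so a fast server clock cannot
    # expire the lease before the client has stopped.
    return request_sent_at + ttl * (1.0 - max_drift) - margin


def may_continue(now, deadline):
    """True while the client is still strictly inside its safe window."""
    return now < deadline
```

With this ordering the storage service can break the lease on its own timeout, and by then the client's window has already closed, provided both assumptions hold.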
