240 points | yusufaytas | 1 comment

dataflow ◴[] No.41896851[source]
> The lock has a timeout (i.e. it is a lease), which is always a good idea (otherwise a crashed client could end up holding a lock forever and never releasing it). However, if the GC pause lasts longer than the lease expiry period, and the client doesn’t realise that it has expired, it may go ahead and make some unsafe change.

Hold on, this sounds absurd to me:

First, if your client crashes, then you don't need a timed lease on the lock to detect this in the first place. The lock would get released by the OS or supervisor, whether there are any timeouts or not. If both of those crash too, then the connection would eventually break, and the network system should then detect that (via network resets or timeouts, lack of heartbeats, etc.) and then invalidate all your connections before releasing any locks.

Second, if the problem becomes that your client is buggy and thus holds the lock too long without crashing, then shouldn't some kind of supervisor detect that and then kill the client (e.g., by the OS terminating the process) before releasing the lock for everybody else?

Third, if you are going to have locks with timeouts to deal with corner cases you can't handle like the above, shouldn't they notify the actual program somehow (e.g., by throwing an exception, raising a signal, terminating it, etc.) instead of letting it happily continue execution? And shouldn't those cases wait for some kind of verification that the program was notified before releasing the lock?

The whole notion that a timeout should permit the program to continue along its ordinary control flow sounds like the root cause of the problem, and nobody is even batting an eye at it? Is there an obvious reason why this makes sense? I feel I must be missing something here... what am I missing?
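
A minimal sketch of the race the quoted article describes, using a hypothetical lease API (Lease and is_still_valid are made-up names, not a real library): even a client that checks its lease before every write can be paused between the check and the write, so a notification or validity check alone does not close the gap.

    import time

    class Lease:
        def __init__(self, duration_s: float):
            self.expires_at = time.monotonic() + duration_s

        def is_still_valid(self) -> bool:
            return time.monotonic() < self.expires_at

    def write_if_holding(lease: Lease, do_write) -> None:
        # Even with an explicit validity check, the process can be paused
        # (GC, swap, scheduler) between the check and the write, so the
        # write may still land after the lease has expired.
        if lease.is_still_valid():
            do_write()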

replies(2): >>41897032 #>>41897034 #
winwang ◴[] No.41897034[source]
This isn't a mutex, but the distributed equivalent of one. The storage service is the one that invalidates the lock on its side. The client won't detect its own issues without additional guarantees, which Redlock (supposedly) does not give.
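
For concreteness, the storage-side expiry looks roughly like this on a single Redis node (a sketch using redis-py against a local instance; Redlock performs this acquisition against several independent nodes). Once the TTL elapses, the server deletes the key on its own clock, with no acknowledgement from the holder:

    import uuid
    import redis  # assumes redis-py and a Redis server on localhost:6379

    r = redis.Redis(host="localhost", port=6379)

    def acquire(resource: str, ttl_ms: int) -> str | None:
        token = uuid.uuid4().hex
        # SET ... NX PX: create the key only if it is absent, with a
        # server-side TTL; Redis removes it when the TTL elapses, whether
        # or not the original holder is still alive or merely paused.
        if r.set(resource, token, nx=True, px=ttl_ms):
            return token
        return None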
replies(2): >>41897096 #>>41897136 #
dataflow ◴[] No.41897136[source]
I understand that. What I'm hung up on is, why does the storage system feel it is at liberty to just invalidate a lock and thus let someone else reacquire it without any sort of acknowledgment (either from the owner or from the communication systems connecting the owner to the outside world) that the owner will no longer rely on it? It just seems fundamentally wrong. The lock service just... doesn't have that liberty, as I see it.
replies(1): >>41897225 #
winwang ◴[] No.41897225[source]
What if the rack goes down? But I think the author is saying something similar to you. The fencing token essentially asserts that the client can no longer rely on the lock, even if it tries to. The difference is that the service doesn't need any acknowledgement; no permission is needed to simply deny the client later.
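
A minimal sketch of that fencing idea (the class and method names here are illustrative, not the article's code): the lock service hands out a monotonically increasing token with each grant, and the storage service rejects any write carrying a token older than one it has already seen, so a delayed ex-holder is denied without ever being consulted.

    import threading

    class LockService:
        """Issues a fencing token that increases with every grant."""
        def __init__(self) -> None:
            self._counter = 0
            self._mu = threading.Lock()

        def grant(self) -> int:
            with self._mu:
                self._counter += 1
                return self._counter

    class Storage:
        """Accepts a write only if its token is at least the newest seen."""
        def __init__(self) -> None:
            self._highest_seen = 0

        def write(self, token: int, data: bytes) -> bool:
            if token < self._highest_seen:
                return False  # stale holder: denied, no acknowledgement needed
            self._highest_seen = token
            # ... apply `data` ...
            return True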
replies(1): >>41897377 #
dataflow ◴[] No.41897377[source]
To be clear, my objection is to the premise, not to the offered solution.

To your question, could you clarify what exactly you mean by the rack "going down"? That encompasses a lot of different scenarios, and I'm not sure which one you're asking about. The obvious interpretation would break all the connections the program has to the outside world, thus preventing the problem by construction.

replies(2): >>41897707 #>>41898935 #
wbl ◴[] No.41898935{3}[source]
The process that owns the lock is never heard from again.