
240 points | yusufaytas | 3 comments
dataflow No.41896851
> The lock has a timeout (i.e. it is a lease), which is always a good idea (otherwise a crashed client could end up holding a lock forever and never releasing it). However, if the GC pause lasts longer than the lease expiry period, and the client doesn’t realise that it has expired, it may go ahead and make some unsafe change.

Hold on, this sounds absurd to me:

First, if your client crashes, then you don't need a timed lease on the lock to detect this in the first place. The lock would get released by the OS or supervisor, whether there are any timeouts or not. If both of those crash too, then the connection would eventually break, and the network layer should detect that (via resets, timeouts, missing heartbeats, etc.) and invalidate all your connections before releasing any locks.

Second, if the problem becomes that your client is buggy and thus holds the lock too long without crashing, then shouldn't some kind of supervisor detect that and then kill the client (e.g., by the OS terminating the process) before releasing the lock for everybody else?

Third, if you are going to have locks with timeouts to deal with corner cases you can't handle like the above, shouldn't they notify the actual program somehow (e.g., by throwing an exception, raising a signal, terminating it, etc.) instead of letting it happily continue execution? And shouldn't those cases wait for some kind of verification that the program was notified before releasing the lock?

The whole notion that timeouts should somehow permit the program execution to continue ordinary control flow sounds like the root cause of the problem, and nobody is even batting an eye at it? Is there an obvious reason why this makes sense? I feel I must be missing something here... what am I missing?
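To make the objection concrete, here is a toy sketch (all names invented; this is not any real lock service's API) of the pattern being questioned: the service re-grants the lease purely on expiry, with no acknowledgement from, or notification to, the original holder:

```python
LEASE_TTL = 1.0  # lease duration in seconds (illustrative)

class LockService:
    """Toy lock service that grants timed leases."""
    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, client, now):
        # Grant the lock if it is free OR if the previous lease has
        # expired -- note: no check that the old holder knows it lost it.
        if self.holder is None or now >= self.expires_at:
            self.holder = client
            self.expires_at = now + LEASE_TTL
            return True
        return False

svc = LockService()
assert svc.acquire("client-1", now=0.0)   # client-1 holds the lease
# ... client-1 stalls (e.g. a GC pause) past the lease expiry ...
assert svc.acquire("client-2", now=2.0)   # expired; client-2 acquires
# client-1 resumes unaware its lease expired, and nothing in ordinary
# control flow stops it from writing -- exactly the scenario above.
```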

winwang No.41897034
This isn't a mutex, but the distributed equivalent of one. The storage service is the one who invalidates the lock on their side. The client won't detect its own issues without additional guarantees not given (supposedly) by Redlock.
dataflow No.41897136
I understand that. What I'm hung up on is, why does the storage system feel it is at liberty to just invalidate a lock and thus let someone else reacquire it without any sort of acknowledgment (either from the owner or from the communication systems connecting the owner to the outside world) that the owner will no longer rely on it? It just seems fundamentally wrong. The lock service just... doesn't have that liberty, as I see it.
winwang No.41897225
What if the rack goes down? But I think the author is saying something similar to you. The fencing token essentially asserts that the client will no longer rely on the lock, even if it tries to. The difference is that the service doesn't need any acknowledgement; no permission is needed to simply deny the client later.
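A minimal sketch of that idea (names invented; this is not Redlock's API): the storage service remembers the highest fencing token it has seen and simply denies anything older, so no acknowledgement from the stale client is ever needed:

```python
class Storage:
    """Toy storage service that enforces fencing tokens on writes."""
    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token, key, value):
        # Deny any request carrying a token older than one already seen:
        # a stale lease holder is rejected without ever being consulted.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True

s = Storage()
assert s.write(33, "file", "a")       # token 33: accepted
assert s.write(34, "file", "b")       # token 34 (new lease holder): accepted
assert not s.write(33, "file", "c")   # old holder resumes: denied
```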
dataflow ◴[] No.41897377[source]
To be clear, my objection is to the premise, not to the offered solution.

To your question, could you clarify what exactly you mean by the rack "going down"? This encompasses a lot of different scenarios, I'm not sure which one you're asking about. The obvious interpretation would break all the connections the program has to the outside world, thus preventing the problem by construction.

winwang No.41897707
The rack could go down from the point of view of the storage service, but the machine/VM itself could be perfectly fine.
dataflow No.41898443
In that scenario the machine would become aware that it can't reach the storage service either, no? In which case the host can terminate the program, or the network can break all the connections between them, or whatever. By default I'd think the lease shouldn't be broken until the network partition is resolved; the storage system could have a timeout for breaking the lease in that scenario if you really wanted, but then it would come with a time-based guarantee that the program isn't running anymore, no?
winwang No.41898912
Everything you're saying is plausible in the absurdly large search space of possible scenarios. The author's premise, however, is rooted in the specific scenario they lay out, with historical supporting examples you can look into. Even then, the premise before all that was essentially: Redlock does not do what people might expect of a distributed lock. Btw, I do have responses to your questions, but often in these sorts of discussions I find there can always be an objection to an objection to... etc. The "sense" (or flavor) in this case is that "we are taking a complex topic too lightly". In fact, I should probably continue reading the author's book (DDIA) at some point...
dataflow No.41899355
> The "sense" (or flavor) in this case is that "we are taking a complex topic too lightly".

I get that -- and honestly, I'm not expecting a treatise on distributed consensus here. But what took me aback was that the blog post didn't even attempt to acknowledge that the premise (at first glance) looks glaringly broken. If he'd said even a single sentence like "it's {difficult/infeasible/impossible} to design a client that will never continue execution past a timeout", it'd have been fine, and I would've happily moved along. But the way it is written right now, it reads a little bit like: "we design a ticking time bomb that we can't turn off; how can we make sure we don't forget to reset the timer every time?"... without saying anything about why we should be digging ourselves into such a hole in the first place.

winwang No.41900382
Yeah, that makes sense now. I think, personally, I've simply seen that design around a bunch, but good on you for questioning it and calling it out -- it's also plausible that my own headcanon doesn't check out.
dataflow No.41900444
Thanks, yeah. For what it's worth, part of what led me to even leave this comment is that when he wrote "the code above is broken", I stared at it and for the life of me couldn't see why. Because, of course, the code was lying: there was no mention of leases or timeouts. Having a "lease" suddenly pulled out of nowhere really felt like having a fast one pulled on me (and unfairly so!), hence I decided I'd actually leave the comment and question what the basis for this hidden time bomb even was in the first place. If the code had said leaseLock(filename, timeout), I think the bug would've been glaringly obvious, and far fewer people would've been surprised by looking at the code.
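The naming point can be made concrete with a hypothetical sketch (both functions are invented for illustration; neither is the post's actual code):

```python
import time

def lock(filename):
    """Reads like an ordinary mutex: held until explicitly released."""
    return {"file": filename}

def lease_lock(filename, timeout):
    """Honest signature: the handle carries an expiry, so a reader
    immediately asks what happens if execution outlives it."""
    return {"file": filename, "expires_at": time.monotonic() + timeout}

h = lease_lock("myfile", timeout=30)
# The expiry is visible right in the handle -- unlike lock(), where the
# time bomb is hidden:
#   lease_lock("myfile", timeout=30)
#   data = read("myfile")             # what if 30s pass here (GC pause)?
#   write("myfile", transform(data))  # may run after the lease expired
assert "expires_at" in h and "expires_at" not in lock("myfile")
```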

Also for what it's worth, I can guess what some of the answers might be. For example, it's possible you'd need very precise timing facilities that aren't always available, in order to be able to guarantee high throughput with correctness (like Google Spanner's). Or it might be that doing so requires a trade-off between availability and partition-tolerance that in some applications isn't justified. But I'm curious what the answer actually is, rather than just (semi-)random guesses as to what it could be.