240 points yusufaytas | 4 comments
antirez ◴[] No.41895138[source]
I suggest reading the comment I left back then in this blog post's comments section, and the reply I wrote on my blog.

Btw, things to note in random order:

1. Check my comment under this blog post. The author had missed a fundamental point in how the algorithm works. Then he based his rejection of the algorithm on the remaining weaker points.

2. It is not true that you can't wait an approximately correct amount of time with modern computers and APIs. GC pauses are bounded and monotonic clocks work. These are acceptable assumptions.

3. To critique the auto-release mechanism per se, because you don't want to expose yourself to a potential race, is one thing. To critique the algorithm with respect to its goals and its system model is another thing.

4. Over the years Redlock has been used successfully in a huge number of use cases, because if you pick a timeout that is much larger than A) the time to complete the task and B) the random pauses you can have in normal operating systems, race conditions are very hard to trigger, and the other failures in the article have, AFAIK, never been observed. Of course, if you have a super small timeout to auto-release the lock, and the task may easily take this amount of time, you just committed a design error, but that's not about Redlock.
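
A minimal sketch of the arithmetic behind points 2 and 4, not taken from any Redlock implementation: the lease's remaining validity is measured on a monotonic clock, and the timeout is checked against the worst-case task time plus pauses. The function names, the drift allowance, and the factor of 10 standing in for "much larger" are all assumptions made for illustration.

```python
import time

# Hypothetical allowances for clock drift; illustrative values only.
CLOCK_DRIFT_FACTOR = 0.01
SAFETY_MARGIN_MS = 2


def remaining_validity_ms(ttl_ms, started_at, now):
    """Estimate how long the lease can still be trusted.

    started_at and now are time.monotonic() readings in seconds, so
    wall-clock jumps do not affect the elapsed-time measurement.
    """
    elapsed_ms = (now - started_at) * 1000.0
    drift_ms = ttl_ms * CLOCK_DRIFT_FACTOR + SAFETY_MARGIN_MS
    return ttl_ms - elapsed_ms - drift_ms


def timeout_is_comfortable(ttl_ms, worst_task_ms, worst_pause_ms, factor=10):
    """True when the timeout dwarfs task time plus pauses.

    factor=10 is an arbitrary stand-in for "much larger".
    """
    return ttl_ms >= factor * (worst_task_ms + worst_pause_ms)


# Usage: take a monotonic reading when acquisition starts, check later.
start = time.monotonic()
left_ms = remaining_validity_ms(30_000, start, time.monotonic())
```

With these assumed numbers, a 30 s timeout for a 1 s task passes the check, while a 2 s timeout for a task that takes 1.8 s does not.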

replies(2): >>41895296 #>>41895393 #
computerfan494 ◴[] No.41895393[source]
To be honest I've long been puzzled by your response blog post. Maybe the following question can help achieve common ground:

Would you use RedLock in a situation where the timeout is fairly short (1-2 seconds maybe), the work done usually takes ~90% of that timeout, and the work you do while holding a RedLock lock MUST NOT be done concurrently with another lock holder?

I think the correct answer here is always "No", because the risk of the lease sometimes expiring before the client has finished its work is very high. You must alter your work to be idempotent because RedLock cannot guarantee mutual exclusion under all circumstances. Optimistic locking is a good way to implement this type of thing when the work done is idempotent.
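
To make the optimistic locking suggestion concrete, here is a small in-memory sketch. `VersionedStore` and `write_if_version` are hypothetical names for illustration, not an API of Redis or RedLock: each write names the version it read, and a writer holding a stale version is rejected instead of silently overwriting.

```python
class VersionConflict(Exception):
    """Raised when a writer's view of the record is stale."""


class VersionedStore:
    """Hypothetical in-memory store with compare-and-set on a version."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Missing keys start at version 0 with no value.
        return self._data.get(key, (0, None))

    def write_if_version(self, key, expected_version, value):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"expected v{expected_version}, found v{current_version}"
            )
        new_version = current_version + 1
        self._data[key] = (new_version, value)
        return new_version


# Two clients read the same version; only the first write succeeds.
store = VersionedStore()
version_a, _ = store.read("room-12")
version_b, _ = store.read("room-12")
store.write_if_version("room-12", version_a, "booked by A")
try:
    store.write_if_version("room-12", version_b, "booked by B")
    second_write_rejected = False
except VersionConflict:
    second_write_rejected = True
```

The losing client can re-read and retry, which is safe only if the work is idempotent, as the comment says.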

replies(2): >>41895434 #>>41895556 #
1. antirez ◴[] No.41895556[source]
The timeout must be much larger than the time required to do the work. The point is that distributed locks without a release mechanism are in practical terms very problematic.

replies(1): >>41895614 #
2. computerfan494 ◴[] No.41895614[source]
Locking without a timeout is indeed a non-starter in the majority of use cases; we are agreed there.

The critical point that users must understand is that it is impossible to guarantee that the RedLock client never holds its lease longer than the timeout. Compounding this problem, the longer you make your timeout to minimize the likelihood of this accidentally happening, the less responsive your system becomes during genuine client misbehaviour.
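
The gap can be sketched with a simulated clock. This is a toy model under stated assumptions (`FakeClock`, `Lease`, and `guarded_write` are made-up names, and the pause stands in for a GC pause or scheduling delay), not RedLock itself: the client checks its lease, stalls, and then writes after the lease has already expired.

```python
class FakeClock:
    """Simulated time source so the scenario is deterministic."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class Lease:
    """A lease that is valid until clock.now reaches expires_at."""

    def __init__(self, clock, ttl):
        self.clock = clock
        self.expires_at = clock.now + ttl

    def is_valid(self):
        return self.clock.now < self.expires_at


def guarded_write(clock, lease, pause_before_write):
    """Check the lease, stall, then 'write'.

    Returns None if the client gave up because the lease was already
    invalid, otherwise True when the write landed after expiry.
    """
    if not lease.is_valid():
        return None
    clock.advance(pause_before_write)  # pause between check and write
    return not lease.is_valid()


clock = FakeClock()
lease = Lease(clock, ttl=2.0)
clock.advance(1.8)  # the work used ~90% of the timeout
late_write = guarded_write(clock, lease, pause_before_write=0.5)
```

Here the check passes at 1.8 s, but the write happens at 2.3 s, after the 2 s lease ran out. A longer timeout shrinks that window but delays recovery when a client really is stuck.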

replies(1): >>41896193 #
3. antirez ◴[] No.41896193[source]
In most real-world scenarios, the tradeoffs are a bit softer than what people in the formal world dictate (and in doing so they forced certain systems to become suboptimal at everything except during failures, kicking them out of business...). A few examples:

1. An e-commerce system where there is a limited number of items of the same kind and you don't want to oversell.

2. Hotel booking system where we don't want to reserve the same dates/rooms multiple times.

3. Online medical appointments system.

In all those systems, it's OK to re-open the item/date/... after some time, even after one day. And if the lock hold time is not that long, but a much tighter compromise (also a reasonable choice in the spectrum), and during edge-case failures three items get sold when there are only two, orders can be cancelled.
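
As a rough illustration of that recovery path, the sketch below keeps the earliest orders that fit in stock and marks the rest for cancellation. The "earliest order wins" policy, the `(order_id, timestamp)` tuple shape, and the function name are assumptions; a real system would pick its own rule.

```python
def reconcile_oversell(stock, orders):
    """Split orders into (kept, cancelled) after an oversell.

    orders is a list of (order_id, timestamp) tuples. The earliest
    orders are kept until stock runs out (an assumed policy).
    """
    by_time = sorted(orders, key=lambda order: order[1])
    return by_time[:stock], by_time[stock:]


# Three items were sold during a failure, but only two exist.
kept, cancelled = reconcile_oversell(
    stock=2,
    orders=[("o-3", 30.0), ("o-1", 10.0), ("o-2", 20.0)],
)
```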

So yes, there is a tension between timeout, race condition, and recovery time, but in many systems using something like RedLock both the development and the end-user experience can be improved with a high rate of success, and the random unhappy event can be handled. Now the algorithm is very old, still used by many implementations, and as we speak problems are being solved in a straightforward way with very good performance. Of course, the developers of the solution should be aware that there are tradeoffs between certain values: but when are distributed systems easy?

P.S. Why do 10 years of strong usage count, in the face of a blog post saying that you can't trust a system like that? Because even if DS issues emerge randomly and sporadically, in the long run systems that create real-world issues, if they reach mass usage, become known. A big enough user base is a continuous integration test big enough to detect when a solution has serious real-world issues. So of course RedLock users picking short timeouts for tasks that take a very hard-to-predict amount of time will indeed run into known issues. But the other systemic failure modes described in the blog post are never mentioned by users, AFAIK.

replies(1): >>41896517 #
4. computerfan494 ◴[] No.41896517[source]
I feel like you're dancing around admitting the core issue that Martin points out - RedLock is not suitable for systems where correctness is paramount. It can get close, but it is not robust in all cases.

If you want to say "RedLock is correct a very high percentage of the time when lease timeouts are tuned for the workload", I would agree with you actually. I even possibly agree with the statements "most systems can tolerate unlikely correctness failures due to RedLock lease violations. Manual intervention is fine in those cases. RedLock may allow fast iteration times and is worth this cost". I just think it's important to be crystal clear on the guarantees RedLock provides.

I first read Martin's blog post and your response years ago when I worked at a company that was using RedLock despite it not being an appropriate tool. We had an outage caused by overlapping leases because the original implementor of the system didn't understand what Martin has pointed out from the RedLock documentation alone.

I've been a happy Redis user and fan of your work outside of this poor experience with RedLock, by the way. I greatly appreciate the hard work that has gone into making it a fantastic database.