240 points yusufaytas | 5 comments
jroseattle No.41895613
We reviewed Redis back in 2018 as a potential solution for our use case. In the end, we opted for a less sexy solution (not Redis) that never failed us, no joke.

Our use case: handing out a ticket (something with an identifier) from a finite set of tickets from a campaign. It's something akin to Ticketmaster allocating seats in a venue for a concert. Our operation was as you might expect: provide a ticket to a request if one is available, assign some metadata from the request to the allocated ticket, and remove it from consideration for future client requests.

We had failed campaigns in the past (over-allocation, under-allocation, duplicate allocation, etc.), so our primary concern was accuracy. Clients would connect and request a ticket; we wanted to distribute only the tickets available in the pool, each to exactly one client. If the number of client requests exceeded the number of tickets, the system had to protect against that.

We tried Redis, including the naive implementation: get the lock, check the lock, do our thing, release the lock. It was OK, but the administrative overhead was too much for us at the time. I'm glad we didn't go that route, though.

We ultimately settled on...Postgres. Our "distributed lock" was just a composite UPDATE statement using some Postgres-specific features. We effectively turned requests into a SET operation, where the database would return either a record that indicated the request was successful, or something that indicated it failed. ACID transactions for the win!
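
For illustration only, here is a minimal sketch of that kind of single-statement allocation; the actual statement isn't shown above, and the tickets table, its columns, and the literal values here are hypothetical:

    UPDATE tickets
    SET    claimed_by = 'request-123', claimed_at = now()
    WHERE  id = (
        SELECT id
        FROM   tickets
        WHERE  campaign_id = 42
          AND  claimed_by IS NULL
        LIMIT  1
        FOR UPDATE SKIP LOCKED   -- Postgres-specific: skip rows another request already holds
    )
    RETURNING id;                -- one row back = your ticket; zero rows = pool exhausted

Concurrent requests can never claim the same row, and when the pool is exhausted a request simply gets zero rows back, which maps cleanly to the "no ticket available" case.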

With accuracy solved, we next looked at scale/performance. We didn't need to support millions of requests/sec, but we did have some spikiness thresholds. We were able to optimize read/write DB instances within our cluster and strategically place larger, higher-demand campaigns on dedicated systems. We kept optimizing over the next two years, and not once did we have a campaign with ticket-distribution failures.

Note: I am not an expert of any kind in distributed-lock technology. I'm just someone who did their homework, focused on the problem to be solved, and found a solution after trying a few things.

replies(8): >>41895681 #>>41895829 #>>41895977 #>>41896180 #>>41896281 #>>41896833 #>>41897029 #>>41897993 #
1. nh2 No.41895829
You are right that anything that needs up to 50000 atomic, short-lived transactions per second can just use Postgres.

Your UPDATE transaction lasts just a few microseconds, so you can just centralise the problem, and that's good: it's simpler, faster, and safer.

But this is not a _distributed_ problem, as the article explains:

> remember that a lock in a distributed system is not like a mutex in a multi-threaded application. It’s a more complicated beast, due to the problem that different nodes and the network can all fail independently in various ways

You need distributed locking if the transactions can take seconds or hours, and the machines involved can fail while they hold the lock.

replies(2): >>41897501 #>>41900138 #
2. fny No.41897501
You could just have multiple clients attempt to update a row that defines the lock. Postgres transactions have no time limit and will unwind on client failure. Since connections are persistent, there's no need to play a game to determine the state of a client.
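
As a rough sketch of that pattern (the locks table and its name column are made up, not from the thread):

    BEGIN;
    -- Blocks until no other transaction holds the row lock on this row.
    SELECT 1 FROM locks WHERE name = 'campaign-42' FOR UPDATE;
    -- ...do the protected work...
    COMMIT;
    -- The row lock is released on COMMIT/ROLLBACK, and also if the client's
    -- connection dies and the server aborts the open transaction.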
replies(1): >>41897724 #
3. nh2 No.41897724
Your scenario still uses a single, centralised Postgres server. Failure of that server takes down all locking functionality. That's not what people usually mean by "distributed".

"the machines involved can fail" must also include the postgres machines.

To get that, you need to coordinate multiple Postgres servers, e.g. using ... distributed locking. Postgres does not provide that out of the box -- neither multi-master setups nor master-standby synchronous replication with automatic failover. Wrapper software that provides it, such as Stolon and Patroni, uses distributed KV stores / lock managers such as etcd and Consul to do so.

4. jroseattle No.41900138
> up to 50000 atomic, short-lived transactions per second

50000?

> You need distributed locking if the transactions can take seconds or hours, and the machines involved can fail while they hold the lock.

From my experience, locks are needed to ensure synchronized access to resources. Distributed locks are a form of that isolation being held across computing processes, as opposed to the mutex example provided.

And while our implementation definitely did not use a distributed lock, we could still see those machines fail.

I fail to understand why a distributed lock is needed for anything due to its duration.

replies(1): >>41901116 #
5. throwawaythekey No.41901116
Mostly guessing, but duration is usually inversely correlated with throughput.

If you require high throughput and have long durations, then partitioning/distribution is the normal solution.