# Distributed Locking
A distributed lock ensures that only one process across a cluster can enter a critical section at a time. The core problem: how do you coordinate access to a shared resource when the processes live on different machines and communicate only over a network that can fail?
## Why Distributed Locks Are Hard
A local mutex works because a single machine maintains authoritative state. In a distributed system:
- The lock server can crash or become unreachable
- The lock holder can pause (GC stop-the-world, OS scheduling, network stall) and resume only after its TTL has expired
- Clocks on different machines drift relative to each other
- The network can partition — the lock holder can’t tell if it still holds the lock
The fundamental tension is between safety (only one holder at a time) and liveness (the lock eventually becomes available if the holder fails).
## Redis SETNX: The Simple Case
The single-node Redis lock uses one atomic command:
```
SET lock_key <unique_value> NX PX <ttl_ms>
```

- `NX` — set only if the key does not exist (atomic check-and-set)
- `PX <ttl_ms>` — auto-expire the key after TTL milliseconds (releases the lock if the holder crashes)
- `<unique_value>` — a random UUID generated by the requester (critical for safe release)
```mermaid
sequenceDiagram
    participant A as Process A
    participant B as Process B
    participant R as Redis
    A->>R: SET lock:inventory abc123 NX PX 30000
    R->>A: OK (lock acquired, expires in 30s)
    B->>R: SET lock:inventory xyz789 NX PX 30000
    R->>B: (nil) — lock already held by A
    Note over A: Do protected work
    A->>R: Release lock (Lua script)
    R->>A: Lock released
    B->>R: SET lock:inventory xyz789 NX PX 30000
    R->>B: OK (lock acquired now)
```

### Why the Unique Value Matters
Without a unique value, Process A could accidentally release Process B’s lock:
```
t=0s:  A acquires lock (TTL=30s)
t=28s: A is paused (GC stop-the-world)
t=30s: Lock TTL expires
t=31s: B acquires the same lock
t=32s: A wakes up, calls DEL lock:inventory
       → A just deleted B's lock — B is no longer protected
```

With a unique value, the release is safe — only the original holder can release:
```lua
-- Atomic Lua script: check-then-delete
-- Must be atomic to avoid a TOCTOU race condition
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0 -- not our lock, do nothing
end
```

```python
import uuid

# Acquire
lock_value = str(uuid.uuid4())
acquired = redis.set("lock:inventory", lock_value, nx=True, px=30000)

# Release (run the Lua script above, e.g. via EVAL)
release_lock(redis, "lock:inventory", lock_value)
```

## The TTL Expiry Problem
Even with a unique value, a process pause longer than the TTL causes two processes to be in the critical section simultaneously:
```
t=0s:  A acquires lock, TTL=30s
t=5s:  A starts slow database query
t=30s: Lock TTL expires (A is still working — unaware)
t=31s: B acquires the same lock
t=32s: A and B are both in the critical section ← safety violation
t=35s: A finishes and releases — releases B's lock!
```

**Lock renewal (heartbeating):** A background thread in Process A sends `PEXPIRE lock:inventory 30000` every 10 seconds while the lock is held. If A dies, the heartbeat stops and the lock expires normally. Libraries like Redisson (Java) handle this automatically.
**Risk:** If the lock renewal itself fails (Redis unreachable for 30 seconds), A continues working but the lock is gone. This is an inherent limitation of any TTL-based lock.
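The renewal loop can be sketched as a daemon thread. This is a minimal sketch under stated assumptions: `renew_fn` is a stand-in for the real `PEXPIRE` call against Redis, and all names here are illustrative rather than a real library API.

```python
import threading
import time

def start_heartbeat(renew_fn, interval_s, stop_event):
    """Renew the lock every interval_s seconds until stop_event is set.

    renew_fn stands in for e.g. redis.pexpire(key, ttl_ms). If it returns
    False (the lock is gone) or raises, the holder should treat the lock
    as lost and abort its critical section.
    """
    def loop():
        # Event.wait returns True once stop_event is set, ending the loop.
        while not stop_event.wait(interval_s):
            try:
                if not renew_fn():
                    stop_event.set()  # lost the lock; stop renewing
                    return
            except Exception:
                stop_event.set()      # renewal failed; treat lock as lost
                return
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t

# Usage with a stand-in renew function that always succeeds
renewals = []
stop = threading.Event()
t = start_heartbeat(lambda: renewals.append(1) or True, 0.05, stop)
time.sleep(0.2)   # hold the "lock" for a while; a few renewals fire
stop.set()        # done with the critical section
t.join()
```

A real implementation must also decide what the worker does when renewal fails mid-operation; stopping the heartbeat alone does not interrupt the protected work.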
## Redlock: Multi-Node Quorum
Redlock addresses single-node Redis failure by spreading the lock across N independent Redis nodes (no replication between them). The lock is acquired only if a majority (⌊N/2⌋ + 1) of the nodes grant it.
### Algorithm
```
Setup: 5 independent Redis nodes (N=5, quorum=3)

1. Record start time T1
2. For each of the 5 nodes, attempt:
       SET lock:resource <uuid> NX PX <ttl_ms>
   with a small per-node timeout (e.g., 5ms) to avoid blocking on slow nodes
3. Record elapsed time: elapsed = now() - T1
4. Lock is acquired IF:
   - acquired on ≥ 3 nodes (quorum), AND
   - elapsed < TTL (there is still valid lock time remaining)
5. Effective lock time = TTL - elapsed
6. To release: run the Lua delete script on ALL 5 nodes
   (regardless of which granted the lock)
```

```mermaid
sequenceDiagram
    participant A as Process A
    participant R1 as Redis Node 1
    participant R2 as Redis Node 2
    participant R3 as Redis Node 3
    participant R4 as Redis Node 4
    participant R5 as Redis Node 5
    Note over A: T1 = now()
    A->>R1: SET lock NX PX 10000
    R1->>A: OK ✓
    A->>R2: SET lock NX PX 10000
    R2->>A: OK ✓
    A->>R3: SET lock NX PX 10000
    R3->>A: OK ✓
    A->>R4: SET lock NX PX 10000
    R4->>A: (nil) — already held
    A->>R5: SET lock NX PX 10000
    Note over R5: Node 5 is down
    Note over A: Acquired 3/5 nodes in 8ms<br/>Effective TTL = 10000 - 8 = 9992ms<br/>Quorum met (3 ≥ 3) — lock acquired ✓
```

**Why majority quorum provides safety:** If Process A holds the lock on nodes {1, 2, 3} and Process B tries to acquire, B cannot get 3 nodes — at best it gets nodes {4, 5} and fails to reach quorum. Any two majorities overlap by at least one node.
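The acquisition steps above can be sketched in Python. This is an illustrative sketch, not a production Redlock client: `per_node_set` stands in for `SET key value NX PX ttl` against one node, and per-node timeouts, retries, and clock-drift compensation are omitted.

```python
import time
import uuid

def acquire_redlock(nodes, key, ttl_ms, per_node_set):
    """Attempt the lock on every node; succeed only on majority quorum.

    per_node_set(node, key, value, ttl_ms) stands in for
    SET key value NX PX ttl_ms and returns True if that node granted it.
    """
    value = str(uuid.uuid4())
    quorum = len(nodes) // 2 + 1
    t1 = time.monotonic()
    granted = sum(1 for n in nodes if per_node_set(n, key, value, ttl_ms))
    elapsed_ms = (time.monotonic() - t1) * 1000
    validity_ms = ttl_ms - elapsed_ms
    if granted >= quorum and validity_ms > 0:
        return value, validity_ms  # caller must finish within validity_ms
    return None  # failed: caller should release on ALL nodes, then retry later

# Usage with dict-based stand-in nodes (no TTL timer modeled)
nodes = [{}, {}, {}, {}, {}]

def fake_set(node, key, value, ttl_ms):
    if key in node:       # NX semantics: fail if the key exists
        return False
    node[key] = value
    return True

result = acquire_redlock(nodes, "lock:resource", 10_000, fake_set)
assert result is not None          # 5/5 grants ≥ quorum of 3
assert acquire_redlock(nodes, "lock:resource", 10_000, fake_set) is None
```

Note that on failure the caller must still run the release script against all nodes, since some of them may have granted the lock before quorum was missed.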
## The Controversy: Clock Pause + Redlock
Martin Kleppmann’s 2016 analysis showed that Redlock is still unsafe for correctness-critical operations even with quorum:
```
t=0:     A acquires Redlock on 3/5 nodes (TTL=10s)
t=9.9s:  A is paused (OS scheduler, GC)
t=10s:   Locks on all 3 nodes expire
t=10.1s: B acquires Redlock on 3/5 nodes
t=10.2s: A wakes up — thinks it still holds the lock
t=10.3s: A and B are both in the critical section ← safety violation
```

No amount of quorum or TTL tuning can prevent a process from pausing longer than its TTL. Antirez (Redis creator) responded that this failure mode applies to any TTL-based lock and is acceptable for efficiency locks (e.g., avoiding duplicate work) but not correctness locks (e.g., financial operations).
Practical verdict:
- Use Redlock for coordinating background jobs — if two workers occasionally both process the same task, it’s an efficiency issue, not a correctness failure
- Use fencing tokens (below) when the correctness of the resource operation matters
## Fencing Tokens: The Correct Solution
A fencing token is a monotonically increasing number issued by the lock server. The resource server validates it — if a stale lock holder tries to write with an old token, the write is rejected.
```mermaid
sequenceDiagram
    participant A as Process A
    participant B as Process B
    participant LS as Lock Server
    participant RS as Resource Server
    A->>LS: Acquire lock
    LS->>A: Lock granted, token=33
    Note over A: A pauses (GC) — lock expires
    B->>LS: Acquire lock
    LS->>B: Lock granted, token=34
    B->>RS: Write data (token=34)
    RS->>RS: 34 > last_token → accept, last_token = 34
    RS->>B: Write OK
    Note over A: A wakes up — thinks it still has the lock
    A->>RS: Write data (token=33)
    RS->>RS: 33 ≤ last_token (34) → REJECT ✓
    RS->>A: Error: fencing token too old
```

**How it works:**
- Lock server increments a counter on every lock grant
- Lock holder includes the token with every resource operation
- Resource server tracks the highest token it has seen
- Any operation with a token ≤ last_seen_token is rejected as stale
This is the only mechanism that provides safety regardless of how long a process pauses. It requires the resource server to be aware of tokens — this is not always possible (e.g., a filesystem does not validate fencing tokens natively).
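The resource-server side of the check takes only a few lines. `FencedResource` is a hypothetical in-memory stand-in; a real resource server would persist `last_token` atomically alongside the data.

```python
class FencedResource:
    """Stand-in resource server that rejects stale lock holders via
    fencing tokens."""

    def __init__(self):
        self.last_token = 0  # highest token seen so far
        self.data = None

    def write(self, token, value):
        # Any token <= last_token means the writer's lock is stale:
        # a later holder has already been granted a higher token.
        if token <= self.last_token:
            raise PermissionError(
                f"fencing token {token} too old (last seen {self.last_token})")
        self.last_token = token
        self.data = value

rs = FencedResource()
rs.write(34, "B's update")       # B writes with the newer token: accepted
try:
    rs.write(33, "A's update")   # A, paused past its lease, is rejected
except PermissionError:
    pass
assert rs.data == "B's update"   # B's write survives
```

The check-and-update of `last_token` must itself be atomic on the resource server, for the same TOCTOU reason as the Lua release script.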
ZooKeeper’s zxid as a fencing token: ZooKeeper’s transaction ID (zxid) is monotonically increasing across the cluster. After acquiring a ZooKeeper lock, pass the zxid to your resource server as the fencing token.
## ZooKeeper: Stronger Guarantees via Ephemeral Sequential Znodes
ZooKeeper provides distributed locking with two key primitives that eliminate the TTL/expiry problem entirely:
- Ephemeral nodes — automatically deleted when the client session ends (crash, network failure, clean close). No TTL needed.
- Sequential nodes — ZooKeeper appends a monotonically increasing sequence number to the node name.
### Lock Implementation
```mermaid
sequenceDiagram
    participant A as Client A
    participant B as Client B
    participant C as Client C
    participant ZK as ZooKeeper
    A->>ZK: create /locks/resource-SEQUENTIAL (ephemeral)
    ZK->>A: /locks/resource-0000000001
    B->>ZK: create /locks/resource-SEQUENTIAL (ephemeral)
    ZK->>B: /locks/resource-0000000002
    C->>ZK: create /locks/resource-SEQUENTIAL (ephemeral)
    ZK->>C: /locks/resource-0000000003
    A->>ZK: list /locks/ → [0001, 0002, 0003]
    Note over A: 0001 is lowest → lock acquired ✓
    B->>ZK: list /locks/ → [0001, 0002, 0003]
    Note over B: 0002 is not lowest —<br/>watch node 0001 for deletion
    Note over A: A crashes — ZooKeeper detects session timeout
    ZK->>ZK: delete /locks/resource-0000000001 (ephemeral)
    ZK->>B: watch fired: 0001 deleted
    B->>ZK: list /locks/ → [0002, 0003]
    Note over B: 0002 is lowest → lock acquired ✓
```

**No thundering herd:** Each waiter watches only the node immediately before it — not all nodes. When a lock is released, only one waiter is notified. This prevents N−1 clients all waking up simultaneously when a lock is released.
**Automatic cleanup:** Ephemeral nodes require no cleanup code. If the lock holder’s process or network dies, ZooKeeper’s session expiration removes the node automatically — no TTL tuning needed.
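The lowest-sequence-number decision can be sketched independently of any ZooKeeper client. `lock_status` is a hypothetical helper; it relies on ZooKeeper's zero-padded 10-digit sequence suffixes, which make a lexical sort agree with numeric order.

```python
def lock_status(children, my_node):
    """Given the children of /locks (e.g. ['resource-0000000002', ...]) and
    our own node name, decide whether we hold the lock and, if not, which
    node to watch: the one immediately before ours, which is what avoids
    the thundering herd."""
    ordered = sorted(children)  # zero-padded suffixes sort numerically
    idx = ordered.index(my_node)
    if idx == 0:
        return True, None               # lowest sequence: we hold the lock
    return False, ordered[idx - 1]      # watch only our predecessor

children = ["resource-0000000003", "resource-0000000001", "resource-0000000002"]
assert lock_status(children, "resource-0000000001") == (True, None)
assert lock_status(children, "resource-0000000002") == (False, "resource-0000000001")

# After the holder's session expires and its znode disappears:
assert lock_status(["resource-0000000002", "resource-0000000003"],
                   "resource-0000000002") == (True, None)
```

In a real client this check reruns each time the watch on the predecessor fires, since that predecessor may itself have been a waiter that gave up rather than the lock holder.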
## Comparison: Redis vs ZooKeeper
| Property | Redis SETNX | Redlock | ZooKeeper |
|---|---|---|---|
| Setup complexity | Very low | Medium (5 nodes) | High (separate cluster) |
| Lock release on crash | TTL expiry (delayed) | TTL expiry (delayed) | Immediate (ephemeral node) |
| Fencing tokens | No (must add separately) | No | Yes (zxid) |
| Consistency model | AP — lock can be lost on failover to a stale replica | Quorum-based | CP — minority partition unavailable |
| Throughput | ~100k ops/s | Lower (5 round trips) | ~10–50k ops/s |
| False safety window | Yes (TTL-based) | Yes (TTL-based, smaller) | Minimal (session-based) |
| Best for | Efficiency locks, cache coordination | HA efficiency locks | Correctness locks, leader election |
## When NOT to Use Distributed Locks
Distributed locks are often reached for prematurely. Most of the problems they solve have better alternatives.
### Better Alternatives
| Problem | Naive approach | Better approach |
|---|---|---|
| Prevent double payment | Lock on payment operation | Idempotency key + DB unique constraint |
| Prevent inventory oversell | Lock on inventory check | `UPDATE inventory SET stock = stock - 1 WHERE stock > 0` (atomic decrement) |
| Only one worker processes a job | Lock on job ID | Message queue with single consumer per partition (Kafka partition, SQS FIFO) |
| Cache stampede | Lock on cache miss | Redis SETNX per cache key + probabilistic early refresh |
| Cron job deduplication | Distributed lock | DB `INSERT ... ON CONFLICT DO NOTHING` with row expiry |
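The idempotency-key row from the table can be sketched against SQLite as a stand-in; the same pattern works with any database that enforces unique constraints, and the schema and function names here are illustrative.

```python
import sqlite3

# An idempotency key is just a client-chosen unique ID for the request.
# The database's unique constraint, not a lock, prevents double processing.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (idempotency_key TEXT PRIMARY KEY, amount INT)")

def charge(key, amount):
    try:
        conn.execute("INSERT INTO payments VALUES (?, ?)", (key, amount))
        return "charged"
    except sqlite3.IntegrityError:
        # Same key already inserted: this is a retry of a request we
        # have already processed, so do nothing.
        return "duplicate"

assert charge("req-42", 100) == "charged"
assert charge("req-42", 100) == "duplicate"  # retried request, no double charge
```

No coordination service is involved: the client can retry freely after timeouts, and correctness holds even if two retries race, because only one `INSERT` can win.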
### Optimistic Concurrency: No Lock at All
For read-modify-write patterns, optimistic concurrency is often safer and more performant than a lock:
```sql
-- Read: get current version
SELECT stock, version FROM inventory WHERE product_id = 42;
-- Returns: stock=10, version=7

-- Modify in application: new_stock = 10 - 1 = 9

-- Write: only update if the version hasn't changed (no lock held!)
UPDATE inventory
SET stock = 9, version = 8
WHERE product_id = 42 AND version = 7;

-- Check rows_affected:
--   1 row  → success, no concurrent modification
--   0 rows → version changed by another writer → retry
```

This is effectively a compare-and-swap. Under low contention it significantly outperforms locks (no waiting). Under high contention, retries increase — locks may be better.
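The retry loop around the compare-and-swap can be sketched in Python, with SQLite as a stand-in database; `decrement_stock` and the schema are illustrative, not a library API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (product_id INT PRIMARY KEY, stock INT, version INT)")
conn.execute("INSERT INTO inventory VALUES (42, 10, 7)")

def decrement_stock(product_id, retries=3):
    """Optimistic read-modify-write: retry on version conflict, never lock."""
    for _ in range(retries):
        stock, version = conn.execute(
            "SELECT stock, version FROM inventory WHERE product_id = ?",
            (product_id,)).fetchone()
        cur = conn.execute(
            "UPDATE inventory SET stock = ?, version = ? "
            "WHERE product_id = ? AND version = ?",
            (stock - 1, version + 1, product_id, version))
        if cur.rowcount == 1:
            return True   # CAS succeeded: no concurrent writer interfered
        # rowcount == 0: another writer bumped the version; re-read and retry
    return False          # contention too high; caller decides what to do

assert decrement_stock(42)
assert conn.execute(
    "SELECT stock, version FROM inventory WHERE product_id = 42"
).fetchone() == (9, 8)
```

The retry budget is the tuning knob: under heavy contention most attempts hit `rowcount == 0`, which is the regime where a lock can become cheaper than repeated retries.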
### The Cost of Locks
Every lock acquisition is at minimum one network round trip:
- Redis SETNX: 1 RTT (~1ms local, ~10ms cross-region)
- Redlock: 5 parallel RTTs (duration of slowest)
- ZooKeeper: 1 RTT + session maintenance overhead
Locks also add failure modes: What if lock acquisition fails? What if the lock holder crashes mid-operation without releasing? What if the lock TTL is too short under heavy load? Each case requires explicit handling code.
The rule: If you can make the operation idempotent and use a DB constraint or atomic increment, do that instead. Use distributed locks only when you need to coordinate access to an external resource that doesn’t support transactions or atomic operations.
A common interview mistake: proposing a distributed lock for every race condition. The follow-up question is always “what happens if the lock holder crashes after acquiring the lock but before completing the operation?” If the operation is not idempotent, a distributed lock just moves the problem — you still need to handle partial state. Design the operation to be idempotent first; add a lock only if you truly need mutual exclusion for the duration of execution (not just for deduplication).