
Why Redis Redlock May Not Be Safe: A Deep Dive into the Redlock Debate

An in‑depth review of the heated debate between Redis creator antirez and distributed‑systems expert Martin Kleppmann over the safety of Redis’s Redlock algorithm, covering single‑node lock pitfalls, failover issues, timing assumptions, fencing tokens, and practical recommendations for when to use Redlock versus simpler locks.

dbaplus Community

Single‑Node Redis Lock

The lock is acquired with a single SET command that sets a unique random value and an expiration:

SET resource_name my_random_value NX PX 30000

Parameters:

my_random_value – client‑generated random string, unique for the lock’s lifetime.

NX – set only if the key does not exist, guaranteeing exclusive acquisition.

PX 30000 – automatic expiration (30 s in this example; configurable).

Lock release must be atomic. The recommended Lua script checks the stored value before deleting:

if redis.call("get",KEYS[1]) == ARGV[1] then
  return redis.call("del",KEYS[1])
else
  return 0
end

The script receives my_random_value as ARGV[1] and resource_name as KEYS[1].
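The acquire and release steps above can be sketched in Python. This is a minimal illustration, not a production client: InMemoryRedis is a hypothetical stand-in for a single Redis node so the example runs without a server, and the helper names acquire and release are assumptions made for this sketch.

```python
import secrets
import time


class InMemoryRedis:
    """Hypothetical stand-in for one Redis node: SET NX PX and compare-and-delete only."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expiry time in seconds)

    def _live_value(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily drop the key once its TTL has passed
            return None
        return value

    def set_nx_px(self, key, value, ttl_ms):
        """Mimics SET key value NX PX ttl_ms; returns True if the key was set."""
        if self._live_value(key) is not None:
            return False
        self._data[key] = (value, self._clock() + ttl_ms / 1000)
        return True

    def compare_and_delete(self, key, expected):
        """Mimics the Lua release script; returns 1 if deleted, else 0."""
        value = self._live_value(key)
        if value is not None and value == expected:
            del self._data[key]
            return 1
        return 0


def acquire(node, resource, ttl_ms=30000):
    """Return the random token on success, or None if the lock is already held."""
    token = secrets.token_hex(16)
    return token if node.set_nx_px(resource, token, ttl_ms) else None


def release(node, resource, token):
    """Release the lock only if the stored value still matches our token."""
    return node.compare_and_delete(resource, token) == 1


node = InMemoryRedis()
token_a = acquire(node, "resource_name")    # succeeds
token_b = acquire(node, "resource_name")    # None: the first client still holds the lock
stale_release = release(node, "resource_name", "not-my-token")  # refused
owner_release = release(node, "resource_name", token_a)         # accepted
```

Against a real server, the two node methods should correspond to redis-py's `client.set(resource, token, nx=True, px=ttl_ms)` and `client.eval(script, 1, resource, token)` with the Lua script shown earlier.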

Key Problems with the Single‑Node Approach

Without an expiration, a crashed client or network partition leaves the lock held forever.

Splitting the command into SETNX followed by EXPIRE is not atomic; a crash after SETNX makes the lock permanent.

Using a fixed value instead of a random string allows a later client to release a lock it never acquired.

Releasing the lock must be atomic; a three‑step GET → compare → DEL sequence suffers the same race conditions as a non‑atomic acquisition.

During failover, if the master crashes before replicating the lock key, a client may acquire the same lock on the promoted slave, breaking safety.
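The failover problem in the last item can be reproduced with a toy model. The sketch below assumes a hypothetical Node class and a ReplicatedPair whose replication only happens when replicate() is called, which stands in for asynchronous replication; expiry is omitted to keep it short.

```python
class Node:
    """Toy Redis node holding keys in a dict (expiry omitted for brevity)."""

    def __init__(self):
        self.data = {}

    def set_nx(self, key, value):
        if key in self.data:
            return False
        self.data[key] = value
        return True


class ReplicatedPair:
    """Master plus one replica; replication is asynchronous (explicit here)."""

    def __init__(self):
        self.master = Node()
        self.replica = Node()

    def replicate(self):
        self.replica.data = dict(self.master.data)

    def failover(self):
        # The master is lost; the replica is promoted with whatever it has.
        self.master = self.replica
        self.replica = Node()


pair = ReplicatedPair()
client_a_has_lock = pair.master.set_nx("resource_name", "token-a")
# The master crashes before replicate() runs, so the key never reaches the replica.
pair.failover()
client_b_has_lock = pair.master.set_nx("resource_name", "token-b")
both_hold_lock = client_a_has_lock and client_b_has_lock
```

In this model both clients end up believing they hold the lock, which is exactly the safety violation that motivates Redlock.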

Redlock Algorithm

To mitigate failover issues, Redlock uses N independent Redis masters (commonly five). The client performs:

1. Record the current time in milliseconds.

2. Attempt to acquire the lock on each node in turn with the same SET resource_name my_random_value NX PX 30000 command, using a short per‑node timeout (much shorter than the lock’s validity time) so that an unreachable node does not stall the client.

3. Calculate the total elapsed time. If the lock was obtained on a majority (at least ⌊N/2⌋ + 1) of nodes **and** the elapsed time is less than the lock’s validity time, the lock is considered acquired.

4. If successful, recompute the remaining validity as original_validity – elapsed_time.

5. If not successful, immediately release any partial locks on **all** nodes using the Lua script above.

Releasing a Redlock‑acquired lock is simply running the same Lua script on every node, regardless of whether that node granted the lock.

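Redlock keeps working as long as a majority of nodes is reachable, so it tolerates the failure of up to ⌊(N − 1)/2⌋ nodes (two out of five). It still relies on timing assumptions, however, and a node that restarts after losing un‑persisted lock data can grant the same lock a second time.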
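The acquisition and release procedure described in this section can be sketched as follows. This is a simplified model under stated assumptions: FakeClock and Node are hypothetical in-memory stand-ins, each call to a node costs a fixed latency instead of using a real per‑node timeout, and clock‑drift compensation is left out.

```python
import secrets


class FakeClock:
    """Manually advanced clock (milliseconds) so the example is deterministic."""

    def __init__(self):
        self.now_ms = 0

    def advance(self, ms):
        self.now_ms += ms


class Node:
    """Toy Redis master; up=False simulates an unreachable node."""

    def __init__(self, clock, up=True, latency_ms=1):
        self.clock, self.up, self.latency_ms = clock, up, latency_ms
        self.data = {}  # key -> (value, expires_at_ms)

    def set_nx_px(self, key, value, ttl_ms):
        self.clock.advance(self.latency_ms)  # every attempt costs some time
        if not self.up:
            return False
        entry = self.data.get(key)
        if entry is not None and entry[1] > self.clock.now_ms:
            return False  # key exists and has not expired
        self.data[key] = (value, self.clock.now_ms + ttl_ms)
        return True

    def compare_and_delete(self, key, expected):
        entry = self.data.get(key)
        if self.up and entry is not None and entry[0] == expected:
            del self.data[key]


def redlock_release(nodes, resource, token):
    for node in nodes:  # every node, even those that did not grant the lock
        node.compare_and_delete(resource, token)


def redlock_acquire(nodes, clock, resource, ttl_ms):
    """Return (token, remaining_validity_ms), or None if acquisition failed."""
    token = secrets.token_hex(16)
    start = clock.now_ms                                  # record the start time
    granted = sum(n.set_nx_px(resource, token, ttl_ms) for n in nodes)
    elapsed = clock.now_ms - start                        # total elapsed time
    quorum = len(nodes) // 2 + 1
    if granted >= quorum and elapsed < ttl_ms:
        return token, ttl_ms - elapsed                    # remaining validity
    redlock_release(nodes, resource, token)               # clean up partial locks
    return None


clock = FakeClock()
nodes = [Node(clock) for _ in range(3)] + [Node(clock, up=False) for _ in range(2)]
first = redlock_acquire(nodes, clock, "resource_name", ttl_ms=30000)
second = redlock_acquire(nodes, clock, "resource_name", ttl_ms=30000)
```

With three of five nodes reachable, the first call reaches the quorum and the second is refused because the keys are still live on those three nodes.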

Martin Kleppmann’s Critique

In his 2016 blog post (https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html), Kleppmann makes two major points:

Even a perfect lock with automatic expiration cannot guarantee safety without a fencing token that the protected resource can verify. A monotonically increasing token is attached to each lock acquisition; the resource rejects operations bearing an older token.

Redlock’s safety depends on strong timing assumptions: bounded network delays, bounded process pauses, and clocks that advance at roughly the same rate on every node. In an asynchronous model these assumptions can break, allowing scenarios where two clients simultaneously believe they hold the lock.

He illustrates the first point with a timeline in which a client experiences a long GC pause after acquiring the lock, so the lock expires while the client still believes it holds it. Another client then acquires the lock, and when the first client resumes, both write to the shared resource, leading to conflicting writes.
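The fencing‑token idea can be illustrated with a small sketch of the GC‑pause timeline. LockService and FencedStorage are hypothetical names invented for this example; the only assumption carried over from the critique is that the lock service issues strictly increasing tokens and the resource checks them.

```python
class LockService:
    """Toy lock service that hands out a strictly increasing fencing token."""

    def __init__(self):
        self._counter = 0

    def acquire(self):
        self._counter += 1
        return self._counter


class FencedStorage:
    """Protected resource that rejects writes carrying an older token."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            return False  # stale lock holder; refuse the write
        self.highest_token = token
        self.value = value
        return True


locks = LockService()
storage = FencedStorage()

token_1 = locks.acquire()   # client 1 acquires the lock, then stalls (e.g. a GC pause)
# ... client 1's lock expires while it is paused ...
token_2 = locks.acquire()   # client 2 acquires the lock and gets a newer token
client_2_ok = storage.write(token_2, "from client 2")   # accepted
client_1_ok = storage.write(token_1, "from client 1")   # rejected: token is older
```

Note that the random value used by Redlock is unique but not ordered, so it cannot serve as a fencing token on its own; the counter here would have to come from some other source.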

Antirez’s Counter‑Arguments

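Antirez notes that Redlock already checks whether the elapsed acquisition time exceeds the lock’s validity, so a delay that occurs while the lock is being acquired is detected and the acquisition is rejected. A pause that begins after this check is not covered; he argues that this limitation applies to any lock with automatic expiration, and he acknowledges that timing‑related failures remain a weakness.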

He also emphasizes that the release step must be performed on **all** nodes, even those that did not grant the lock, because a response loss could mean the lock was actually set on a node that the client thinks failed.
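The lost‑response case can be shown with a toy model. In the sketch below, Node and try_lock are hypothetical; a node created with drop_reply=True applies the write but raises TimeoutError, which stands in for a reply that never reaches the client. Expiry is omitted for brevity.

```python
class Node:
    """Toy Redis node; drop_reply=True applies the write but loses the response."""

    def __init__(self, drop_reply=False):
        self.drop_reply = drop_reply
        self.data = {}

    def set_nx(self, key, value):
        if key in self.data:
            return False
        self.data[key] = value
        if self.drop_reply:
            raise TimeoutError("reply lost")  # the key WAS set on the node
        return True

    def compare_and_delete(self, key, expected):
        if self.data.get(key) == expected:
            del self.data[key]


def try_lock(nodes, key, token):
    """Return the nodes the client believes granted the lock."""
    granted = []
    for node in nodes:
        try:
            if node.set_nx(key, token):
                granted.append(node)
        except TimeoutError:
            pass  # from the client's point of view this node failed
    return granted


nodes = [Node(), Node(), Node(drop_reply=True)]
granted = try_lock(nodes, "resource_name", "token-a")

# Releasing only on the nodes that answered leaves a stray key behind.
for node in granted:
    node.compare_and_delete("resource_name", "token-a")
stray_after_partial_release = sum("resource_name" in n.data for n in nodes)

# Releasing on every node clears it.
for node in nodes:
    node.compare_and_delete("resource_name", "token-a")
stray_after_full_release = sum("resource_name" in n.data for n in nodes)
```

On a real node the stray key would eventually expire, but until then it would block other clients from acquiring the lock on that node.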

Usage Scenarios and Recommendations

Efficiency‑oriented use cases: occasional lock failures are tolerable (e.g., duplicate email sending). A simple single‑node lock may be sufficient.

Correctness‑oriented use cases: any lock failure is unacceptable (e.g., financial transactions). Redlock is unsuitable; alternatives such as ZooKeeper, etcd, or transactional databases that provide strong consistency and fencing mechanisms should be used.

Open Questions

Open issues include the practicality of implementing fencing tokens across distributed resource servers, whether a lock + fencing scheme can ever be provably correct, and how to choose an appropriate lock validity time that balances expiration risk against prolonged lock retention.

The debate highlights the complexity of designing safe distributed locks and the need to match the algorithm to the consistency and failure‑model requirements of the application.
