Mastering Distributed Locks and Idempotency for High‑Concurrency Systems
This article explores the challenges of mutual exclusion and idempotency in distributed environments, explains the underlying principles of locks in multi‑threaded and multi‑process contexts, and presents practical implementations using Zookeeper, Redis, Tair, and the Cerberus and GTIS frameworks to ensure reliable, scalable operations.
With the rapid growth of internet data and increasingly complex business logic, systems face high‑concurrency access and massive data processing demands, making low‑cost high availability, scalability, and extensibility essential. Traditional monolithic architectures no longer suffice, leading to the adoption of distributed systems.
Distributed systems consist of loosely coupled independent servers connected via a network, offering scalability, high reliability, high concurrency, and cost‑effectiveness.
However, distributed environments introduce complexities such as clock drift and Byzantine failures, making classic issues like machine crashes and message loss more challenging.
Two typical problems in distributed systems are mutual‑exclusion (mutex) and idempotency, which this article analyzes.
Mutex Problem
Two common examples illustrate mutex issues:
Example 1: Service X holds a value of 100. Request A adds 200 while request B subtracts 100. Without proper handling, both may read 100 and write conflicting results.
Example 2: Two requests randomly pick a task from a shared pool, potentially selecting the same task simultaneously.
These scenarios show that when multiple requests access and modify the same resource without ordering guarantees, unexpected outcomes occur. Traditional databases use ACID transactions, but in distributed settings, distributed locks become a common solution.
Multi‑Thread Environment Solutions and Principles
Solution
Essentially all concurrency patterns solve thread‑conflict problems by serializing access to shared resources.
In Java, ReentrantLock and synchronized provide mutual‑exclusion. ReentrantLock uses CAS and a CLH queue, supporting fair and non‑fair modes.
CAS (Compare‑And‑Swap) atomically updates a memory value only when the expected value matches.
CLH queue is a linked list of waiting threads.
Lock acquisition attempts CAS first; if the lock is held, the thread joins the CLH queue and blocks until the lock is released. Fair locks grant the lock to the queue head, while non‑fair locks may allow a newly arriving thread to acquire it.
final boolean nonfairTryAcquire(int acquires) {
final Thread current = Thread.currentThread();
int c = getState();
if (c == 0) {
if (compareAndSetState(0, acquires)) {
setExclusiveOwnerThread(current);
return true;
}
} else if (current == getExclusiveOwnerThread()) {
int nextc = c + acquires;
if (nextc < 0) // overflow
throw new Error("Maximum lock count exceeded");
setState(nextc);
return true;
}
return false;
}When acquisition fails, the thread may block via LockSupport.park after checking the predecessor node.
final boolean acquireQueued(final Node node, int arg) {
boolean failed = true;
try {
boolean interrupted = false;
for (;;) {
final Node p = node.predecessor();
if (p == head && tryAcquire(arg)) {
setHead(node);
p.next = null; // help GC
failed = false;
return interrupted;
}
if (shouldParkAfterFailedAcquire(p, node) && parkAndCheckInterrupt())
interrupted = true;
}
} finally {
if (failed)
cancelAcquire(node);
}
}
private final boolean parkAndCheckInterrupt() {
LockSupport.park(this);
return Thread.interrupted();
}synchronized works via monitorenter/monitorexit bytecode instructions. The monitor is locked when a thread acquires ownership; re‑entrance increments an entry count, and exit decrements it.
The objectref must be of type reference. Each object is associated with a monitor. A monitor is locked if and only if it has an owner.
Multi‑Process Solution
Processes share critical resources such as printers or shared memory. Mutual exclusion is achieved using operating‑system semaphores (named or unnamed). The classic P (wait) and V (signal) operations control access.
P operation: decrement the semaphore if its value is > 0; otherwise block.
V operation: increment the semaphore and wake a blocked process if any.
Distributed‑Lock Solution
Basic Conditions
Storage space accessible to all nodes (e.g., database, Redis, Zookeeper, Tair).
Unique identifier for each lock.
At least two states (locked/unlocked).
Using a database table with columns lock_id (unique) and status (0 = unlocked, 1 = locked), a naive lock can be implemented:
lock = mysql.get(id);
while (lock.status == 1) {
sleep(100);
}
mysql.update(lock.status = 1);
// critical section
mysql.update(lock.status = 0);Problems with this approach include lack of atomicity, dead locks after network failures, and unlocking by a different host.
Advanced Requirements
Re‑entrancy.
Herd effect mitigation.
Fair vs. non‑fair ordering.
Blocking vs. spinning strategies.
Typical Implementations
Zookeeper uses sequential and ephemeral nodes. Clients create an EPHEMERAL_SEQUENTIAL node, list children, and acquire the lock if their node has the smallest sequence number. They watch the predecessor node for changes.
Redis relies on SETNX (set if not exists) and GETSET to achieve atomic lock acquisition, often with an expiration time to avoid dead locks.
Tair provides an expireLock method that combines lock state and expiration timestamp, avoiding reliance on synchronized clocks.
Cerberus Distributed Lock
Cerberus abstracts multiple engines (Tair, Zookeeper, future Redis) behind a single interface, offering features such as:
One‑interface, multi‑engine support.
JUC‑style API (lock, tryLock, lockInterruptibly, unlock).
One‑click engine downgrade and automatic failover.
Built‑in cluster and authorization mechanisms.
Typical usage mirrors java.util.concurrent.locks.ReentrantLock methods, requiring no extra learning curve.
Idempotency Problem
Idempotency means that multiple calls to an interface produce the same result as a single call. In distributed systems, network glitches, retries, and duplicate messages make idempotency crucial for operations such as order placement or message consumption.
GTIS Solution
GTIS (Global Transaction Idempotent Service) assigns a unique global ID to each business operation using an MD5 hash of a deterministic content key. The ID is stored in an external engine (e.g., Tair) via SETNX. If the key already exists, the operation is considered duplicate.
Workflow:
Business generates a unique transContents and sends it to GTIS.
GTIS creates a global ID (MD5 of transContents).
GTIS attempts SETNX with the ID as key and timestamp+transContents as value.
GTIS returns success/failure; the business proceeds only on success.
After the operation, the business reports the result to GTIS.
If successful, GTIS extends the key’s TTL; otherwise it may delete the key.
Key challenges include handling operation failures, timeouts, and host crashes while ensuring that only the correct ID is removed.
GTIS offers low latency (<2 ms per call), reliability with degradation strategies, simple integration, and flexible parameterization.
Conclusion
Mutex and idempotency issues are pervasive in distributed environments. Distributed locks—implemented via Zookeeper, Redis, Tair, or the Cerberus framework—address mutual exclusion, while GTIS provides a reliable way to guarantee idempotent operations. Both solutions have been deployed in production at large-scale companies, though they depend on external storage reliability and must be used with appropriate lock granularity.
Source: http://blog.csdn.net/zdy0_2004/article/details/52760404
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITFLY8 Architecture Home
ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
