How Alibaba Cloud Implements Reliable Distributed Locks for Shared Resources
Distributed locks ensure exclusive access to shared resources across multiple machines, and this article explains their evolution from single-machine locks, classifies system designs, and details Alibaba Cloud Storage’s practical implementation, covering strict mutual exclusion, availability, and lock-switching efficiency with real-world examples.
Editor’s note: Mutual exclusion for shared resources has long been a challenge for business systems. In distributed systems, a distributed lock is the common solution. This article discusses the principles of distributed locks, the technology choices involved, and how Alibaba Cloud Storage applies them in practice.
1 From Single-Machine Lock to Distributed Lock
In a single-machine environment, when a shared resource cannot enforce mutual exclusion itself, a third party (the kernel or a library) must supply a lock so that only one thread or process accesses the resource at a time. In a distributed environment, a distributed lock service provides the same capability across multiple machines.
Abstractly, a distributed lock can be expressed as:
Lock = Resource + Concurrency Control + Ownership Display
Examples of single-machine locks:
Spinlock = BOOL + CAS (optimistic lock)
Mutex = BOOL + CAS + Notification (pessimistic lock)
Both spinlock and mutex rely on an atomic CAS operation on a Boolean resource. Without explicit ownership display, a simple atomic integer is not considered a lock.
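The decomposition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Python exposes no user-level CAS instruction, so the hypothetical AtomicCell class simulates one with an internal guard, and the cell stores an owner id rather than a plain Boolean so that ownership is visible. The names AtomicCell and SpinLock are invented for this sketch.

```python
import threading
import time


class AtomicCell:
    """Simulated CAS register (assumption: stands in for a hardware CAS).

    Python has no user-level CAS instruction, so an internal
    threading.Lock provides the atomicity here.
    """

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()

    def compare_and_swap(self, expected, new):
        """Set the value to `new` only if it currently equals `expected`."""
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True

    def load(self):
        with self._guard:
            return self._value


class SpinLock:
    """Resource (the cell) + concurrency control (CAS) + ownership (owner id)."""

    def __init__(self):
        self._cell = AtomicCell(None)  # None means "free"

    def try_acquire(self, owner_id):
        # Succeeds only if the cell is free at the moment of the swap.
        return self._cell.compare_and_swap(None, owner_id)

    def acquire(self, owner_id):
        while not self.try_acquire(owner_id):
            time.sleep(0)  # yield to other threads, then retry

    def release(self, owner_id):
        # Only the recorded owner can release the lock.
        if not self._cell.compare_and_swap(owner_id, None):
            raise RuntimeError(f"{owner_id!r} does not hold the lock")

    def owner(self):
        return self._cell.load()
```

A mutex would add the third ingredient, notification, by parking waiters and waking one on release instead of retrying in a loop.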
In distributed environments, the lock service must also provide availability to handle machine failures.
2 Distributed Lock System Classification
Based on the safety of the lock resource, distributed locks fall into two camps:
Systems based on asynchronous replication (e.g., MySQL, Tair, Redis).
Systems based on Paxos-family consensus protocols (e.g., ZooKeeper, which uses ZAB, and etcd and Consul, which use Raft).
Asynchronous‑replication systems risk lock loss and usually rely on TTL for short‑lived locks, suitable for time‑sensitive tasks where occasional loss is tolerable.
Paxos‑based systems provide strong consistency via multi‑replica data and lease mechanisms, suitable for long‑lived locks where safety is critical.
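Why asynchronous replication can lose a lock is easiest to see in a toy model. The sketch below is an in-memory simulation, not the API of MySQL, Tair, or Redis; the class AsyncReplicatedStore and its methods are hypothetical names chosen for illustration. A write is acknowledged by the primary before it reaches the replica, so a failover in that window silently drops the lock.

```python
class AsyncReplicatedStore:
    """Toy primary/replica pair with asynchronous replication (illustrative only)."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = []  # writes acknowledged but not yet replicated

    def set_if_absent(self, key, value):
        """Acquire-style write: succeeds only if the key is not already set."""
        if key in self.primary:
            return False
        self.primary[key] = value
        self._pending.append((key, value))  # replication happens later
        return True

    def replicate(self):
        """Ship all pending writes to the replica."""
        for key, value in self._pending:
            self.replica[key] = value
        self._pending.clear()

    def failover(self):
        """Promote the replica; any unreplicated writes are lost."""
        self.primary = dict(self.replica)
        self._pending.clear()
```

If client A acquires the lock and failover() runs before replicate(), a second set_if_absent by client B succeeds and both clients believe they hold the lock. A consensus-based system avoids this by acknowledging the write only after a majority of replicas have stored it.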
3 Alibaba Cloud Storage Distributed Lock
Alibaba Cloud Storage has accumulated extensive experience in improving correctness, availability, and switch efficiency of distributed locks.
1 Strict Mutual Exclusion
The service binds each lock to a unique session; the client sends periodic heartbeats to keep the session alive. If the heartbeat stops, the session and its lock are released, preventing “one lock, multiple owners”.
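The session and heartbeat relationship can be sketched as a small in-memory simulation. It is not Alibaba Cloud Storage's implementation: SessionLockService, FakeClock, and every method name are assumptions made for this example, and the clock is injected so the behavior can be followed without real waiting.

```python
class FakeClock:
    """Manually advanced clock so expiry can be demonstrated deterministically."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class SessionLockService:
    """Locks are bound to sessions; a session lives only while heartbeats arrive."""

    def __init__(self, session_ttl, clock):
        self._ttl = session_ttl
        self._clock = clock
        self._sessions = {}  # session_id -> expiry time
        self._locks = {}     # lock name -> owning session_id

    def open_session(self, session_id):
        self._sessions[session_id] = self._clock() + self._ttl

    def heartbeat(self, session_id):
        """Extend the session; returns False if it has already expired."""
        self._expire()
        if session_id not in self._sessions:
            return False
        self._sessions[session_id] = self._clock() + self._ttl
        return True

    def try_lock(self, name, session_id):
        self._expire()
        if session_id not in self._sessions:
            return False
        holder = self._locks.setdefault(name, session_id)
        return holder == session_id

    def holder(self, name):
        self._expire()
        return self._locks.get(name)

    def _expire(self):
        """Drop expired sessions and every lock they owned."""
        now = self._clock()
        dead = {s for s, expiry in self._sessions.items() if expiry <= now}
        for s in dead:
            del self._sessions[s]
        self._locks = {n: s for n, s in self._locks.items() if s not in dead}
```

Because a lock can only ever be attached to one live session, and it disappears together with that session, the service never reports two owners for the same lock.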
Even so, mutual exclusion can break at the boundary: if a client’s operation on the shared resource runs longer than the lock’s TTL, the lock expires and another client can acquire it while the first is still working. Applications should either make the TTL comfortably longer than the longest operation or be able to roll back work done after ownership was lost.
To close this gap, the storage system can introduce an IOFence capability that provides token-based write protection, similar to the mechanism described in the Chubby paper.
Redis’s Redlock attempts to solve lock loss with a majority-vote mechanism, but two problems remain:
Unreliable wall-clock (NTP) time.
The lock service and the protected storage are separate, heterogeneous systems, so the lock alone cannot guarantee strict correctness.
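Monotonic clocks mitigate the wall-clock problem, but the heterogeneous-system problem remains. For example, a client pauses for garbage collection (GC) after acquiring the lock; the lock expires and is granted to another client; the first client then resumes and writes to storage, unaware that it no longer owns the lock.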
By using a global token that increments with each lock acquisition, the storage layer can reject stale requests, ensuring data protection.
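A minimal sketch of this token check follows. It is an illustration, not the IOFence implementation: TokenLockService and FencedStorage are hypothetical names, lock expiry is left out, and the only behavior modeled is that tokens increase with each acquisition and the storage layer remembers the highest token it has seen.

```python
import itertools


class TokenLockService:
    """Hands out a strictly increasing token on every lock acquisition."""

    def __init__(self):
        self._tokens = itertools.count(1)

    def acquire(self):
        # Expiry and ownership are omitted; only token issuance is modeled.
        return next(self._tokens)


class FencedStorage:
    """Storage that rejects writes carrying a token older than one already seen."""

    def __init__(self):
        self._highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self._highest_token:
            return False  # stale holder: a newer owner has already written
        self._highest_token = token
        self.data[key] = value
        return True
```

In the GC scenario, client A acquires token 1 and stalls, client B acquires token 2 and writes, and A's delayed write with token 1 is then refused. The check lives in the storage layer, so it works even though the lock service and the storage are separate systems.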
2 Availability
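Continuous heartbeats keep the lock robust, but a client that is effectively dead (for example, hung in its business logic) may keep heartbeating and so keep holding the lock. To release such a lock safely, the session can be blacklisted: the service refuses its further heartbeats, and the session then expires normally.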
Deleting a lock directly is unsafe because another client may have already acquired it.
Even after deletion, the original holder may still believe it owns the lock, breaking mutual exclusion.
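The blacklisting approach can be sketched as follows, again as an in-memory simulation with an injected clock. BlacklistingLockService, FakeClock, and the method names are assumptions for this example and do not describe the real service's interface. The point shown is that blacklisting does not free the lock at once; it only stops renewal, so the lock changes hands after the session has expired on the service side.

```python
class FakeClock:
    """Manually advanced clock for a deterministic walkthrough."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class BlacklistingLockService:
    def __init__(self, session_ttl, clock):
        self._ttl = session_ttl
        self._clock = clock
        self._expiry = {}        # session_id -> expiry time
        self._locks = {}         # lock name -> owning session_id
        self._blacklist = set()  # sessions whose heartbeats are refused

    def open_session(self, session_id):
        self._expiry[session_id] = self._clock() + self._ttl

    def blacklist(self, session_id):
        self._blacklist.add(session_id)

    def heartbeat(self, session_id):
        # A blacklisted session can no longer extend its lifetime.
        if session_id in self._blacklist or not self._alive(session_id):
            return False
        self._expiry[session_id] = self._clock() + self._ttl
        return True

    def try_lock(self, name, session_id):
        if not self._alive(session_id):
            return False
        holder = self._locks.get(name)
        if holder is not None and holder != session_id and self._alive(holder):
            return False  # current holder's session has not expired yet
        self._locks[name] = session_id
        return True

    def _alive(self, session_id):
        return self._expiry.get(session_id, float("-inf")) > self._clock()
```

The old holder learns that it has lost ownership through its failed heartbeats, and no other client can take the lock before the old session's lifetime has run out, which is what direct deletion cannot guarantee.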
3 Switch Efficiency
When a lock holder crashes, a new client must wait for the holder’s session to expire before it can acquire the lock. Shortening the session lifetime and increasing the heartbeat frequency reduce this wait, but they also add load to the backend.
If the lock node stores a unique identifier for its current holder, a CAS-based forced release can free a dead holder’s lock immediately, without waiting for the session to expire.
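A compare-and-delete on the stored identifier might look like the sketch below. It is a single-threaded illustration in which LockNodeStore and its methods are hypothetical; a real service would have to perform the comparison and the deletion as one atomic step on the server.

```python
import uuid


class LockNodeStore:
    """Each lock node records a unique id generated at acquisition time."""

    def __init__(self):
        self._nodes = {}  # lock name -> unique id of the current holder

    def create(self, name):
        """Acquire the lock; returns the new unique id, or None if it is held."""
        if name in self._nodes:
            return None
        ident = uuid.uuid4().hex
        self._nodes[name] = ident
        return ident

    def read(self, name):
        return self._nodes.get(name)

    def compare_and_delete(self, name, expected_id):
        """Forced release: delete only if the node still carries `expected_id`."""
        if self._nodes.get(name) != expected_id:
            return False  # the lock has changed hands since it was observed
        del self._nodes[name]
        return True
```

A monitor that has decided a holder is dead reads the identifier and passes it to compare_and_delete. If another client has acquired the lock in the meantime, the identifiers differ and the forced release does nothing, so a live holder's lock is never deleted by mistake.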
4 Conclusion
Distributed locks provide exclusive access to shared resources in distributed environments. When integrating a lock service, one must consider integration cost, reliability, switch precision, and correctness. Proper and continuous optimization of distributed lock usage is essential for robust systems.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
