How Alibaba Cloud Implements Reliable Distributed Locks for Shared Resources

Distributed locks ensure exclusive access to shared resources across multiple machines. This article traces their evolution from single-machine locks, classifies the systems that implement them, and details Alibaba Cloud Storage's practice, covering strict mutual exclusion, availability, and lock-switching efficiency.

Alibaba Cloud Developer

Editor's note: Mutual exclusion for shared resources has long been a challenge for many business systems. In distributed systems, a distributed lock is the common solution. This article discusses the principles of distributed locks, the technology choices, and Alibaba Cloud Storage's concrete practice.

1 From Single-Machine Lock to Distributed Lock

In a single-machine environment, when a shared resource cannot provide mutual exclusion itself, a third party (the kernel or a library) must supply a lock so that only one thread or process can access the resource at a time. In a distributed environment, a distributed lock service provides the same capability across multiple machines.

Figure: From single-machine lock to distributed lock

Abstractly, a distributed lock can be expressed as:

Lock = Resource + Concurrency Control + Ownership Display

Examples of single-machine locks:

Spinlock = BOOL + CAS (optimistic lock)

Mutex = BOOL + CAS + Notification (pessimistic lock)

Both the spinlock and the mutex rely on an atomic CAS operation on a Boolean resource. An atomic integer on its own is not considered a lock, because it does not explicitly show who owns it.
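The three ingredients of the formula can be illustrated with a minimal Python sketch. This is not taken from the original article: `AtomicCell` and `OwnedSpinLock` are hypothetical names, and because Python has no hardware CAS instruction, the CAS here is simulated with an internal `threading.Lock`. The cell stores an owner identifier instead of a plain Boolean so that ownership is visible.

```python
import threading


class AtomicCell:
    """Simulated compare-and-swap; the internal guard only makes the CAS atomic."""

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

    def load(self):
        with self._guard:
            return self._value


class OwnedSpinLock:
    """Resource (the cell) + concurrency control (CAS) + ownership display (owner id)."""

    def __init__(self):
        self._owner = AtomicCell(None)  # None means "unlocked"

    def try_acquire(self, owner_id):
        return self._owner.compare_and_swap(None, owner_id)

    def acquire(self, owner_id):
        while not self.try_acquire(owner_id):  # spin until the CAS succeeds
            pass

    def release(self, owner_id):
        # Succeeds only if the caller is the recorded owner.
        return self._owner.compare_and_swap(owner_id, None)

    def owner(self):
        return self._owner.load()
```

A second caller's `try_acquire` fails while the first owner holds the lock, and `release` by anyone other than the recorded owner is refused.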

In distributed environments, the lock service must also provide availability to handle machine failures.

Figure: Features and implementation of distributed lock

2 Distributed Lock System Classification

Based on the safety of the lock resource, distributed locks fall into two camps:

Systems based on asynchronous replication (e.g., MySQL, Tair, Redis).

Systems based on Paxos or a Paxos-like consensus protocol (e.g., ZooKeeper, etcd, Consul).

Asynchronous‑replication systems risk lock loss and usually rely on TTL for short‑lived locks, suitable for time‑sensitive tasks where occasional loss is tolerable.

Paxos‑based systems provide strong consistency via multi‑replica data and lease mechanisms, suitable for long‑lived locks where safety is critical.
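The TTL-based style used by the first camp can be sketched as follows. This is an in-memory illustration under stated assumptions, not the API of MySQL, Tair, or Redis: `TtlLockStore` is a hypothetical class, and the clock is injected so that expiry can be shown without sleeping.

```python
class TtlLockStore:
    """In-memory stand-in for a key-value store whose lock entries expire."""

    def __init__(self, clock):
        self._clock = clock   # callable returning the current time in seconds
        self._locks = {}      # lock name -> (owner, expires_at)

    def try_acquire(self, name, owner, ttl):
        now = self._clock()
        held = self._locks.get(name)
        if held is not None and held[1] > now:
            return False      # still held and not yet expired
        self._locks[name] = (owner, now + ttl)
        return True

    def release(self, name, owner):
        held = self._locks.get(name)
        # Only the current, unexpired owner may release.
        if held is not None and held[0] == owner and held[1] > self._clock():
            del self._locks[name]
            return True
        return False
```

Once the TTL elapses, another client can take the lock even if the first owner never released it, which is why this style suits short-lived locks where occasional loss is tolerable.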

3 Alibaba Cloud Storage Distributed Lock

Alibaba Cloud Storage has accumulated extensive experience in improving correctness, availability, and switch efficiency of distributed locks.

1 Strict Mutual Exclusion

The service binds each lock to a unique session; the client sends periodic heartbeats to keep the session alive. If the heartbeat stops, the session and its lock are released, preventing “one lock, multiple owners”.
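A minimal sketch of the session and heartbeat binding described above, assuming a single in-memory server and an injected clock. `SessionLockService` and its method names are hypothetical and do not reflect the actual Alibaba Cloud Storage interface.

```python
class SessionLockService:
    """Locks are bound to sessions; a session lives only while heartbeats arrive."""

    def __init__(self, clock, session_ttl):
        self._clock = clock
        self._ttl = session_ttl
        self._sessions = {}   # session id -> expires_at
        self._locks = {}      # lock name -> session id

    def open_session(self, session_id):
        self._sessions[session_id] = self._clock() + self._ttl

    def _alive(self, session_id):
        expires = self._sessions.get(session_id)
        return expires is not None and expires > self._clock()

    def heartbeat(self, session_id):
        if not self._alive(session_id):
            return False      # an expired session cannot be revived
        self._sessions[session_id] = self._clock() + self._ttl
        return True

    def try_lock(self, name, session_id):
        if not self._alive(session_id):
            return False
        holder = self._locks.get(name)
        if holder is not None and holder != session_id and self._alive(holder):
            return False      # another live session owns the lock
        self._locks[name] = session_id
        return True

    def holder(self, name):
        holder = self._locks.get(name)
        return holder if holder is not None and self._alive(holder) else None
```

When a holder stops sending heartbeats, its session expires and the lock becomes available to the next live session without any explicit unlock call.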

Figure: Lock usage scenario

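Even with strict mutual exclusion at the lock service, exclusivity can still be broken if a client's operation outlasts the lock's TTL: the lock expires while the operation is in flight, another client acquires it, and both clients touch the resource at once. To avoid this boundary scenario, the client should make sure the TTL comfortably exceeds the operation time, or the operation should support rollback.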

Figure: Boundary scenario

To address this, the storage system can introduce an IOFence capability, which provides token-based write protection similar to the sequencer mechanism described in the Chubby paper.

Figure: IOFence capability

Redis's Redlock attempts to solve lock loss with a majority-vote mechanism, but it still has two problems:

Unreliable wall-clock (NTP) time.

Inability of heterogeneous systems to guarantee strict correctness.

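Monotonic time can mitigate the wall-clock problem, but heterogeneous systems still cannot guarantee correctness. For example, a garbage-collection (GC) pause can outlast the lock's validity, so the client loses ownership without noticing and keeps operating on the resource.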
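The monotonic-time mitigation can be sketched with Python's `time.monotonic()`, which is not affected by system clock updates. `LocalLease` is a hypothetical client-side helper introduced only for illustration; it tracks how long the client may still assume it holds the lock, and it does nothing about the pause problem described above.

```python
import time


class LocalLease:
    """Client-side view of a lease, timed with a monotonic clock."""

    def __init__(self, duration, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + duration

    def renew(self, duration):
        # Called after a successful heartbeat or renewal.
        self._deadline = self._clock() + duration

    def remaining(self):
        return max(0.0, self._deadline - self._clock())

    def is_valid(self):
        return self.remaining() > 0.0
```

A client would check `is_valid()` before each protected operation. A pause between that check and the operation itself can still let the lease lapse unnoticed, which is the gap that token-based protection at the storage layer is meant to close.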

Figure: Heterogeneous system correctness issue

By using a global token that increments with each lock acquisition, the storage layer can reject stale requests, ensuring data protection.
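The token check can be sketched as below. `LockService` and `FencedStorage` are hypothetical names, and the sketch assumes that the lock service has already decided the previous holder's session expired before it grants the lock again.

```python
class LockService:
    """Hands out a token that grows with every successful acquisition."""

    def __init__(self):
        self._token = 0
        self._holder = None

    def acquire(self, client):
        # Assumption: the caller only gets here once the previous holder's
        # session has expired or the lock was released.
        self._token += 1
        self._holder = client
        return self._token


class FencedStorage:
    """Storage layer that remembers the highest token seen and rejects older ones."""

    def __init__(self):
        self._highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self._highest_token:
            return False      # request from a stale holder: reject it
        self._highest_token = token
        self.data[key] = value
        return True
```

If client 1 stalls after acquiring the lock and client 2 acquires it and writes, a late write from client 1 carries the older token and is refused. Note that in this sketch the storage can only reject a stale write after it has seen a newer token.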

2 Availability

Continuous heartbeats keep the lock robust, but a client that can no longer do useful work may keep holding the lock as long as its heartbeats continue. To release such a lock safely, the service can blacklist the session: further heartbeats are refused, and the session then expires on its own.

Deleting a lock directly is unsafe because another client may have already acquired it.

Even after deletion, the original holder may still believe it owns the lock, breaking mutual exclusion.
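The blacklist approach can be sketched as follows, assuming an in-memory registry with an injected clock. `SessionRegistry` is a hypothetical name. The key point is that blacklisting does not delete anything; it only stops renewals, so the session ends through its normal expiry.

```python
class SessionRegistry:
    """Sessions expire unless renewed; blacklisted sessions can no longer renew."""

    def __init__(self, clock, session_ttl):
        self._clock = clock
        self._ttl = session_ttl
        self._expires = {}        # session id -> expires_at
        self._blacklist = set()

    def open(self, session_id):
        self._expires[session_id] = self._clock() + self._ttl

    def blacklist(self, session_id):
        self._blacklist.add(session_id)

    def alive(self, session_id):
        expires = self._expires.get(session_id)
        return expires is not None and expires > self._clock()

    def heartbeat(self, session_id):
        if session_id in self._blacklist or not self.alive(session_id):
            return False          # renewal refused
        self._expires[session_id] = self._clock() + self._ttl
        return True
```

Because the holder sees its heartbeats fail and its lease run out on the usual schedule, it has the chance to learn that it no longer owns the lock, which a direct deletion would not give it.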

3 Switch Efficiency

When a lock holder crashes, a new client must wait for the session to expire before acquiring the lock. Reducing the session lifetime and increasing heartbeat frequency improves switch precision but adds load to the backend.
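The trade-off can be made concrete with a small calculation. The functions below are illustrative helpers, not part of any lock service, and they assume that each heartbeat resets the session to its full lifetime and that network delay is negligible.

```python
def switch_wait_bounds(session_ttl, heartbeat_interval):
    """Return (shortest, longest) wait before a crashed holder's session expires.

    The longest wait occurs when the holder crashes right after a heartbeat,
    the shortest when it crashes just before the next one was due.
    """
    if heartbeat_interval >= session_ttl:
        raise ValueError("heartbeat interval must be shorter than the session TTL")
    return session_ttl - heartbeat_interval, session_ttl


def heartbeat_load(num_clients, heartbeat_interval):
    """Heartbeat requests per second that the backend must absorb."""
    return num_clients / heartbeat_interval
```

With assumed figures of 10,000 clients, shrinking the session lifetime from 30 s to 6 s and the heartbeat interval from 10 s to 2 s cuts the worst-case wait from 30 s to 6 s but raises heartbeat traffic from 1,000 to 5,000 requests per second.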

Figure: Client and server maintain expiration times

If the lock node stores a unique identifier for its current holder, a CAS-based forced release can free the lock of a dead holder immediately, eliminating the wait for session expiry.
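A minimal sketch of the CAS-based forced release, assuming a single in-memory node. `LockNode` and its methods are hypothetical, and in a real service the compare and the update would have to be one atomic operation on the server.

```python
import uuid


class LockNode:
    """A lock node that stores a unique identifier for its current holder."""

    def __init__(self):
        self._holder_id = None

    def try_acquire(self):
        if self._holder_id is not None:
            return None                       # already held
        self._holder_id = uuid.uuid4().hex    # fresh identifier per acquisition
        return self._holder_id

    def read(self):
        return self._holder_id

    def force_release(self, expected_id):
        # Compare-and-swap: free the node only if it still carries the
        # identifier the caller observed for the dead holder.
        if self._holder_id is not None and self._holder_id == expected_id:
            self._holder_id = None
            return True
        return False
```

Because each acquisition gets a new identifier, a forced release issued against an old identifier fails harmlessly if the lock has already changed hands, which avoids deleting a lock that a new client has just acquired.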

4 Conclusion

Distributed locks provide exclusive access to shared resources in distributed environments. When integrating a lock service, one must consider integration cost, reliability, switch precision, and correctness. Proper and continuous optimization of distributed lock usage is essential for robust systems.

