How ZooKeeper Guarantees Sequential Order in Distributed Read‑Write Locks
This article explores how Alibaba’s Nuwa service and the open‑source ZooKeeper implement distributed read‑write locks using sequential files, detailing the challenges of maintaining cversion order during failover and the community’s solutions to ensure consistency and reliability.
1. Origin of the Read‑Write Lock Problem
Apache ZooKeeper (ZK) is a widely used open‑source coordination service that provides primitives for building distributed locks, service discovery, and metadata storage. Besides regular and ephemeral nodes, ZK offers sequential nodes whose names include a monotonically increasing suffix derived from the parent directory’s cversion value. By creating sequential nodes such as /vm/vsp‑xxxxx/_READ_ and /vm/vsp‑xxxxx/_WRIT_, the suffix order can be used to implement a distributed read‑write lock: a client treats the smallest‑suffix read node as holding a read lock if no smaller‑suffix write node exists, and similarly for write locks.
The correctness of this mechanism relies on the strict ordering of sequential node suffixes, even across failover events.
2. Risks During Failover
2.1 Idempotence Trap
ZooKeeper follows a replicated state‑machine model that keeps all state in memory, providing high‑performance access. Snapshots are taken asynchronously: the in‑memory DataTree is traversed and persisted to disk while the system continues to process transactions. Consequently, a snapshot captures a “mixed” state that does not correspond to any single point in time.
When replaying transaction logs after a failover, the system first loads the latest snapshot and then re‑applies transactions (e.g., op1, op2, op3) in order. Because the snapshot may already include the effects of some of these operations, they are applied again, which is a common issue for asynchronous snapshot designs.
2.2 Order Not Order
In early ZK versions, the replay logic was not separated from normal transaction processing. Deleting a non‑existent node caused the delete transaction to fail without updating cversion, leading to a possible rollback of cversion after failover.
3. Community’s First Intervention
The open‑source community responded by separating normal transaction handling from failover replay. During replay, a NoNodeException on delete now also updates the parent’s cversion, ensuring that cversion always increases after a failover.
4. Community’s Second Intervention
Further analysis revealed that updating cversion only during replay was not idempotent; different ZK nodes could diverge on the same directory’s cversion. The final fix was to record the parent’s cversion directly in the transaction log. Only the create transaction carries a parentCVersion field; the delete transaction’s effect can be derived by subtracting the number of remaining child nodes from this value, reducing the change surface.
During log replay, the system reads the parentCVersion from each create transaction and uses it to reconstruct the correct cversion for the directory, guaranteeing consistency without additional modifications.
5. Lessons Learned
Since its inception in 2009, Alibaba’s Nuwa service has continuously refined distributed coordination techniques. Consensus protocols such as Paxos, Raft, or ZAB, combined with lease mechanisms, provide safe coarse‑grained locks similar to those described in Google’s Chubby paper. Maintaining tiny details like cversion consistency is crucial; overlooking them can cause severe production incidents.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
