Fundamentals 14 min read

How ZooKeeper Guarantees Sequential Order in Distributed Read‑Write Locks

This article explores how Alibaba’s Nuwa service and the open‑source ZooKeeper implement distributed read‑write locks using sequential files, detailing the challenges of maintaining cversion order during failover and the community’s solutions to ensure consistency and reliability.

Alibaba Cloud Developer
Alibaba Cloud Developer
Alibaba Cloud Developer
How ZooKeeper Guarantees Sequential Order in Distributed Read‑Write Locks

1. Origin of the Read‑Write Lock Problem

Apache ZooKeeper (ZK) is a widely used open‑source coordination service that provides primitives for building distributed locks, service discovery, and metadata storage. Besides regular and ephemeral nodes, ZK offers sequential nodes whose names include a monotonically increasing suffix derived from the parent directory’s cversion value. By creating sequential nodes such as /vm/vsp‑xxxxx/_READ_ and /vm/vsp‑xxxxx/_WRIT_, the suffix order can be used to implement a distributed read‑write lock: a client treats the smallest‑suffix read node as holding a read lock if no smaller‑suffix write node exists, and similarly for write locks.

The correctness of this mechanism relies on the strict ordering of sequential node suffixes, even across failover events.

2. Risks During Failover

2.1 Idempotence Trap

ZooKeeper follows a replicated state‑machine model that keeps all state in memory, providing high‑performance access. Snapshots are taken asynchronously: the in‑memory DataTree is traversed and persisted to disk while the system continues to process transactions. Consequently, a snapshot captures a “mixed” state that does not correspond to any single point in time.

When replaying transaction logs after a failover, the system first loads the latest snapshot and then re‑applies transactions (e.g., op1, op2, op3) in order. Because the snapshot may already include the effects of some of these operations, they are applied again, which is a common issue for asynchronous snapshot designs.

2.2 Order Not Order

In early ZK versions, the replay logic was not separated from normal transaction processing. Deleting a non‑existent node caused the delete transaction to fail without updating cversion, leading to a possible rollback of cversion after failover.

3. Community’s First Intervention

The open‑source community responded by separating normal transaction handling from failover replay. During replay, a NoNodeException on delete now also updates the parent’s cversion, ensuring that cversion always increases after a failover.

4. Community’s Second Intervention

Further analysis revealed that updating cversion only during replay was not idempotent; different ZK nodes could diverge on the same directory’s cversion. The final fix was to record the parent’s cversion directly in the transaction log. Only the create transaction carries a parentCVersion field; the delete transaction’s effect can be derived by subtracting the number of remaining child nodes from this value, reducing the change surface.

During log replay, the system reads the parentCVersion from each create transaction and uses it to reconstruct the correct cversion for the directory, guaranteeing consistency without additional modifications.

5. Lessons Learned

Since its inception in 2009, Alibaba’s Nuwa service has continuously refined distributed coordination techniques. Consensus protocols such as Paxos, Raft, or ZAB, combined with lease mechanisms, provide safe coarse‑grained locks similar to those described in Google’s Chubby paper. Maintaining tiny details like cversion consistency is crucial; overlooking them can cause severe production incidents.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsZooKeeperread-write lockfailoverConsensuscversion
Alibaba Cloud Developer
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.