Understanding Zookeeper's ZAB Protocol: Leader Election, Data Synchronization, and Consistency
This article explains Zookeeper's ZAB (Zookeeper Atomic Broadcast) protocol, detailing node roles, the leader election process during startup and failure, the two‑phase commit data synchronization, ordering guarantees, and scenarios of leader crash and data loss, providing clear diagrams and code examples.
Hello everyone, I am Wukong. This is the ninth article in the distributed‑protocol series, focusing on the famous ZAB protocol in Zookeeper.
Zookeeper Node Roles
Leader
The Leader processes read requests and all write (transaction) requests; write transactions are applied atomically and in a single agreed order.
The Leader synchronizes write transaction requests to other nodes while preserving order.
Status: LEADING.
Follower
Handles client read requests.
Forwards write transactions to the Leader.
Participates in Leader election.
Status: FOLLOWING.
Observer
Similar to a Follower but does not participate in Leader election; its status is OBSERVING. Observers can be added to linearly scale read QPS.
Startup Phase: How is a Leader Chosen?
When the cluster starts, nodes vote for a Leader.
Example with two nodes A and B:
Node A votes for itself with SID and ZXID (e.g., (1,0)).
Node B votes for itself with (2,0).
Both nodes broadcast their votes to the whole cluster.
Each node checks that the other is still in the LOOKING state, then compares votes: the higher ZXID wins; if the ZXIDs are equal, the higher SID wins.
After voting, the nodes count votes; if a majority have the same vote, that node becomes Leader.
Since the ZXIDs tie and Node B has the larger SID, Node A becomes a Follower (FOLLOWING) and Node B becomes the Leader (LEADING).
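The comparison rule above can be sketched in a few lines. This is an illustrative model, not ZooKeeper's actual election code; `better_vote` and the `(sid, zxid)` tuple layout are assumptions made for the example.

```python
# Hypothetical sketch of the vote comparison used during leader election:
# a vote is (sid, zxid); the higher ZXID wins, and the higher SID breaks ties.

def better_vote(a, b):
    """Return the winning vote between a and b, each a (sid, zxid) tuple."""
    sid_a, zxid_a = a
    sid_b, zxid_b = b
    if zxid_a != zxid_b:
        return a if zxid_a > zxid_b else b
    return a if sid_a > sid_b else b

# Startup example from the text: A votes (1, 0), B votes (2, 0).
# ZXIDs tie, so the vote with the larger SID (node B) wins.
print(better_vote((1, 0), (2, 0)))  # (2, 0)
```

The same rule drives re-election after a Leader crash: a node that has seen more committed history (higher ZXID) always beats one that has seen less.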
What Happens When the Leader Crashes During Operation?
If the current Leader fails, the remaining Followers (Observers do not vote) start a new election using the same mechanism. The node with the highest ZXID (or highest SID if ZXIDs tie) becomes the new Leader.
How Do Nodes Synchronize Data?
Clients may connect to either the Leader or a Follower. If a client writes to a Follower, the request is forwarded to the Leader, which then performs a two‑phase commit (2PC) to synchronize data.
Two‑Phase Commit
Phase 1: Leader sends a proposal to Followers; Followers ack the proposal. If a majority ack, the process proceeds.
Phase 2: Leader loads the proposal from its disk log into memory, then sends a commit message. Followers load the data into memory upon receiving the commit.
Data flow when a client writes:
Client sends a write transaction.
Leader creates a proposal (e.g., "proposal01:zxid1") and writes it to its disk log.
Leader broadcasts the proposal to Followers.
Followers write the proposal to their own disk logs.
Followers ack the proposal.
Leader receives a majority of acks and proceeds to Phase 2.
Leader loads the proposal into the in‑memory znode structure.
Leader sends a commit to all Followers and Observers.
Followers load the committed data into memory.
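The write path above can be condensed into a small simulation. This is a minimal sketch of the two-phase broadcast under simplifying assumptions (synchronous calls, no Observers, no failures); the class and function names are illustrative, not the real ZooKeeper API.

```python
# Phase 1: log the proposal on each Follower and collect ACKs.
# Phase 2: commit once a majority of the ensemble has acknowledged.

class Follower:
    def __init__(self):
        self.log = []       # proposals persisted to the disk log
        self.znodes = {}    # in-memory committed state

    def on_proposal(self, zxid, key, value):
        self.log.append((zxid, key, value))  # write to the disk log
        return True                          # ACK back to the Leader

    def on_commit(self, zxid):
        for z, key, value in self.log:
            if z == zxid:
                self.znodes[key] = value     # load into memory

def leader_write(followers, zxid, key, value):
    acks = sum(f.on_proposal(zxid, key, value) for f in followers)
    # The Leader's own log write counts toward the majority.
    if acks + 1 > (len(followers) + 1) // 2:
        for f in followers:
            f.on_commit(zxid)
        return True
    return False

cluster = [Follower(), Follower()]
leader_write(cluster, 1, "/app/config", "v1")
print(cluster[0].znodes)  # {'/app/config': 'v1'}
```

Note that unlike classic 2PC, ZAB only needs a majority of ACKs, not all of them, which is why a lagging Follower does not block writes.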
How Does ZAB Ensure Ordered Consistency?
The Leader places each proposal into a dedicated FIFO queue per Follower, guaranteeing that every Follower processes proposals in exactly the order the Leader issued them.
Example proposals from three client writes:
proposal01:zxid1
proposal02:zxid2
proposal03:zxid3

The Leader enqueues them sequentially; Followers dequeue in the same order, preserving sequential consistency.
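The queue-per-Follower idea can be sketched as follows. This is an illustrative model (the real implementation lives in the Leader's per-Follower sender threads); the follower names are made up for the example.

```python
from collections import deque

proposals = ["proposal01:zxid1", "proposal02:zxid2", "proposal03:zxid3"]

# The Leader keeps one FIFO queue per Follower and enqueues in proposal order.
queues = {"followerA": deque(), "followerB": deque()}
for p in proposals:
    for q in queues.values():
        q.append(p)

# Each Follower dequeues FIFO, so every replica applies the same sequence.
applied = [queues["followerA"].popleft() for _ in proposals]
print(applied == proposals)  # True
```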
Is Zookeeper Strongly Consistent?
Official definition: sequential consistency.
Zookeeper does not guarantee strong consistency: because of network delays, Followers may commit at slightly different times, so a read served by a lagging Follower can briefly return stale data. Once every node has committed, the data is consistent. Clients that need an up-to-date read can call the sync API first, which forces the Follower to catch up with the Leader before the read is served.
Leader Crash Data‑Loss Scenarios
Scenario 1: Leader wrote to its local disk but did not broadcast the proposal before crashing. A new Leader is elected; its ZXID high 32‑bit part increments, indicating a new Leader version. When the old Leader rejoins as a Follower, it discards any proposals with a lower high‑bit ZXID.
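The epoch mechanism in Scenario 1 rests on how a ZXID is laid out: the high 32 bits hold the Leader epoch and the low 32 bits hold a per-epoch counter. A small sketch of that packing (helper names are illustrative):

```python
# A ZXID packs the Leader epoch into the high 32 bits and a
# per-epoch counter into the low 32 bits.

def make_zxid(epoch, counter):
    return (epoch << 32) | counter

def epoch_of(zxid):
    return zxid >> 32

# The old Leader crashed in epoch 1 holding an unbroadcast proposal; the new
# Leader starts epoch 2, so the stale proposal's epoch is lower and is discarded
# when the old Leader rejoins as a Follower.
stale = make_zxid(1, 7)
current = make_zxid(2, 1)
print(epoch_of(stale) < epoch_of(current))  # True
```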
Scenario 2: The Leader sent commit messages, but some Followers had not yet applied the proposal when the Leader crashed. The new Leader is the node with the highest ZXID (highest SID on a tie), so it already holds the committed proposal and re-synchronizes it to any lagging Followers.
The article aims to explain these mechanisms in plain language with diagrams.
Reference: "From Paxos to Zookeeper – Distributed Consistency Principles and Practice" (https://time.geekbang.org/column/article/229975).
Wukong Talks Architecture
Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.