Understanding Zookeeper: Architecture, Nodes, Sessions, Watchers, Leader Election, and Consistency
This article surveys Zookeeper: its purpose, cluster roles, Znode types, sessions, the watcher mechanism, ACL permissions, typical use cases, data consistency via the ZAB protocol, leader election, post-election data synchronization, potential inconsistency scenarios, and how it compares with other service-registry solutions.
What is Zookeeper?
Zookeeper is an open‑source distributed coordination service originally created by Yahoo; its name reflects the role of a zoo keeper in managing distributed components.
Its goals are high performance, high availability, and strictly ordered access, addressing data consistency in distributed environments.
Cluster Roles
A Zookeeper ensemble consists of Leader, Follower, and Observer nodes. Only the Leader processes write requests; Followers serve reads, forward writes to the Leader, and vote in both write proposals and leader election. Observers also serve reads and forward writes but do not vote, which scales read throughput without enlarging the quorum.
Clusters are recommended to have an odd number of nodes; as long as a majority are operational, the ensemble remains available.
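The majority rule above explains the odd-size recommendation. A minimal sketch (plain arithmetic, no Zookeeper API assumed) shows why an even-sized ensemble buys no extra fault tolerance:

```python
def quorum(n: int) -> int:
    """Smallest majority in an ensemble of n voting members."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Failures the ensemble survives while still reaching quorum."""
    return n - quorum(n)  # equals (n - 1) // 2

# A 5-node ensemble tolerates 2 failures; a 6-node ensemble still
# tolerates only 2, so the extra node adds cost but no resilience.
for n in (3, 4, 5, 6):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failures")
```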
Data Nodes (Znode)
Data is stored in memory as hierarchical Znodes (e.g., /a/b/c). Znodes are of three types: persistent, ephemeral, and sequential.
Persistent nodes remain until explicitly deleted. Ephemeral nodes exist only for the duration of the client session. Sequential nodes (both persistent and ephemeral) receive an ordered suffix when created.
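The sequential suffix is a 10-digit, zero-padded counter the server maintains per parent node. A toy in-memory stand-in (not the real server logic, just the naming behavior) illustrates it:

```python
class FakeZnodeStore:
    """Toy stand-in for ZooKeeper's sequential-node naming: the server
    appends a 10-digit, zero-padded counter kept per parent znode."""
    def __init__(self):
        self.counters = {}   # parent path -> next sequence number
        self.nodes = set()

    def create_sequential(self, path: str) -> str:
        parent = path.rsplit("/", 1)[0] or "/"
        seq = self.counters.get(parent, 0)
        self.counters[parent] = seq + 1
        name = f"{path}{seq:010d}"
        self.nodes.add(name)
        return name

store = FakeZnodeStore()
print(store.create_sequential("/locks/lock-"))  # /locks/lock-0000000000
print(store.create_sequential("/locks/lock-"))  # /locks/lock-0000000001
```

This ordering is what the distributed-lock and leader-election recipes later in the article rely on.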
Session
A session is a long‑lived TCP connection between a client and a Zookeeper server, with heartbeats and the ability to receive watch events.
Watcher Mechanism
Clients can register watchers on specific Znodes. When the watched event occurs, the server sends a one‑time notification (WatchedEvent) containing the event type, state, and node path; the client must then fetch the updated data.
Watcher callbacks are executed serially, and the notification payload is lightweight.
Client registers a watcher.
The watcher object is stored in the client’s WatcherManager.
When the server triggers the event, the client retrieves the watcher and executes its callback.
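The one-shot nature of watches is the detail that most often surprises newcomers. A minimal model (names are invented for illustration, not the real client API) of the register-fire-clear cycle:

```python
class WatchableNode:
    """Toy model of ZooKeeper's one-time watch semantics: a watcher
    fires on the next change only, then must be re-registered."""
    def __init__(self):
        self._watchers = []
        self.data = None

    def get_data(self, watcher=None):
        if watcher is not None:
            self._watchers.append(watcher)  # read + register in one call
        return self.data

    def set_data(self, value):
        self.data = value
        fired, self._watchers = self._watchers, []  # one-shot: clear first
        for w in fired:
            w("NodeDataChanged")  # event carries the type, not the new data

events = []
node = WatchableNode()
node.get_data(watcher=events.append)
node.set_data("v1")   # fires the watcher once
node.set_data("v2")   # no watcher registered any more
print(events)         # ['NodeDataChanged']
```

Note that the second update is silent: a client that wants continuous notifications must re-register a watch each time it handles an event, typically by reading the node again.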
ACL Permission Control
Zookeeper uses Access Control Lists (ACLs) with five permission types:
CREATE – permission to create child nodes.
DELETE – permission to delete child nodes.
READ – permission to read node data and list children.
WRITE – permission to update node data.
ADMIN – permission to set ACLs.
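Internally these five permissions are bit flags (the values below match `org.apache.zookeeper.ZooDefs.Perms`), so an ACL entry grants a combination of them. A small sketch of the bitmask check:

```python
# Permission bits as defined in org.apache.zookeeper.ZooDefs.Perms
READ, WRITE, CREATE, DELETE, ADMIN = 1, 2, 4, 8, 16
ALL = READ | WRITE | CREATE | DELETE | ADMIN

def has_perm(acl_bits: int, wanted: int) -> bool:
    """True if every bit in `wanted` is present in `acl_bits`."""
    return (acl_bits & wanted) == wanted

granted = READ | WRITE
print(has_perm(granted, READ))        # True
print(has_perm(granted, DELETE))      # False
print(has_perm(ALL, CREATE | ADMIN))  # True
```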
Typical Application Scenarios
Name service – generating globally unique IDs for resources.
Distributed coordination – using watchers to notify other components of state changes.
Cluster management – storing cluster state.
Master election – leveraging Zookeeper’s uniqueness to elect a leader.
Distributed lock – using temporary sequential nodes.
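The lock recipe works as follows: each client creates an ephemeral sequential node under the lock znode; the client with the smallest sequence number holds the lock, and every other client watches only its immediate predecessor (avoiding a thundering herd when the lock is released). A sketch of just the decision logic, operating on child names:

```python
def _seq(name: str) -> int:
    """Extract the numeric suffix from a name like 'lock-0000000005'."""
    return int(name.rsplit("-", 1)[1])

def lock_holder(children):
    """The child with the smallest sequence suffix holds the lock."""
    return min(children, key=_seq)

def node_to_watch(children, mine):
    """Each waiter watches only its immediate predecessor; the holder
    watches nothing (returns None)."""
    ordered = sorted(children, key=_seq)
    i = ordered.index(mine)
    return None if i == 0 else ordered[i - 1]

children = ["lock-0000000007", "lock-0000000005", "lock-0000000006"]
print(lock_holder(children))                       # lock-0000000005
print(node_to_watch(children, "lock-0000000007"))  # lock-0000000006
```

Because the nodes are ephemeral, a crashed client's node disappears with its session and the next waiter's watch fires automatically, so the lock can never be held by a dead process.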
How Zookeeper Guarantees Data Consistency
Zookeeper employs the ZAB (Zookeeper Atomic Broadcast) protocol, a two‑phase‑commit‑like process that ensures total order of writes.
The Leader receives a write request, creates a proposal, assigns a monotonically increasing zxid, and broadcasts it to Followers.
Followers write the proposal to their local log and ACK the Leader.
When a majority ACKs, the Leader commits the proposal and notifies Followers.
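The three steps above can be sketched as a toy quorum counter (a deliberately simplified model, not the real ZAB implementation, which also handles ordering and recovery):

```python
class ToyLeader:
    """Sketch of ZAB's broadcast phase: propose with a rising zxid,
    collect ACKs, commit once a majority (leader included) agrees."""
    def __init__(self, ensemble_size: int):
        self.ensemble_size = ensemble_size
        self.zxid = 0
        self.acks = {}        # zxid -> ACK count
        self.committed = []   # zxids committed, in order

    def propose(self, op: str) -> int:
        self.zxid += 1                  # monotonically increasing zxid
        self.acks[self.zxid] = 1        # the leader logs and counts itself
        return self.zxid

    def ack(self, zxid: int):
        self.acks[zxid] += 1
        majority = self.ensemble_size // 2 + 1
        if self.acks[zxid] >= majority and zxid not in self.committed:
            self.committed.append(zxid)  # quorum reached: commit, notify

leader = ToyLeader(ensemble_size=5)
z = leader.propose("set /a 1")
leader.ack(z)            # 2 of 5 ACKs: not yet a majority
print(leader.committed)  # []
leader.ack(z)            # 3 of 5 ACKs: quorum
print(leader.committed)  # [1]
```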
ZAB operates in two modes: crash recovery (leader election after failures) and message broadcast (normal operation).
Leader Election Process
Election occurs during startup and during runtime when the current leader fails.
During startup, each server votes for itself, broadcasts its vote (zxid, myid), and the node with the highest zxid (or highest myid if zxids tie) that receives a majority becomes the Leader.
During runtime, if the Leader crashes, non‑observer nodes revert to LOOKING state and repeat the voting process.
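The vote-comparison rule can be sketched with tuple ordering (a simplification: the real FastLeaderElection also compares the election epoch before zxid):

```python
def better_vote(a, b):
    """Compare two votes (zxid, myid): higher zxid wins, and myid
    breaks ties. Python tuple comparison gives exactly this order."""
    return a if a >= b else b

# Each server broadcasts (zxid, myid); everyone converges on the best.
votes = [(15, 1), (17, 2), (17, 3)]
winner = votes[0]
for v in votes[1:]:
    winner = better_vote(winner, v)
print(winner)  # (17, 3): highest zxid, then highest myid on a tie
```

Preferring the highest zxid ensures the elected Leader has the most up-to-date log, which matters for the synchronization step described next.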
Data Synchronization After Election
Followers and Observers (Learners) register with the new Leader and synchronize data using four methods:
DIFF: the Learner's last zxid lies between the Leader's min and max committed zxids; the Leader sends only the missing proposals.
TRUNC+DIFF: the Learner holds proposals the new Leader never committed; the Leader sends a TRUNC command to roll them back, then applies a DIFF.
TRUNC: the Learner's last zxid exceeds the Leader's max; the Leader truncates the Learner's log.
SNAP: full snapshot transfer, used when the Learner's last zxid is older than the Leader's min committed zxid or when the Leader has no proposal cache.
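The selection among the four methods can be sketched as a simple decision function (a simplified model: the real logic works on the Leader's committed-proposal cache, which is reduced here to a min/max zxid range and a flag):

```python
def choose_sync(learner_zxid: int, leader_min: int, leader_max: int,
                uncommitted_on_learner: bool = False) -> str:
    """Pick the sync strategy a new Leader uses for a Learner,
    following the four cases described above (simplified sketch)."""
    if learner_zxid < leader_min:
        return "SNAP"        # too far behind the cache: full snapshot
    if learner_zxid > leader_max:
        return "TRUNC"       # Learner is ahead: roll its log back
    if uncommitted_on_learner:
        return "TRUNC+DIFF"  # discard stray proposals, then replay diff
    return "DIFF"            # in range: send only the missing proposals

print(choose_sync(5, 3, 10))                                # DIFF
print(choose_sync(2, 3, 10))                                # SNAP
print(choose_sync(12, 3, 10))                               # TRUNC
print(choose_sync(7, 3, 10, uncommitted_on_learner=True))   # TRUNC+DIFF
```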
Potential Inconsistency Scenarios
Three cases can cause inconsistency:
Read inconsistency : A write succeeds on a majority but a read directed to a minority node may not see the update. Using the sync command before reads mitigates this.
Leader crashes before sending proposal : Proposals generated but not broadcast are lost.
Leader crashes after proposal but before commit : The proposal is retained; the new Leader (with the highest zxid) will replay it, preserving consistency.
Zookeeper vs. Other Service Registries

| Feature | Nacos | Eureka | Consul | Zookeeper |
| --- | --- | --- | --- | --- |
| Consistency protocol | CP+AP | AP | CP | CP |
| Health check | TCP/HTTP/MySQL/Client Beat | Client Beat | TCP/HTTP/gRPC/Cmd | Keep Alive |
| Load-balancing strategy | Weight/metadata/Selector | Ribbon | Fabio | — |
| Avalanche protection | Yes | Yes | No | No |
| Auto-deregister | Supported | Supported | Not supported | Supported |
| Access protocol | HTTP/DNS | HTTP | HTTP/DNS | TCP |
| Watch support | Supported | Supported | Supported | Supported |
| Multi-data-center | Supported | Supported | Supported | Not supported |
| Cross-registry sync | Supported | Not supported | Supported | Not supported |
| Spring Cloud integration | Supported | Supported | Supported | Not supported |
| Dubbo integration | Supported | Not supported | Not supported | Supported |
| K8s integration | Supported | Not supported | Supported | Not supported |
Understanding the CAP Theorem
CAP states that a distributed system can simultaneously provide at most two of the following three guarantees: Consistency, Availability, and Partition‑tolerance.
Because network partitions are inevitable (P must hold), systems must choose between C and A. Zookeeper prioritizes Consistency and Partition‑tolerance (CP), sacrificing Availability during leader election or network splits.
Wukong Talks Architecture
Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.