ZooKeeper Overview: Architecture, Data Model, Sessions, Watches, Consistency Guarantees, Leader Election and Zab Protocol
This article provides a comprehensive introduction to ZooKeeper, covering its purpose, design goals, hierarchical data model, znode types, client sessions, watch mechanism, consistency guarantees, leader election process, leader and follower workflows, and the Zab atomic broadcast protocol.
ZooKeeper Introduction
ZooKeeper is an open‑source distributed coordination service that offers a simple set of primitives for building synchronization, configuration management, and naming services in distributed applications.
Design Goals
Strong Consistency: All clients see the same view regardless of which server they connect to.
Reliability: Once a message is accepted by one server, it is eventually accepted by all servers.
Timeliness: Clients receive updates within a bounded time interval, though they may need to call sync() for the freshest data.
Wait‑Free: Slow or failed clients do not block fast clients.
Atomicity: Updates either succeed completely or fail without partial state.
Ordering: Global ordering guarantees that messages are delivered in the same order on all servers; partial ordering ensures a sender’s own messages preserve order.
Data Model
ZooKeeper maintains a hierarchical namespace similar to a file system, where each node is called a znode and is uniquely identified by its path (e.g., /NameService/Server1 ).
Key characteristics of the znode structure:
Each znode can have child znodes and store data; EPHEMERAL nodes cannot have children.
Every znode is versioned; each write increments the version number.
Node types include: Persistent: Remains after server restarts. Ephemeral: Deleted when the client session ends. Non‑sequential: Created with the exact name supplied. Sequential: Name is suffixed with a monotonically increasing decimal sequence.
Watches can be set on znodes to receive notifications of data changes or child‑node modifications.
Each state change generates a globally ordered ZooKeeper Transaction ID (zxid) composed of an epoch and a counter.
Session
A client establishes a TCP connection to the ZooKeeper ensemble, entering a session that can be in CONNECTING or CONNECTED state. If the connection times out, the client automatically retries; the server decides when a session expires.
Watch Mechanism
Watches are one‑time triggers attached to read operations ( getData() , getChildren() , exists() ). When the watched data changes, the server sends a single notification to the client that set the watch. Subsequent changes require the client to set a new watch.
a watch event is a one‑time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes.
Consistency Guarantees
Sequential Consistency: Updates from a single client are applied in order.
Atomicity: Updates are all‑or‑nothing.
Single System Image: All clients see the same state regardless of the server they connect to.
Reliability: Once committed, an update persists until overwritten.
Timeliness: All clients observe the same information within a bounded time.
ZooKeeper Operation
Servers in the ensemble assume one of three roles—leader, follower, or observer—and can be in states such as LOOKING, LEADING, FOLLOWING, or OBSERVING. The core of ZooKeeper is the Zab (ZooKeeper Atomic Broadcast) protocol, which ensures ordered state updates.
Leader Election
When the current leader fails, the ensemble enters recovery mode and runs an election (basic Paxos or fast Paxos). The fast Paxos algorithm is the default: each server proposes itself, exchanges zxids, and the server with the highest zxid that obtains a quorum becomes the new leader.
Leader Workflow
Recover data after a crash.
Maintain heartbeats with followers and process their requests.
Handle follower messages (PING, REQUEST, ACK, REVALIDATE) accordingly.
Follower Workflow
Send PING, REQUEST, ACK, REVALIDATE messages to the leader.
Process messages received from the leader.
Forward client write requests to the leader for voting.
Return results to the client.
Zab: Broadcasting State Updates
ZooKeeper uses a two‑phase commit to decide whether a transaction is committed:
Leader sends a PROPOSAL to all followers.
Followers write the proposal to disk and reply with ACK.
When the leader receives ACKs from a quorum, it sends a COMMIT.
The protocol guarantees that all servers execute transactions in the same order and that at most one leader holds a quorum at any time.
Summary
The article briefly introduced ZooKeeper’s fundamentals, data model, session handling, watch mechanism, consistency guarantees, leader election, leader/follower processing, and the Zab protocol that underpins its reliability and ordering.
References
"ZooKeeper—Distributed Process Coordination" by Flavio Junqueira and Benjamin Reed
ZooKeeper official documentation
IBM DeveloperWorks article on ZooKeeper
Analysis of ZooKeeper consistency algorithm
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.