Fundamentals 17 min read

ZooKeeper Overview: Architecture, Data Model, Sessions, Watches, Consistency Guarantees, Leader Election and Zab Protocol

This article provides a comprehensive introduction to ZooKeeper, covering its purpose, design goals, hierarchical data model, znode types, client sessions, watch mechanism, consistency guarantees, leader election process, leader and follower workflows, and the Zab atomic broadcast protocol.

Top Architect
Top Architect
Top Architect
ZooKeeper Overview: Architecture, Data Model, Sessions, Watches, Consistency Guarantees, Leader Election and Zab Protocol

ZooKeeper Introduction

ZooKeeper is an open‑source distributed coordination service that offers a simple set of primitives for building synchronization, configuration management, and naming services in distributed applications.

Design Goals

Strong Consistency: All clients see the same view regardless of which server they connect to.

Reliability: Once a message is accepted by one server, it is eventually accepted by all servers.

Timeliness: Clients receive updates within a bounded time interval, though they may need to call sync() for the freshest data.

Wait‑Free: Slow or failed clients do not block fast clients.

Atomicity: Updates either succeed completely or fail without partial state.

Ordering: Global ordering guarantees that messages are delivered in the same order on all servers; partial ordering ensures a sender’s own messages preserve order.

Data Model

ZooKeeper maintains a hierarchical namespace similar to a file system, where each node is called a znode and is uniquely identified by its path (e.g., /NameService/Server1 ).

Key characteristics of the znode structure:

Each znode can have child znodes and store data; EPHEMERAL nodes cannot have children.

Every znode is versioned; each write increments the version number.

Node types include: Persistent: Remains after server restarts. Ephemeral: Deleted when the client session ends. Non‑sequential: Created with the exact name supplied. Sequential: Name is suffixed with a monotonically increasing decimal sequence.

Watches can be set on znodes to receive notifications of data changes or child‑node modifications.

Each state change generates a globally ordered ZooKeeper Transaction ID (zxid) composed of an epoch and a counter.

Session

A client establishes a TCP connection to the ZooKeeper ensemble, entering a session that can be in CONNECTING or CONNECTED state. If the connection times out, the client automatically retries; the server decides when a session expires.

Watch Mechanism

Watches are one‑time triggers attached to read operations ( getData() , getChildren() , exists() ). When the watched data changes, the server sends a single notification to the client that set the watch. Subsequent changes require the client to set a new watch.

a watch event is a one‑time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes.

Consistency Guarantees

Sequential Consistency: Updates from a single client are applied in order.

Atomicity: Updates are all‑or‑nothing.

Single System Image: All clients see the same state regardless of the server they connect to.

Reliability: Once committed, an update persists until overwritten.

Timeliness: All clients observe the same information within a bounded time.

ZooKeeper Operation

Servers in the ensemble assume one of three roles—leader, follower, or observer—and can be in states such as LOOKING, LEADING, FOLLOWING, or OBSERVING. The core of ZooKeeper is the Zab (ZooKeeper Atomic Broadcast) protocol, which ensures ordered state updates.

Leader Election

When the current leader fails, the ensemble enters recovery mode and runs an election (basic Paxos or fast Paxos). The fast Paxos algorithm is the default: each server proposes itself, exchanges zxids, and the server with the highest zxid that obtains a quorum becomes the new leader.

Leader Workflow

Recover data after a crash.

Maintain heartbeats with followers and process their requests.

Handle follower messages (PING, REQUEST, ACK, REVALIDATE) accordingly.

Follower Workflow

Send PING, REQUEST, ACK, REVALIDATE messages to the leader.

Process messages received from the leader.

Forward client write requests to the leader for voting.

Return results to the client.

Zab: Broadcasting State Updates

ZooKeeper uses a two‑phase commit to decide whether a transaction is committed:

Leader sends a PROPOSAL to all followers.

Followers write the proposal to disk and reply with ACK.

When the leader receives ACKs from a quorum, it sends a COMMIT.

The protocol guarantees that all servers execute transactions in the same order and that at most one leader holds a quorum at any time.

Summary

The article briefly introduced ZooKeeper’s fundamentals, data model, session handling, watch mechanism, consistency guarantees, leader election, leader/follower processing, and the Zab protocol that underpins its reliability and ordering.

References

"ZooKeeper—Distributed Process Coordination" by Flavio Junqueira and Benjamin Reed

ZooKeeper official documentation

IBM DeveloperWorks article on ZooKeeper

Analysis of ZooKeeper consistency algorithm

Zookeeperconsistencydata modelDistributed CoordinationLeader ElectionZab Protocol
Top Architect
Written by

Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.