Understanding Zookeeper: Architecture, Nodes, Sessions, Watchers, Leader Election, and Consistency
This article surveys Zookeeper: its purpose, cluster roles, Znode types, sessions, the watcher mechanism, ACL permissions, typical use cases, data consistency via the ZAB protocol, leader election, post-election data synchronization, potential inconsistency scenarios, and how it compares with other service-registry solutions.
What is Zookeeper?
Zookeeper is an open‑source distributed coordination service originally created by Yahoo; its name reflects the role of a zoo keeper in managing distributed components.
Its goals are high performance, high availability, and strictly ordered access, addressing data consistency in distributed environments.
Cluster Roles
A Zookeeper ensemble consists of Leader, Follower, and Observer nodes. Only the Leader processes write requests; Followers serve reads, forward writes to the Leader, and vote in both write proposals and leader election. Observers also serve reads and forward writes but do not vote, which scales read throughput without enlarging the quorum.
Clusters are recommended to have an odd number of nodes; as long as a majority are operational, the ensemble remains available.
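The majority rule above explains the odd-size recommendation. A minimal sketch (plain arithmetic, no Zookeeper API assumed) shows why an even-sized ensemble buys no extra fault tolerance:

```python
def quorum(n: int) -> int:
    """Smallest majority in an ensemble of n voting members."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Failures the ensemble survives while still reaching quorum."""
    return n - quorum(n)  # equals (n - 1) // 2

# A 5-node ensemble tolerates 2 failures; a 6-node ensemble still
# tolerates only 2, so the extra node adds cost but no resilience.
for n in (3, 4, 5, 6):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failures")
```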
Data Nodes (Znode)
Data is stored in memory as hierarchical Znodes (e.g., /a/b/c). Znodes are of three types: persistent, ephemeral, and sequential.
Persistent nodes remain until explicitly deleted. Ephemeral nodes exist only for the duration of the client session. Sequential nodes (both persistent and ephemeral) receive an ordered suffix when created.
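The sequential suffix is a 10-digit, zero-padded counter the server maintains per parent node. A toy in-memory stand-in (not the real server logic, just the naming behavior) illustrates it:

```python
class FakeZnodeStore:
    """Toy stand-in for ZooKeeper's sequential-node naming: the server
    appends a 10-digit, zero-padded counter kept per parent znode."""
    def __init__(self):
        self.counters = {}   # parent path -> next sequence number
        self.nodes = set()

    def create_sequential(self, path: str) -> str:
        parent = path.rsplit("/", 1)[0] or "/"
        seq = self.counters.get(parent, 0)
        self.counters[parent] = seq + 1
        name = f"{path}{seq:010d}"
        self.nodes.add(name)
        return name

store = FakeZnodeStore()
print(store.create_sequential("/locks/lock-"))  # /locks/lock-0000000000
print(store.create_sequential("/locks/lock-"))  # /locks/lock-0000000001
```

This ordering is what the distributed-lock and leader-election recipes later in the article rely on.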
Session
A session is a long‑lived TCP connection between a client and a Zookeeper server, with heartbeats and the ability to receive watch events.
Watcher Mechanism
Clients can register watchers on specific Znodes. When the watched event occurs, the server sends a one‑time notification (WatchedEvent) containing the event type, state, and node path; the client must then fetch the updated data.
Watcher callbacks are executed serially, and the notification payload is lightweight.
Client registers a watcher.
The watcher object is stored in the client’s WatcherManager.
When the server triggers the event, the client retrieves the watcher and executes its callback.
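The one-shot nature of watches is the detail that most often surprises newcomers. A minimal model (names are invented for illustration, not the real client API) of the register-fire-clear cycle:

```python
class WatchableNode:
    """Toy model of ZooKeeper's one-time watch semantics: a watcher
    fires on the next change only, then must be re-registered."""
    def __init__(self):
        self._watchers = []
        self.data = None

    def get_data(self, watcher=None):
        if watcher is not None:
            self._watchers.append(watcher)  # read + register in one call
        return self.data

    def set_data(self, value):
        self.data = value
        fired, self._watchers = self._watchers, []  # one-shot: clear first
        for w in fired:
            w("NodeDataChanged")  # event carries the type, not the new data

events = []
node = WatchableNode()
node.get_data(watcher=events.append)
node.set_data("v1")   # fires the watcher once
node.set_data("v2")   # no watcher registered any more
print(events)         # ['NodeDataChanged']
```

Note that the second update is silent: a client that wants continuous notifications must re-register a watch each time it handles an event, typically by reading the node again.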
ACL Permission Control
Zookeeper uses Access Control Lists (ACLs) with five permission types:
CREATE – permission to create child nodes.
DELETE – permission to delete child nodes.
READ – permission to read node data and list children.
WRITE – permission to update node data.
ADMIN – permission to set ACLs.
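Internally these five permissions are bit flags (the values below match `org.apache.zookeeper.ZooDefs.Perms`), so an ACL entry grants a combination of them. A small sketch of the bitmask check:

```python
# Permission bits as defined in org.apache.zookeeper.ZooDefs.Perms
READ, WRITE, CREATE, DELETE, ADMIN = 1, 2, 4, 8, 16
ALL = READ | WRITE | CREATE | DELETE | ADMIN

def has_perm(acl_bits: int, wanted: int) -> bool:
    """True if every bit in `wanted` is present in `acl_bits`."""
    return (acl_bits & wanted) == wanted

granted = READ | WRITE
print(has_perm(granted, READ))        # True
print(has_perm(granted, DELETE))      # False
print(has_perm(ALL, CREATE | ADMIN))  # True
```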
Typical Application Scenarios
Name service – generating globally unique IDs for resources.
Distributed coordination – using watchers to notify other components of state changes.
Cluster management – storing cluster state.
Master election – leveraging Zookeeper’s uniqueness to elect a leader.
Distributed lock – using temporary sequential nodes.
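The lock recipe works as follows: each client creates an ephemeral sequential node under the lock znode; the client with the smallest sequence number holds the lock, and every other client watches only its immediate predecessor (avoiding a thundering herd when the lock is released). A sketch of just the decision logic, operating on child names:

```python
def _seq(name: str) -> int:
    """Extract the numeric suffix from a name like 'lock-0000000005'."""
    return int(name.rsplit("-", 1)[1])

def lock_holder(children):
    """The child with the smallest sequence suffix holds the lock."""
    return min(children, key=_seq)

def node_to_watch(children, mine):
    """Each waiter watches only its immediate predecessor; the holder
    watches nothing (returns None)."""
    ordered = sorted(children, key=_seq)
    i = ordered.index(mine)
    return None if i == 0 else ordered[i - 1]

children = ["lock-0000000007", "lock-0000000005", "lock-0000000006"]
print(lock_holder(children))                       # lock-0000000005
print(node_to_watch(children, "lock-0000000007"))  # lock-0000000006
```

Because the nodes are ephemeral, a crashed client's node disappears with its session and the next waiter's watch fires automatically, so the lock can never be held by a dead process.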
How Zookeeper Guarantees Data Consistency
Zookeeper employs the ZAB (Zookeeper Atomic Broadcast) protocol, a two‑phase‑commit‑like process that ensures total order of writes.
The Leader receives a write request, creates a proposal, assigns a monotonically increasing zxid, and broadcasts it to Followers.
Followers write the proposal to their local log and ACK the Leader.
When a majority ACKs, the Leader commits the proposal and notifies Followers.
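The three steps above can be sketched as a toy quorum counter (a deliberately simplified model, not the real ZAB implementation, which also handles ordering and recovery):

```python
class ToyLeader:
    """Sketch of ZAB's broadcast phase: propose with a rising zxid,
    collect ACKs, commit once a majority (leader included) agrees."""
    def __init__(self, ensemble_size: int):
        self.ensemble_size = ensemble_size
        self.zxid = 0
        self.acks = {}        # zxid -> ACK count
        self.committed = []   # zxids committed, in order

    def propose(self, op: str) -> int:
        self.zxid += 1                  # monotonically increasing zxid
        self.acks[self.zxid] = 1        # the leader logs and counts itself
        return self.zxid

    def ack(self, zxid: int):
        self.acks[zxid] += 1
        majority = self.ensemble_size // 2 + 1
        if self.acks[zxid] >= majority and zxid not in self.committed:
            self.committed.append(zxid)  # quorum reached: commit, notify

leader = ToyLeader(ensemble_size=5)
z = leader.propose("set /a 1")
leader.ack(z)            # 2 of 5 ACKs: not yet a majority
print(leader.committed)  # []
leader.ack(z)            # 3 of 5 ACKs: quorum
print(leader.committed)  # [1]
```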
ZAB operates in two modes: crash recovery (leader election after failures) and message broadcast (normal operation).
Leader Election Process
Election occurs during startup and during runtime when the current leader fails.
During startup, each server votes for itself, broadcasts its vote (zxid, myid), and the node with the highest zxid (or highest myid if zxids tie) that receives a majority becomes the Leader.
During runtime, if the Leader crashes, non‑observer nodes revert to LOOKING state and repeat the voting process.
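The vote-comparison rule can be sketched with tuple ordering (a simplification: the real FastLeaderElection also compares the election epoch before zxid):

```python
def better_vote(a, b):
    """Compare two votes (zxid, myid): higher zxid wins, and myid
    breaks ties. Python tuple comparison gives exactly this order."""
    return a if a >= b else b

# Each server broadcasts (zxid, myid); everyone converges on the best.
votes = [(15, 1), (17, 2), (17, 3)]
winner = votes[0]
for v in votes[1:]:
    winner = better_vote(winner, v)
print(winner)  # (17, 3): highest zxid, then highest myid on a tie
```

Preferring the highest zxid ensures the elected Leader has the most up-to-date log, which matters for the synchronization step described next.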
Data Synchronization After Election
Followers and Observers (Learners) register with the new Leader and synchronize data using four methods:
DIFF: the Learner's last zxid lies between the Leader's min and max committed zxids; the Leader sends only the missing proposals.
TRUNC+DIFF: the Learner holds proposals the new Leader never committed; the Leader sends a TRUNC command to roll them back, then applies a DIFF.
TRUNC: the Learner's last zxid exceeds the Leader's max; the Leader truncates the Learner's log.
SNAP: full snapshot transfer, used when the Learner's last zxid is older than the Leader's min committed zxid or when the Leader has no proposal cache.
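The selection among the four methods can be sketched as a simple decision function (a simplified model: the real logic works on the Leader's committed-proposal cache, which is reduced here to a min/max zxid range and a flag):

```python
def choose_sync(learner_zxid: int, leader_min: int, leader_max: int,
                uncommitted_on_learner: bool = False) -> str:
    """Pick the sync strategy a new Leader uses for a Learner,
    following the four cases described above (simplified sketch)."""
    if learner_zxid < leader_min:
        return "SNAP"        # too far behind the cache: full snapshot
    if learner_zxid > leader_max:
        return "TRUNC"       # Learner is ahead: roll its log back
    if uncommitted_on_learner:
        return "TRUNC+DIFF"  # discard stray proposals, then replay diff
    return "DIFF"            # in range: send only the missing proposals

print(choose_sync(5, 3, 10))                                # DIFF
print(choose_sync(2, 3, 10))                                # SNAP
print(choose_sync(12, 3, 10))                               # TRUNC
print(choose_sync(7, 3, 10, uncommitted_on_learner=True))   # TRUNC+DIFF
```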
Potential Inconsistency Scenarios
Three cases can cause inconsistency:
Read inconsistency : A write succeeds on a majority but a read directed to a minority node may not see the update. Using the sync command before reads mitigates this.
Leader crashes before sending proposal : Proposals generated but not broadcast are lost.
Leader crashes after proposal but before commit : The proposal is retained; the new Leader (with the highest zxid) will replay it, preserving consistency.
Zookeeper vs. Other Service Registries

| Feature | Nacos | Eureka | Consul | Zookeeper |
| --- | --- | --- | --- | --- |
| Consistency protocol | CP+AP | AP | CP | CP |
| Health check | TCP/HTTP/MySQL/Client Beat | Client Beat | TCP/HTTP/gRPC/Cmd | Keep Alive |
| Load-balancing strategy | Weight/metadata/Selector | Ribbon | Fabio | — |
| Avalanche protection | Yes | Yes | No | No |
| Auto-deregister | Supported | Supported | Not supported | Supported |
| Access protocol | HTTP/DNS | HTTP | HTTP/DNS | TCP |
| Watch support | Supported | Supported | Supported | Supported |
| Multi-data-center | Supported | Supported | Supported | Not supported |
| Cross-registry sync | Supported | Not supported | Supported | Not supported |
| Spring Cloud integration | Supported | Supported | Supported | Not supported |
| Dubbo integration | Supported | Not supported | Not supported | Supported |
| K8s integration | Supported | Not supported | Supported | Not supported |
Understanding the CAP Theorem
CAP states that a distributed system can simultaneously provide at most two of the following three guarantees: Consistency, Availability, and Partition‑tolerance.
Because network partitions are inevitable (P must hold), systems must choose between C and A. Zookeeper prioritizes Consistency and Partition‑tolerance (CP), sacrificing Availability during leader election or network splits.
Wukong Talks Architecture
Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.