
Introduction to ZooKeeper: Design Goals, Data Model, Sessions, Watches, Consistency Guarantees, Leader Election, and Deployment

This article provides a comprehensive overview of ZooKeeper, covering its purpose as a distributed coordination service, design objectives such as consistency and reliability, hierarchical data model, session and watch mechanisms, consistency guarantees, leader election and Zab protocol, as well as practical deployment details.

Java Captain

ZooKeeper Introduction

ZooKeeper is an open‑source distributed application coordination service that offers a simple set of primitives enabling developers to implement synchronization, configuration maintenance, and naming services.

Design Goals

Eventual Consistency: All clients eventually see the same view regardless of which server they connect to.

Reliability: If a message is accepted by one server, it will be accepted by all servers.

Timeliness: Clients receive updates or failure notifications within a bounded time interval; for the freshest data, call sync() before reading.

Wait‑free: Slow or failed clients do not interfere with the requests of fast clients, so each client's requests make progress without waiting on the others.

Atomicity: Updates either succeed completely or fail, with no intermediate state.

Ordering: Global ordering ensures that if message a precedes message b on one server, it does so on all servers; partial ordering guarantees that messages from the same sender preserve their order.

Data Model

ZooKeeper maintains a hierarchical namespace similar to a standard file system.

The key characteristics of this model are:

Each node (znode) is uniquely identified by its full path, e.g., /NameService/Server1 .

Znodes may have children and can store data; ephemeral (EPHEMERAL) znodes cannot have children.

Each znode carries a version number that increments automatically on every data update. Only the latest data is stored; the version number lets a client make an update conditional on the version it last read.

Node types include:

Persistent: Remains after the creating client's session ends and after server restarts; may contain data and children.

Ephemeral: Deleted automatically when the client session that created it ends.

Non‑sequential: The name is exactly as specified; when multiple clients attempt to create the same node simultaneously, only one succeeds.

Sequential: The created node name receives a 10‑digit decimal suffix, allowing all concurrent creators to succeed with unique names.

Znodes can be watched for data changes or child‑list modifications; watches are a core feature used by many ZooKeeper functions.

Every state change generates a globally ordered transaction ID (zxid) that determines the order of operations across the ensemble.
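The naming and versioning rules above can be illustrated with a toy in-memory model. This is a minimal sketch, not ZooKeeper's implementation: the `ZNode` and `Tree` classes and their method names are hypothetical, and the per-parent sequence counter is a simplification of how a real server tracks sequence numbers.

```python
class ZNode:
    def __init__(self, data=b"", ephemeral=False):
        self.data = data
        self.ephemeral = ephemeral
        self.version = 0      # data version, bumped on every set_data
        self.children = {}    # child name -> ZNode
        self.next_seq = 0     # toy counter for sequential children


class Tree:
    def __init__(self):
        self.root = ZNode()

    def _find(self, path):
        node = self.root
        for part in [p for p in path.split("/") if p]:
            node = node.children[part]  # KeyError if the path does not exist
        return node

    def create(self, path, data=b"", ephemeral=False, sequential=False):
        parent_path, _, name = path.rpartition("/")
        parent = self._find(parent_path)
        if parent.ephemeral:
            raise ValueError("ephemeral znodes cannot have children")
        if sequential:
            # Append a zero-padded 10-digit decimal suffix.
            name = f"{name}{parent.next_seq:010d}"
            parent.next_seq += 1
        if name in parent.children:
            raise FileExistsError(path)
        parent.children[name] = ZNode(data, ephemeral)
        return f"{parent_path}/{name}"

    def set_data(self, path, data, expected_version=-1):
        node = self._find(path)
        # -1 means "any version"; otherwise the update is conditional.
        if expected_version not in (-1, node.version):
            raise ValueError("bad version")
        node.data = data
        node.version += 1
        return node.version


tree = Tree()
tree.create("/locks")
first = tree.create("/locks/lock-", sequential=True)
second = tree.create("/locks/lock-", sequential=True)
print(first, second)
```

Two creators that use the same sequential prefix both succeed and receive distinct names, whereas a second non-sequential create of the same path raises an error in this model.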

Session

A client establishes a connection to a ZooKeeper ensemble, and its session state transitions are illustrated in the accompanying diagram.

If a client loses connection due to a timeout, it enters the CONNECTING state and automatically attempts to reconnect; if reconnection occurs within the session timeout, the client returns to CONNECTED . The server, not the client, decides when a session expires.
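The reconnect rule can be sketched from the server's point of view. This is a simplified illustration under stated assumptions: `SessionTracker`, its methods, and the returned state strings are hypothetical names chosen for the example, and time is passed in explicitly as milliseconds.

```python
class SessionTracker:
    """Toy server-side view: a session lives while heartbeats keep arriving."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_heard = {}  # session id -> time of last heartbeat (ms)

    def touch(self, session_id, now_ms):
        # Called on every request or ping from the client.
        self.last_heard[session_id] = now_ms

    def reconnect(self, session_id, now_ms):
        # The server decides: reconnecting within the timeout keeps the
        # session, anything later finds it already expired.
        last = self.last_heard.get(session_id)
        if last is None or now_ms - last > self.timeout_ms:
            self.last_heard.pop(session_id, None)
            return "EXPIRED"
        self.last_heard[session_id] = now_ms
        return "CONNECTED"


tracker = SessionTracker(timeout_ms=30_000)
tracker.touch(1, now_ms=0)
print(tracker.reconnect(1, now_ms=10_000))  # within the timeout
print(tracker.reconnect(1, now_ms=50_000))  # 40 s after the last contact
```

The client never computes expiry itself in this model; it only learns the outcome when it manages to reach a server again.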

Watch

A watch is a one‑time trigger sent to the client that set it, activated when the watched data changes.

The official documentation words it this way: "a watch event is one-time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes."

Key points about watches:

One‑time trigger: After a change, the watch fires once; subsequent changes require the client to set a new watch.

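Sent to the client: Watch events are delivered asynchronously over the socket; ordering guarantees ensure that a client sees the watch event before it sees the new data that triggered it.

Data specificity: A znode can change in different ways, so ZooKeeper distinguishes data watches from child watches. setData() triggers data watches on the znode; create() triggers a data watch on the new znode and a child watch on its parent; delete() triggers the watches on the deleted znode and a child watch on its parent. While a client is disconnected it receives no watch events, so a change that happens and is undone during that window may be missed.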
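The one-time behavior is easy to see in a small in-process model. This is a sketch only: `WatchedStore` and its methods are hypothetical, there is no network or asynchrony, and the callback receives just the path because a watch event does not carry the new data.

```python
from collections import defaultdict


class WatchedStore:
    def __init__(self):
        self.data = {}
        self.watches = defaultdict(list)  # path -> pending callbacks

    def get_data(self, path, watch=None):
        # Reading with a watch registers a one-time callback.
        if watch is not None:
            self.watches[path].append(watch)
        return self.data.get(path)

    def set_data(self, path, value):
        self.data[path] = value
        # Remove the watches before firing them: each fires at most once.
        for callback in self.watches.pop(path, []):
            callback(path)


events = []
store = WatchedStore()
store.get_data("/config", watch=events.append)
store.set_data("/config", "v1")  # fires the watch
store.set_data("/config", "v2")  # no watch is registered any more
print(events)
```

To keep observing a znode, a client re-registers the watch when it reads the new value, typically inside the callback itself.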

Consistency Guarantees

ZooKeeper is designed for high performance, especially for read‑dominated workloads, and offers the following guarantees:

Sequential Consistency: Updates from a single client are applied in order.

Atomicity: Updates are all‑or‑nothing.

Single System Image: All clients see the same system state regardless of the server they connect to.

Reliability: Once an update is committed, it persists until overwritten.

Timeliness: A client's view of the system is guaranteed to be up to date within a bounded time window.

How ZooKeeper Works

Each server in a ZooKeeper ensemble assumes one of three roles (leader, follower, observer) and can be in one of four states (LOOKING, LEADING, FOLLOWING, OBSERVING). The core of ZooKeeper is the atomic broadcast protocol (Zab), which ensures ordered state updates.

Leader Election

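When the current leader fails, the ensemble enters recovery mode and elects a new leader using either a basic election algorithm or a fast election algorithm (FastLeaderElection, the default). These are often loosely called basic Paxos and fast Paxos, although ZooKeeper's elections are not Paxos in the strict sense. The basic election proceeds as follows: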

The election thread initiates the vote and collects responses.

It sends a query to all servers (including itself).

Responses are validated, and each server’s ID and proposed leader information are recorded.

The server with the highest zxid is selected as the candidate.

If the candidate obtains a majority (n/2 + 1) votes, it becomes the leader; otherwise the process repeats.

The fast Paxos election has each server propose itself as leader, resolve epoch and zxid conflicts, and converge on a single leader.
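The convergence step of the fast election can be sketched as a comparison of ballots. This is a simplified model under assumptions not spelled out in the text: ballots are compared by epoch, then zxid, then server id as a tie-breaker, every server can hear every other server, and the names `Vote`, `quorum`, and `elect` are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Vote(NamedTuple):
    # Field order matters: tuples compare by epoch, then zxid, then sid.
    epoch: int
    zxid: int
    sid: int


def quorum(ensemble_size):
    return ensemble_size // 2 + 1


def elect(servers):
    """servers: list of Vote, one per server, describing its own state."""
    ballots = {v.sid: v for v in servers}  # everyone first votes for itself
    while True:
        leader, count = Counter(ballots.values()).most_common(1)[0]
        if count >= quorum(len(servers)):
            return leader.sid
        # Each server adopts the best ballot it has heard from any peer.
        best = max(ballots.values())
        ballots = {sid: best for sid in ballots}


servers = [
    Vote(epoch=1, zxid=0x100000005, sid=1),
    Vote(epoch=1, zxid=0x100000007, sid=2),
    Vote(epoch=1, zxid=0x100000007, sid=3),
]
print(elect(servers))
```

Servers 2 and 3 hold the same, most recent zxid, so the higher server id decides between them in this model.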

Leader Workflow

The leader performs three main functions:

Recover data after a crash.

Maintain heartbeats with followers and process follower requests.

Handle follower messages such as PING, REQUEST, ACK, and REVALIDATE.

Follower Workflow

Followers:

Send PING, REQUEST, ACK, and REVALIDATE messages to the leader.

Receive and process messages from the leader.

Forward client write requests to the leader for voting.

Return results to clients.

Follower message types include PING (heartbeat), PROPOSAL (leader’s proposal), COMMIT (finalized transaction), UPTODATE (sync completion), REVALIDATE (session validation), and SYNC (client‑initiated state sync).

Zab: Broadcasting State Updates

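When a follower receives a write request from a client, it forwards the request to the leader, which executes it and broadcasts the resulting state change as a transaction. Commitment resembles a two‑phase commit, except that the leader needs ACKs from only a quorum rather than from every follower: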

Leader sends a PROPOSAL to all followers.

Each follower writes the proposal to disk and replies with an ACK.

Once the leader receives ACKs from a quorum, it sends a COMMIT.

The protocol guarantees total order of transactions across the ensemble and handles leader crashes by ensuring that any transaction committed by a crashed leader is re‑committed by the new leader.
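The propose, ACK, and commit steps can be sketched as a single-process simulation. This is an illustrative model, not the Zab implementation: the `Leader` and `Follower` classes are hypothetical, failures are reduced to an `up` flag, and the zxid layout (epoch in the high 32 bits, counter in the low 32 bits) is stated here as background knowledge rather than taken from the text above.

```python
def make_zxid(epoch, counter):
    # 64-bit id: high 32 bits hold the leader epoch, low 32 bits a counter.
    return (epoch << 32) | counter


class Follower:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.log = []        # stand-in for the on-disk transaction log
        self.committed = []

    def on_proposal(self, zxid, txn):
        if not self.up:
            return False     # no ACK from a crashed follower
        self.log.append((zxid, txn))
        return True          # ACK

    def on_commit(self, zxid):
        if self.up:
            self.committed.append(zxid)


class Leader:
    def __init__(self, epoch, followers):
        self.epoch = epoch
        self.counter = 0
        self.followers = followers
        self.committed = []

    def broadcast(self, txn):
        self.counter += 1
        zxid = make_zxid(self.epoch, self.counter)
        # The leader logs the proposal itself and counts as one ACK.
        acks = 1 + sum(f.on_proposal(zxid, txn) for f in self.followers)
        ensemble = len(self.followers) + 1
        if acks < ensemble // 2 + 1:
            return None      # no quorum, the transaction is not committed
        self.committed.append(zxid)
        for f in self.followers:
            f.on_commit(zxid)
        return zxid


leader = Leader(epoch=1, followers=[Follower("f1"), Follower("f2", up=False)])
print(hex(leader.broadcast("set /config v1")))
```

With three servers, one crashed follower does not block progress because two ACKs already form a quorum; with both followers down, `broadcast` returns `None`.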

Deployment

Basic Information Table

| Hostname | OS Version | IP Address | Installed Software |
| --- | --- | --- | --- |
| zookeeper-230 | CentOS 7.7 | 192.168.15.230 | JDK1.8, zookeeper‑3.6.2 |
| zookeeper-231 | CentOS 7.7 | 192.168.15.231 | JDK1.8, zookeeper‑3.6.2 |
| zookeeper-232 | CentOS 7.7 | 192.168.15.232 | JDK1.8, zookeeper‑3.6.2 |

System Information

Lab VM configuration: 1 CPU core, 2 GB RAM, 25 GB disk
[root@zookeeper-230 ~]# uname -a
Linux zookeeper-230 3.10.0-1062.18.1.el7.x86_64  #1 SMP Tue Mar 17 23:49:17 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@zookeeper-230 ~]# rpm -q centos-release
centos-release-7-7.1908.0.el7.centos.x86_64

Application Information

| Item | Path |
| --- | --- |
| Application Path | /usr/local/zookeeper3.6 |
| Configuration Path | /usr/local/zookeeper3.6/conf |
| Default Log Path | /usr/local/zookeeper3.6/logs |
| Custom Snapshot Log Path | /usr/local/zookeeper3.6/zkdata |
| Custom Transaction Log Path | /usr/local/zookeeper3.6/zklogs |
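A configuration for this three-node layout could be generated along the following lines. This is a sketch under assumptions: the article does not show its zoo.cfg, so the timing values and ports are the commonly used sample defaults, the assignment of server IDs 1 to 3 in IP order is hypothetical, and `render_zoo_cfg` is a helper invented for this example.

```python
HOSTS = ["192.168.15.230", "192.168.15.231", "192.168.15.232"]
BASE = "/usr/local/zookeeper3.6"


def render_zoo_cfg(hosts, base):
    lines = [
        "tickTime=2000",              # assumed sample default (ms)
        "initLimit=10",               # assumed sample default (ticks)
        "syncLimit=5",                # assumed sample default (ticks)
        "clientPort=2181",            # assumed default client port
        f"dataDir={base}/zkdata",     # snapshot path from the table above
        f"dataLogDir={base}/zklogs",  # transaction log path from the table
    ]
    # One server.N line per ensemble member: quorum port, then election port.
    for myid, host in enumerate(hosts, start=1):
        lines.append(f"server.{myid}={host}:2888:3888")
    return "\n".join(lines) + "\n"


print(render_zoo_cfg(HOSTS, BASE))
```

Each host would additionally need a `myid` file in its `dataDir` containing only its own number, so that the server can find its `server.N` entry.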

Author: 二价亚铁

Original link: https://www.cnblogs.com/xw-01/p/18263814

License: CC BY‑NC‑ND 2.5 China Mainland.
