What Makes Apache ZooKeeper the Backbone of Distributed Coordination?
This article explains Apache ZooKeeper's role as a reliable distributed coordination service, covering its ZAB protocol, hierarchical data model, node types, watches, session management, ACLs, serialization, cluster architecture, leader election, ZAB algorithm, log cleanup, distributed locks, ID generation, load balancing, and real‑world integrations with Dubbo and Kafka.
Basic Introduction
Apache ZooKeeper, originally a sub‑project of Apache Hadoop, provides an efficient and reliable distributed coordination service for distributed applications.
ZooKeeper does not use Paxos directly for consistency; it uses the ZAB (ZooKeeper Atomic Broadcast) protocol.
ZooKeeper offers features such as data publish/subscribe, load balancing, naming service, distributed coordination/notification, cluster management, master election, distributed lock, and distributed queue.
Key characteristics:
Sequential consistency : transactions from a client are applied in the order they were issued.
Atomicity : a transaction is applied to all servers or to none.
Single view : all clients see the same data model regardless of the server they connect to.
Reliability : once a transaction is committed, its effect persists until changed by another transaction.
Real‑time : after a transaction is committed, clients can read the latest state immediately (within a short period).
Data Model
ZooKeeper stores data in a hierarchical tree similar to a file system, with a fixed root node /. Each level is separated by a slash and must be addressed with an absolute path, e.g., get /work/task.
Why does ZooKeeper not support relative paths?
ZooKeeper is mainly used to locate nodes in the data model and operate on them.
Node lookup is best solved with a hash table, so ZooKeeper stores nodes in a hashtableConcurrentHashMap<String, DataNode> nodes keyed by the full path, which greatly improves performance.
Node types
ZooKeeper defines three node types:
1. Persistent node – remains after the client session ends; must be deleted explicitly.
2. Ephemeral node – disappears when the client session expires or is closed.
3. Sequential node – ZooKeeper appends a monotonically increasing number to the node name when it is created.
Sequential nodes allow easy observation of creation order, e.g., creating works/task- results in works/task-1.
Each data node stores a byte array ( byte data[]), ACL information, optional child list, and a stat structure containing metadata such as czxid, ctime, mzxid, mtime, pzxid, cversion, version, aversion, ephemeralOwner, dataLength, and numChildren.
Node versioning
Every data node has three version counters (data, children, ACL). Any update to the node increments the corresponding version.
Data Storage
Transaction logs and snapshots are stored on local disk; the in‑memory data model represents the live state of the tree.
Transaction logs record local session operations for synchronizing servers.
Snapshots persist the in‑memory tree to disk periodically.
Snapshots are taken at intervals, so the on‑disk data may lag behind the memory state.
When a server crashes, data loss may occur because logs are not yet flushed.
Memory data
The tree is kept in memory by the DataTree class, which holds nodes, the root, and watch information.
Transaction log
The leader forwards client requests to followers and observers; followers and observers synchronize state by applying the leader’s transaction log.
Watch Mechanism
Clients can register watches to be notified when a node’s data or state changes.
Implementation
When creating a ZooKeeper client, a Watcher instance is passed:
new ZooKeeper(String connectString, int sessionTimeout, Watcher watcher)The watcher is stored in ZKWatchManager and remains for the lifetime of the session. Clients can also register watches via getData, exists, and getChildren.
Watch events are triggered only for the session state and the specific event type that was registered.
Publish/subscribe example
Configuration data (e.g., /confs/data_item1) can be stored in ZooKeeper. Clients watch the node; when the data changes, ZooKeeper notifies all watchers, which then re‑read the configuration.
Watch events are one‑time; after receiving a notification, the client must re‑register the watch.
Session Mechanism
A client session consists of a session ID, timeout, and closing flag. Sessions transition through states such as CONNECTING, CONNECTED, RECONNECTING, and CLOSED.
Session timeout is negotiated between client and server; the effective timeout is the one that falls within the server’s allowed range.
Bucket strategy
Sessions are grouped into buckets based on expiration intervals, allowing the server to process expirations in batches rather than scanning every session.
ACL Permissions
ZooKeeper ACLs consist of scheme, ID, and permission. Schemes include IP‑based range validation, Digest (username:password) authentication, and Super (full access). Permissions are create, write, read, delete, and admin.
Custom authentication providers can be registered via the system property -Dzookeeper.authProvider.x=CustomAuthenticationProvider or in zoo.cfg as authProvider.x=CustomAuthenticationProvider.
Serialization
ZooKeeper uses the Jute serialization framework. Classes implement the Record interface with serialize and deserialize methods that write/read fields using OutpurArchive and INputArchive.
class test_jute implements Record {
private long ids;
private String name;
public void serialize(OutpurArchive a_, String tag) { ... }
public void deserialize(INputArchive a_, String tag) { ... }
}Cluster
ZooKeeper clusters consist of Leader, Follower, and Observer servers. The leader handles writes and coordinates replication; followers participate in leader election; observers serve reads only and do not vote.
All transactional requests are forwarded to the leader, which assigns a globally unique ZXID and replicates the operation to followers.
Leader election uses the FastLeaderElection algorithm, where each server votes with fields such as logicClock, state, self_id, self_zxid, vote_id, and vote_zxid. A majority of matching votes selects the new leader.
Observers improve scalability by handling read‑only traffic without participating in elections.
ZAB Protocol
ZAB (ZooKeeper Atomic Broadcast) ensures crash recovery and atomic broadcast. The leader assigns a monotonically increasing 64‑bit ZXID to each transaction. Followers acknowledge proposals; once a majority acknowledges, the transaction is committed.
Log Cleanup
ZooKeeper generates transaction logs and snapshots. Logs are periodically cleaned using Linux crontab scripts or the built‑in PurgeTxnLog utility.
#!/bin/sh
java -cp "$CLASSPATH" org.apache.zookeeper.server.PurgeTxnLog
echo "Cleanup complete"Distributed Lock
ZooKeeper can implement locks using either a single flag node or ordered ephemeral child nodes. The latter avoids deadlocks because the node is removed automatically when the client session expires.
To avoid the “herd effect,” each contender watches the node that precedes it in order, so only the next client is notified when the lock is released.
new ZooKeeper(...);
String lockPath = "/lock/lock01";
String myNode = zk.create(lockPath + "/node_", data, ACL.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
...Distributed ID
Sequential nodes can serve as unique, ordered IDs. Clients create a sequential node and use its name as the identifier.
Load Balancing
ZooKeeper can store server load information under a /servers node. Clients read the connection counts, select the server with the fewest connections, increment the count, and decrement it after processing.
Open‑Source Framework Use Cases
Dubbo uses ZooKeeper as a registry: providers register their URLs under /dubbo/com.foo.BarService/providers, and consumers watch this node for changes.
Kafka stores broker metadata and topic partitions in ZooKeeper paths such as /brokers/topics/[topic].
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
