Mastering ZooKeeper: Core Concepts and Real-World Big Data Applications
This article introduces ZooKeeper’s fundamental architecture, explains its key concepts such as cluster roles, sessions, ZNodes, watches, and ACLs, and then details how it powers essential distributed coordination tasks—including configuration management, naming services, master election, and distributed locks—in large‑scale Hadoop and HBase ecosystems.
1. Introduction
ZooKeeper is an open‑source distributed coordination service originally developed at Yahoo, modeled on Google's Chubby lock service. Applications can use ZooKeeper for data publish/subscribe, load balancing, naming, coordination/notification, cluster management, master election, distributed locks and queues.
2. Basic Concepts
This section introduces the core concepts of ZooKeeper that are referenced throughout the article.
2.1 Cluster Roles
ZooKeeper defines three server roles: Leader, Follower, and Observer.
At any moment a ZooKeeper ensemble has a single Leader; the others are Followers or Observers.
All nodes share the same `zoo.cfg` file; only the `myid` file differs. The `myid` value must match the `server.{id}` entry in `zoo.cfg`.
Example zoo.cfg:
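The original sample configuration did not survive formatting; a minimal three‑node example looks like the following (host names and the data directory are illustrative — `node‑20‑103` and `node‑20‑104` appear later in the article, `node‑20‑102` is assumed):

```properties
# Basic time unit in milliseconds
tickTime=2000
# Ticks allowed for followers to connect and sync with the leader
initLimit=10
# Ticks allowed between a request and an acknowledgement
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# server.{id}=host:peerPort:electionPort — {id} must match the myid file on that host
server.1=node-20-102:2888:3888
server.2=node-20-103:2888:3888
server.3=node-20-104:2888:3888
```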
Running `zookeeper-server status` on a machine shows its role (Leader or Follower). In the example, node‑20‑104 is the Leader and node‑20‑103 is a Follower.
ZooKeeper by default supports only the Leader and Follower roles. To enable Observer mode, add `peerType=observer` to the configuration of each node that should run as an observer, and append `:observer` to that node's server line, e.g. `server.1=localhost:2888:3888:observer`. The Leader serves both read and write requests; Followers and Observers serve reads directly and forward write requests to the Leader. Observers do not participate in leader election or the majority‑write quorum, so they improve read throughput without affecting write performance.
2.2 Session
A session represents a client’s long‑lived TCP connection to ZooKeeper. The default client port is 2181. When a client connects, a session starts; heartbeats keep it alive, and the client can send requests and receive watch events.
The SessionTimeout defines how long a session may be inactive before ZooKeeper considers it expired. If the client reconnects to any server within this timeout, the session remains valid.
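The timeout a client requests is negotiated against server‑side bounds derived from `tickTime`; by default ZooKeeper clamps it to the range [2 × tickTime, 20 × tickTime]. The relevant `zoo.cfg` knobs (values below are illustrative, not defaults you must set):

```properties
tickTime=2000
# Defaults if unset: minSessionTimeout = 2 * tickTime, maxSessionTimeout = 20 * tickTime
minSessionTimeout=4000
maxSessionTimeout=40000
```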
2.3 ZNode
In ZooKeeper, a ZNode is a data node in a hierarchical tree (paths separated by “/”, e.g., /hbase/master). Each ZNode stores data and metadata. ZNodes can be persistent or ephemeral.
Note: A ZNode is analogous to both a file and a directory in a Unix file system.
Persistent ZNodes remain until explicitly deleted. Ephemeral ZNodes are tied to the client session and disappear when the session ends.
ZooKeeper also supports the SEQUENTIAL flag, which appends an auto‑incremented integer to the node name at creation time.
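Concretely, the appended counter is a 10‑digit, zero‑padded number maintained by the parent node, so repeatedly creating `/lock/node-` yields `/lock/node-0000000000`, `/lock/node-0000000001`, and so on. A minimal sketch of that naming scheme (the helper function is illustrative, not a ZooKeeper API):

```python
def sequential_name(prefix: str, counter: int) -> str:
    """Mimic ZooKeeper's sequential-node naming: a 10-digit zero-padded suffix."""
    return f"{prefix}{counter:010d}"

# The parent node keeps a monotonically increasing counter per child creation.
names = [sequential_name("/lock/node-", i) for i in range(3)]
```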
2.4 Version
Each ZNode has a Stat structure that records three version numbers:
- `version`: the data version
- `cversion`: the children version
- `aversion`: the ACL version
2.5 Stat Information
Using the `get` command returns both the data and the Stat of a ZNode. The `version` field is used for optimistic locking to ensure atomic updates.
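The version‑based update works like compare‑and‑set: the client passes the version it last read, and the server rejects the write if the node has changed in the meantime (passing -1 skips the check, matching ZooKeeper's convention). A toy in‑memory sketch of that check — not the real server code; `BadVersionError` stands in for ZooKeeper's `BadVersionException`:

```python
class BadVersionError(Exception):
    pass

class ZNodeStore:
    """Toy store illustrating ZooKeeper's optimistic concurrency on set_data."""
    def __init__(self):
        self._nodes = {}  # path -> (data, version)

    def create(self, path, data):
        self._nodes[path] = (data, 0)

    def get(self, path):
        return self._nodes[path]  # (data, version)

    def set_data(self, path, data, expected_version):
        _, version = self._nodes[path]
        # -1 means "unconditional update"; otherwise the versions must match.
        if expected_version != -1 and expected_version != version:
            raise BadVersionError(f"expected {expected_version}, node is at {version}")
        self._nodes[path] = (data, version + 1)
        return version + 1
```

Two clients that both read version 0 cannot both succeed: the second `set_data` with the stale version raises, and that client must re-read before retrying.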
2.6 Transaction Operations
Operations that modify ZooKeeper state (node create/delete, data update, session create/expire) are transactions. Each transaction receives a globally unique 64‑bit ZXID, which defines a total order of updates.
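The ZXID is commonly described as two 32‑bit halves: the high half is the leader epoch (bumped at each election) and the low half a counter within that epoch. Decoding it:

```python
def decode_zxid(zxid: int):
    """Split a 64-bit ZXID into (epoch, counter)."""
    epoch = zxid >> 32            # high 32 bits: leader epoch
    counter = zxid & 0xFFFFFFFF   # low 32 bits: per-epoch transaction counter
    return epoch, counter
```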
2.7 Watcher
Watchers allow clients to register interest in specific ZNode events. When the event occurs, ZooKeeper notifies the interested clients, enabling distributed coordination.
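An important detail: watches are one‑shot. After a watch fires, the client must re‑register it to receive further events. A toy dispatcher illustrating that semantic (not the real client library):

```python
from collections import defaultdict

class WatchManager:
    """Toy one-shot watch registry mimicking ZooKeeper's semantics."""
    def __init__(self):
        self._watches = defaultdict(list)  # path -> list of callbacks

    def watch(self, path, callback):
        self._watches[path].append(callback)

    def trigger(self, path, event):
        # Watches fire once and are removed; clients re-register if needed.
        for cb in self._watches.pop(path, []):
            cb(path, event)
```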
2.8 ACL
ZooKeeper uses Access Control Lists (ACLs) with five permissions: CREATE, READ, WRITE, DELETE, and ADMIN. Note that CREATE and DELETE govern a node's children, not the node itself.
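In the Java client these permissions are bit flags (`org.apache.zookeeper.ZooDefs.Perms`); the sketch below reproduces the flag values to show how composite masks such as ALL are formed:

```python
# Permission bit flags, matching org.apache.zookeeper.ZooDefs.Perms
READ   = 1 << 0  # read data / list children
WRITE  = 1 << 1  # set data
CREATE = 1 << 2  # create child nodes
DELETE = 1 << 3  # delete child nodes
ADMIN  = 1 << 4  # set ACLs
ALL = READ | WRITE | CREATE | DELETE | ADMIN
```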
3. Typical ZooKeeper Use Cases
ZooKeeper’s high availability and strong consistency make it a powerful tool for solving distributed coordination problems.
3.1 Data Publish/Subscribe (Configuration Center)
Publishers write configuration data to ZooKeeper nodes; subscribers watch those nodes to receive updates. Configuration data is typically small, changes at runtime, and is shared across the cluster.
Two design patterns exist: push (server pushes updates) and pull (client polls). ZooKeeper combines both: clients register watches, and when data changes the server notifies the client, which then pulls the latest data.
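The combined flow can be sketched as: the server pushes only a "changed" notification, and the client then pulls the current data and re‑registers its watch. A toy in‑process model of that notify‑then‑pull pattern (all names are illustrative):

```python
class ConfigCenter:
    """Toy model of ZooKeeper's notify-then-pull configuration pattern."""
    def __init__(self):
        self._data = {}
        self._watchers = {}  # path -> one-shot callback

    def publish(self, path, value):
        self._data[path] = value
        watcher = self._watchers.pop(path, None)
        if watcher:
            watcher(path)  # push: only the event, never the data itself

    def get_and_watch(self, path, callback):
        self._watchers[path] = callback
        return self._data.get(path, "")  # pull: client fetches current data

def make_subscriber(center, seen):
    """Subscriber that re-reads and re-registers on every notification."""
    def on_change(path):
        seen.append(center.get_and_watch(path, on_change))
    return on_change
```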
3.2 Naming Service
Clients can resolve a logical name to a service address by reading a ZNode. Sequential nodes enable generation of globally unique identifiers, useful for RPC frameworks.
3.3 Distributed Coordination / Notification
Clients register watches on a ZNode; when the node changes, all interested clients receive notifications, allowing real‑time handling of data changes.
3.3.1 Heartbeat Detection
Processes create ephemeral child nodes under a designated ZNode. The existence of a child node indicates the process is alive; its disappearance signals failure, enabling decoupled heartbeat monitoring.
3.3.2 Progress Reporting
Tasks create ephemeral nodes to report progress. Other components can read these nodes to monitor execution status.
3.4 Master Election
ZooKeeper is widely used for master election in systems such as HDFS NameNode, YARN ResourceManager, and HBase HMaster. Clients attempt to create the same ephemeral node; only one succeeds and becomes the master. The others watch the node and re-run the election when it disappears.
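The race to create the same node can be modeled with create‑if‑absent semantics: the one client whose create succeeds is master, and when its session ends the ephemeral node vanishes and the watchers retry. A toy in‑process model (assumed names, not the ZooKeeper API):

```python
class ElectionNode:
    """Toy model of master election via a single ephemeral znode."""
    def __init__(self):
        self._owner = None  # session currently holding the master node

    def try_become_master(self, session_id):
        # Like create("/master", ephemeral=True): succeeds only if absent.
        if self._owner is None:
            self._owner = session_id
            return True
        return False  # NodeExists -> register a watch and wait

    def session_expired(self, session_id):
        # The ephemeral node is deleted; watchers would now retry the create.
        if self._owner == session_id:
            self._owner = None
```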
3.5 Distributed Locks
Locks are represented by ZNodes. To acquire an exclusive lock, a client creates an ephemeral node under a lock path; only one client can succeed. Others watch the lock node and retry when it is released. Shared locks allow multiple readers but block exclusive locks.
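The usual fair-lock recipe refines this: each client creates an ephemeral sequential node under the lock path, holds the lock iff it owns the lowest sequence number, and otherwise watches only its immediate predecessor (avoiding a thundering herd on release). The queue logic in miniature (an in-process sketch, not a real client):

```python
class LockQueue:
    """Toy model of the ephemeral-sequential distributed lock recipe."""
    def __init__(self):
        self._counter = 0
        self._waiters = {}  # sequence number -> client id

    def enqueue(self, client):
        # Like create("/locks/lock-", ephemeral=True, sequence=True).
        seq = self._counter
        self._counter += 1
        self._waiters[seq] = client
        return seq

    def holder(self):
        # The client with the lowest sequence number holds the lock.
        return self._waiters[min(self._waiters)] if self._waiters else None

    def release(self, seq):
        # Deleting the node wakes the watcher on the next-lowest sequence.
        self._waiters.pop(seq, None)
```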
4. ZooKeeper in Large‑Scale Distributed Systems
4.1 ZooKeeper in Hadoop
ZooKeeper provides HA for the HDFS NameNode and YARN ResourceManager, and stores YARN application state. Leader election uses a lock node such as `/yarn-leader-election/appcluster-yarn/ActiveBreadCrumb`. Standby ResourceManagers register watchers on this node; when the active ResourceManager fails, the lock node disappears and a new leader is elected.
ResourceManager state (the RMStateStore) can be persisted in memory, on a filesystem, or in ZooKeeper; because the state is small, ZooKeeper is the recommended store.
4.2 ZooKeeper in HBase
HBase relies on ZooKeeper for HMaster election, system fault tolerance, RootRegion management, Region state management, and distributed SplitWAL task coordination. RegionServers create ephemeral nodes under `/hbase/rs`; the HMaster watches these nodes to detect RegionServer failures and trigger recovery.
The RootRegion location is stored in `/hbase/meta-region-server`. Changes to the RootRegion or to Region assignments are propagated via ZooKeeper watches.
SplitWAL tasks are coordinated through a ZNode (e.g., `/hbase/SplitWAL`) that records which RegionServer processes which WAL segment.
References
“From Paxos to ZooKeeper”.