Mastering Zookeeper: Core Concepts, Architecture, and Leader Election Explained
This article provides a comprehensive overview of Zookeeper, covering its purpose as a distributed coordination service, data structures, system architecture, ZAB protocol, leader election mechanisms, watcher functionality, key features, request ordering, synchronization methods, and potential consistency issues.
1. What is Zookeeper and what can it do?
Zookeeper is an open‑source centralized service used to maintain configuration information, naming, provide distributed synchronization, and offer group services.
Based on Zookeeper you can implement data publish/subscribe, load balancing, naming services, distributed coordination/notification, cluster management, master election, distributed locks, and distributed queues.
The most common use case is as a registry (service registration center): providers register themselves in Zookeeper, consumers fetch the provider list and invoke the services on it; frameworks such as Dubbo and Kafka use Zookeeper in exactly this way.
2. Zookeeper data structures
Zookeeper's namespace resembles a standard file system: paths are slash‑separated, each znode is identified by its path, has a parent (except the root "/"), and cannot be deleted if it has children.
Unlike a file system, each znode can store associated data, but only a small amount (typically kilobytes). Zookeeper enforces a limit of roughly 1 MB per znode as a sanity check, precisely to prevent it from being misused as a large data store.
Three types of znodes:
Persistent node : remains after client disconnects.
Ephemeral node : automatically removed when the client disconnects.
Sequential node : a monotonically increasing, zero‑padded 10‑digit counter is appended to the node name on creation, making the path unique.
Four forms of znodes:
Persistent node (e.g., create /test/a "hello").
Persistent sequential node (use -s flag).
Ephemeral node (use -e flag).
Ephemeral sequential node (use -s -e flags).
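The four forms above map directly to the CreateMode values of the official Java client. Below is a minimal sketch using org.apache.zookeeper.ZooKeeper; the connect string, session timeout, paths, and data are placeholder values, not anything this article prescribes.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

import java.util.concurrent.CountDownLatch;

public class ZnodeTypesDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect string and session timeout are placeholder values.
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 15000, event -> {
            if (event.getState() == KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        byte[] data = "hello".getBytes();

        // Persistent node: survives client disconnects.
        zk.create("/test-persistent", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Persistent sequential node (-s): a 10-digit counter is appended, e.g. /test-seq0000000001.
        zk.create("/test-seq", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        // Ephemeral node (-e): removed automatically when this session ends.
        zk.create("/test-ephemeral", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // Ephemeral sequential node (-s -e): both behaviors combined.
        zk.create("/test-eph-seq", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        zk.close();
    }
}
```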
3. What does a znode store?
A znode contains data, an ACL (access control list), child references, and stat (metadata such as transaction IDs, versions, and timestamps).
data : business information stored in the znode.
acl : permissions (e.g., IP) for client access.
child : references to child znodes.
stat : status info such as transaction ID, version, timestamps.
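As an illustration, the data, children, and stat fields can be read with the standard client API. The sketch below assumes an already‑connected client named zk and an existing node /test/a (both placeholders).

```java
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

import java.util.List;

public class ZnodeInspector {
    // Assumes "zk" is an already-connected client and "/test/a" exists.
    static void inspect(ZooKeeper zk) throws Exception {
        Stat stat = new Stat();
        byte[] data = zk.getData("/test/a", false, stat);          // data: business payload
        List<String> children = zk.getChildren("/test/a", false);  // child references

        System.out.println("data     = " + new String(data));
        System.out.println("czxid    = " + stat.getCzxid());    // zxid of the transaction that created the node
        System.out.println("mzxid    = " + stat.getMzxid());    // zxid of the last modification
        System.out.println("version  = " + stat.getVersion());  // data version, bumped by every setData
        System.out.println("ctime    = " + stat.getCtime());    // creation time in ms
        System.out.println("children = " + children);
    }
}
```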
4. Zookeeper system architecture
Zookeeper consists of servers and clients. A client opens a TCP connection to any server in the ensemble (by default the Leader serves clients as well; this can be switched off with leaderServes=no) to send requests and receive responses.
Servers maintain an in‑memory state, a transaction log, and snapshots. As long as a majority of servers are up, the service remains available.
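For example, a client is handed the address list of the whole ensemble and picks one server to talk to; the host names below are placeholders.

```java
import org.apache.zookeeper.ZooKeeper;

public class ConnectToEnsemble {
    public static void main(String[] args) throws Exception {
        // The client picks one server from the list; if that server fails,
        // the session transparently moves to another server in the same list.
        ZooKeeper zk = new ZooKeeper(
                "zk1:2181,zk2:2181,zk3:2181",    // comma-separated server list (placeholder hosts)
                15000,                           // session timeout in ms
                event -> System.out.println("connection state: " + event.getState()));
        // ... use zk, then close it ...
        zk.close();
    }
}
```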
In a Zookeeper ensemble, servers have three roles: Leader, Follower, and Observer.
Leader : initiates voting, updates system state, writes data.
Follower : serves client read requests, forwards write requests to the Leader, and votes in proposals and Leader elections.
Observer : accepts client connections, forwards write requests to the Leader, does not vote, and syncs with the Leader to improve read throughput.
Separating the roles keeps the write quorum small, so a modest cluster stays fast on writes, while adding Observers scales read throughput without enlarging that quorum.
Clusters should have an odd number of voting nodes; the ensemble stays functional as long as a majority are operational (a five‑node ensemble tolerates two failures, exactly as a six‑node one does, so the extra even node buys no additional fault tolerance).
During startup, a Leader is elected; the Leader processes data updates, and a write is considered committed only once a majority of the servers have acknowledged it.
Data consistency relies on the ZAB (Zookeeper Atomic Broadcast) protocol.
5. ZAB protocol
ZAB is an atomic broadcast protocol designed for Zookeeper that supports crash recovery and ensures distributed data consistency.
It operates in two modes: crash recovery and message broadcasting.
Crash recovery : When the service starts or a Leader fails, ZAB enters recovery, elects a new Leader, and waits for a majority of servers to sync before exiting recovery.
Message broadcasting : Once a majority of Followers have synced with the Leader, the cluster enters broadcast mode. New servers join by syncing with the Leader before participating in broadcasts. Only the Leader processes client transactions; Followers forward client requests to the Leader.
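To make the broadcast mode concrete, here is a deliberately simplified sketch of the two‑phase flow (proposal tagged with a fresh zxid, wait for a majority of ACKs, then commit). The class and method names are invented for illustration and do not correspond to ZooKeeper's real server code.

```java
import java.util.List;

// Illustrative pseudocode of ZAB's broadcast mode; not ZooKeeper's actual implementation.
interface FollowerLink {
    boolean proposeAndAwaitAck(long zxid, byte[] txn); // PROPOSAL -> ACK
    void commit(long zxid);                            // COMMIT
}

class LeaderSketch {
    private final long epoch;                 // current Leader epoch
    private long counter = 0;                 // per-epoch proposal counter
    private final List<FollowerLink> followers;
    private final int ensembleSize;           // voting members: Leader + Followers

    LeaderSketch(long epoch, List<FollowerLink> followers) {
        this.epoch = epoch;
        this.followers = followers;
        this.ensembleSize = followers.size() + 1;
    }

    /** Broadcast one write transaction; returns true if it was committed. */
    boolean broadcast(byte[] txn) {
        long zxid = (epoch << 32) | (++counter); // high 32 bits: epoch, low 32 bits: counter
        int acks = 1;                            // the Leader counts as its own ACK
        for (FollowerLink f : followers) {
            if (f.proposeAndAwaitAck(zxid, txn)) {
                acks++;
            }
        }
        if (acks * 2 > ensembleSize) {           // strict majority reached
            followers.forEach(f -> f.commit(zxid));
            return true;                         // apply to the in-memory tree, reply to the client
        }
        return false;
    }
}
```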
6. How does Zookeeper perform Leader election on initialization?
Take a freshly started three‑server ensemble as an example; as soon as a majority of the servers (two of the three) are up, election proceeds as follows:
(1) Each server votes for itself, sending its myid and zxid to others.
(2) Servers receive votes and validate them (same election round, LOOKING state).
(3) Servers compare received votes with their own: higher zxid wins; if equal, higher myid wins.
(4) Servers count votes; when a majority agree on the same vote, that server becomes Leader.
(5) Servers update their state: Followers become FOLLOWING, the elected server becomes LEADING. New servers joining later adopt FOLLOWING directly.
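The comparison rule of step (3) can be written as a small function. This is only a sketch with invented names, not ZooKeeper's actual FastLeaderElection code, and it omits the election‑round check from step (2).

```java
// Sketch of the election comparison rule: a received vote replaces our current
// vote only if it is "better".
class Vote {
    final long zxid;  // newest transaction id the candidate server has seen
    final long myid;  // the candidate server's configured id

    Vote(long zxid, long myid) {
        this.zxid = zxid;
        this.myid = myid;
    }
}

class ElectionRule {
    /** True if the received vote should replace the vote we currently hold. */
    static boolean shouldReplace(Vote received, Vote current) {
        if (received.zxid != current.zxid) {
            return received.zxid > current.zxid;  // newer data wins
        }
        return received.myid > current.myid;      // tie-break: larger server id wins
    }
}
```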
7. How is a new Leader elected after a crash?
1. Remaining non‑Observer servers change state to LOOKING and start a new election.
2. Each non‑Observer issues a vote (same as startup).
3. Servers receive and process votes using the same rules.
4. Votes are counted; a majority determines the new Leader.
5. Servers update their state accordingly.
6. The process mirrors the initial election.
8. Watcher mechanism and its principle
Steps:
Service registration : Provider registers its service by creating a znode.
Service discovery : Consumer fetches the registered info, sets a watch, caches the data locally, and invokes the service.
Service notification : If a provider goes down, its znode is deleted; Zookeeper asynchronously notifies all watching consumers, which then refresh their local cache.
In simple terms, a client registers a watcher on a znode; when that znode changes, the client receives a ZooKeeper notification.
Four characteristics of watchers:
One‑time: after a watch fires, it is removed; to continue watching, the client must set a new watch.
Client‑side serial processing: watch callbacks are executed sequentially, so a slow callback can block others.
Lightweight: a watch event contains only status, type, and path; the client must fetch the actual data.
Asynchronous: notifications are sent asynchronously; ZooKeeper guarantees eventual consistency but not strong consistency.
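Putting the steps together, a consumer‑side watcher might look like the following sketch built on the official Java client; the znode path and error handling are placeholders, and the re‑registration in process() is exactly what the one‑time characteristic above forces the client to do.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ProviderWatcher implements Watcher {
    private final ZooKeeper zk;
    private final String path;  // placeholder, e.g. "/services/demo/provider-0000000001"

    public ProviderWatcher(ZooKeeper zk, String path) {
        this.zk = zk;
        this.path = path;
    }

    /** Read the node and register this object as a one-shot watcher. */
    public byte[] fetchAndWatch() throws Exception {
        Stat stat = new Stat();
        return zk.getData(path, this, stat);  // passing "this" sets the watch
    }

    @Override
    public void process(WatchedEvent event) {
        // A watch event carries only type, state, and path -- never the data itself.
        if (event.getType() == Event.EventType.NodeDeleted) {
            // Provider went away: drop it from the local cache.
            return;
        }
        try {
            byte[] latest = fetchAndWatch(); // watches are one-shot: re-register here
            // ... refresh the local cache with "latest" ...
        } catch (Exception e) {
            e.printStackTrace();             // placeholder error handling
        }
    }
}
```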
9. Zookeeper features
Sequential consistency : Leader assigns a monotonically increasing ZXID to preserve request order.
Atomicity : Transactions either succeed on all servers or fail on all.
Single system image : All clients see the same data regardless of which server they connect to.
Reliability : Once a transaction is applied and acknowledged, its state change persists.
Timeliness : Clients eventually read the latest state within a bounded time.
10. How does Zookeeper order requests?
When the Leader receives a request, it assigns a globally unique, incrementing transaction ID (zxid) and places the request in a FIFO queue, which is then sent to all Followers in order.
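The zxid itself is a 64‑bit number whose high 32 bits hold the Leader's epoch and whose low 32 bits hold a counter incremented for every transaction, which is what makes it both globally unique and monotonically increasing. A tiny illustration:

```java
public class ZxidParts {
    public static void main(String[] args) {
        long zxid = 0x0000000500000003L;     // example value: epoch 5, counter 3
        long epoch   = zxid >>> 32;          // high 32 bits: Leader epoch
        long counter = zxid & 0xFFFFFFFFL;   // low 32 bits: per-epoch counter
        System.out.println("epoch=" + epoch + ", counter=" + counter); // epoch=5, counter=3
    }
}
```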
11. Data synchronization after Leader election
The Leader writes requests, assigns ZXIDs, and records the highest processed ZXID ( maxZXID ) and the lowest ( minZXID ). Followers/Observers track their latest synced ZXID as lastSyncZXID .
Synchronization methods:
Diff (incremental) sync : used when minZXID ≤ lastSyncZXID ≤ maxZXID. The Leader sends a DIFF command followed by the missing proposals; Followers apply them and ACK, and once a majority have ACKed, the Leader sends UPTODATE.
Rollback sync : If a follower's lastSyncZXID is ahead of the Leader's maxZXID, it rolls back to maxZXID.
Rollback + diff sync : After a crash, a new Leader may lack some proposals; followers roll back and then perform diff sync.
Full sync : When lastSyncZXID is behind minZXID or the Leader lacks a cache, it sends SNAP commands for complete data transfer.
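Under these definitions, choosing between the methods reduces to comparing the follower's lastSyncZXID with the Leader's [minZXID, maxZXID] range. The sketch below only expresses that selection logic; the enum and parameter names are invented and this is not ZooKeeper's actual synchronization code.

```java
// Illustrative selection of the sync method after an election; names are invented.
enum SyncMode { DIFF, TRUNC, TRUNC_THEN_DIFF, SNAP }

class SyncChooser {
    static SyncMode choose(long lastSyncZXID, long minZXID, long maxZXID,
                           boolean hasProposalsUnknownToLeader) {
        if (lastSyncZXID > maxZXID) {
            return SyncMode.TRUNC;               // rollback: follower is ahead of the new Leader
        }
        if (lastSyncZXID >= minZXID) {
            return hasProposalsUnknownToLeader
                    ? SyncMode.TRUNC_THEN_DIFF   // roll back the orphaned proposals, then diff sync
                    : SyncMode.DIFF;             // incremental sync from the Leader's proposal cache
        }
        return SyncMode.SNAP;                    // too far behind the cache: full snapshot transfer
    }
}
```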
12. Can Zookeeper experience data inconsistency?
Yes. Zookeeper uses a majority‑write rule: in a three‑node cluster, if two nodes successfully write, the write is considered committed. If a client reads from the third node before it has applied the write, it may see stale data, leading to temporary inconsistency.
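A client that cannot tolerate this window can call sync() on its server before reading, which forces that server to catch up with the Leader first. A minimal sketch, assuming an already‑connected client zk and a placeholder path:

```java
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

import java.util.concurrent.CountDownLatch;

public class LatestRead {
    // Assumes "zk" is an already-connected client.
    static byte[] readLatest(ZooKeeper zk, String path) throws Exception {
        CountDownLatch latch = new CountDownLatch(1);
        // sync() asks the connected server to catch up with the Leader before we read.
        zk.sync(path, (rc, p, ctx) -> latch.countDown(), null);
        latch.await();
        return zk.getData(path, false, new Stat());
    }
}
```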