Understanding ZooKeeper Architecture and FastLeaderElection: A Deep Dive
This article explains ZooKeeper's distributed coordination architecture, the ZAB consensus protocol, server roles, write and read workflows, FastLeaderElection mechanics, configurable election algorithms, and how ZooKeeper can be used to implement reliable distributed locks and leader election.
ZooKeeper Overview
ZooKeeper is a distributed coordination service that provides features such as service discovery, distributed locks, leader election, and configuration management. It stores data in a lightweight in‑memory hierarchical tree, similar to a tiny file system, suitable only for small metadata.
Server Roles
Leader: The sole server that orders and processes all write requests, broadcasts proposals to Followers, informs Observers of committed changes, and maintains heartbeats with both.
Follower: Handles read requests locally, forwards write requests to the Leader, and votes on proposals and in leader elections.
Observer: Behaves like a Follower but has no voting rights, so it neither ACKs proposals nor takes part in elections.
Atomic Broadcast (ZAB) Protocol
ZAB ensures consistency and crash‑recovery for write operations. All writes must go through the Leader, which logs the operation locally and replicates it to Followers. If the Leader fails, Followers elect a new Leader using the election algorithm.
Write Path via Leader
Client sends a write request to the Leader.
Leader creates a Proposal and sends it to all Followers, awaiting ACKs.
Followers ACK the proposal.
Leader receives ACKs from a majority of voting servers (counting its own), then sends a Commit to Followers and an INFORM message carrying the committed data to Observers.
Leader returns the result to the client.
Important notes:
Observers do not send ACKs.
Only a majority of ACKs is required; the Leader does not need all Followers.
Observers still receive the committed data to serve read requests.
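The majority rule above can be sketched in a few lines of Python. This is a minimal in-memory model, not ZooKeeper code: the `Ensemble` class, its field names, and the server ids are all hypothetical, and for simplicity a committed value is delivered to every server at once.

```python
from dataclasses import dataclass, field


@dataclass
class Ensemble:
    followers: list                                  # ids of voting Followers
    observers: list                                  # ids of non-voting Observers
    committed: dict = field(default_factory=dict)    # server id -> committed values

    def quorum_size(self) -> int:
        # Voting members are the Leader plus the Followers; Observers are excluded.
        voters = 1 + len(self.followers)
        return voters // 2 + 1

    def write(self, value, acking_followers) -> bool:
        # The Leader's own ACK counts; ACKs from non-Followers are ignored.
        acks = 1 + len(set(acking_followers) & set(self.followers))
        if acks < self.quorum_size():
            return False                             # proposal is not committed
        # Simplification: every server, Observers included, learns the commit.
        for sid in ["leader", *self.followers, *self.observers]:
            self.committed.setdefault(sid, []).append(value)
        return True
```

With four Followers and one Observer there are five voters, so the quorum is three: the Leader plus any two Followers is enough to commit, and an Observer never changes that count.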
Write Path via Follower/Observer
Followers and Observers can accept write requests but must forward them to the Leader. The rest of the flow is identical to the direct Leader path.
Read Operations
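Leader, Followers, and Observers can serve reads directly from their local memory. Adding more Followers or Observers improves read throughput because reads do not require inter‑server communication. The trade‑off is that a server answers from its own copy of the data, so a read may briefly lag behind the most recently committed write.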
Election Algorithms
The electionAlg configuration selects the algorithm used for leader election. Up to ZooKeeper 3.4.10 the options are:
0 – UDP‑based LeaderElection
1 – UDP‑based FastLeaderElection
2 – UDP + authentication FastLeaderElection
3 – TCP‑based FastLeaderElection (default)
Algorithms 0‑2 are deprecated and will be removed in future releases.
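As an illustration of the option values listed above, the following sketch reads electionAlg from the text of a configuration file. The helper name `election_alg` and the sample configuration are assumptions made for this example; the parser only handles simple `key=value` lines and is not ZooKeeper's own configuration loader.

```python
ELECTION_ALGS = {
    0: "UDP-based LeaderElection",
    1: "UDP-based FastLeaderElection",
    2: "UDP + authentication FastLeaderElection",
    3: "TCP-based FastLeaderElection",
}
DEFAULT_ALG = 3  # the article states that 3 is the default


def election_alg(cfg_text: str) -> int:
    """Return the electionAlg value found in cfg_text, or the default."""
    alg = DEFAULT_ALG
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "electionAlg":
            alg = int(value.strip())
    if alg not in ELECTION_ALGS:
        raise ValueError(f"unknown electionAlg: {alg}")
    return alg


SAMPLE_CFG = """\
tickTime=2000
dataDir=/var/lib/zookeeper
# electionAlg is optional; leaving it out selects the default
electionAlg=3
"""
```

Calling `election_alg(SAMPLE_CFG)` returns 3, and `ELECTION_ALGS[3]` gives the matching description. A file without an electionAlg line also yields 3.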
FastLeaderElection Details
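Each server has a unique id stored in its myid file, keeps a monotonically increasing logicClock that numbers the election round, and tracks the highest zxid it has seen. During an election each server broadcasts a vote containing (logicClock, state, myid, zxid, vote_id, vote_zxid), where vote_id and vote_zxid identify the candidate it currently supports. The election proceeds as follows: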
Increment logicClock to start a new round.
Clear the local vote box.
Broadcast a vote for itself.
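Receive external votes and compare logicClock: ignore a vote from an older round, and when the incoming logicClock is larger, adopt it and clear the vote box. Within the same round, switch to the external vote if its candidate has a higher zxid, or a higher myid when the zxids tie.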
Count votes; if a majority supports the same candidate, the election ends.
Update server state to LEADING if elected, otherwise FOLLOWING.
Key edge cases include handling out‑of‑order logicClock values, vote replacement when a newer vote has a larger zxid, and ensuring observers keep their data in sync without voting.
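The round described above can be sketched as a small simulation. This is a simplified model under stated assumptions, not the real FastLeaderElection class: the `Server` class and `run_election` helper are hypothetical, messages are delivered reliably and in order, every server is a voter, and a vote is reduced to the pair (vote_zxid, vote_id) so that Python's tuple comparison gives "higher zxid wins, higher myid breaks ties".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Server:
    myid: int
    zxid: int
    logic_clock: int = 0
    vote: Optional[tuple] = None               # (vote_zxid, vote_id)
    box: dict = field(default_factory=dict)    # voter myid -> that voter's vote

    def start_round(self):
        self.logic_clock += 1                  # new election round
        self.box.clear()                       # clear the local vote box
        self.vote = (self.zxid, self.myid)     # vote for itself
        self.box[self.myid] = self.vote
        return (self.logic_clock, self.myid, self.vote)

    def receive(self, msg) -> bool:
        """Process one external vote; return True if our round or vote changed."""
        clock, sender, vote = msg
        if clock < self.logic_clock:
            return False                       # vote from an older round: ignore
        before = (self.logic_clock, self.vote)
        if clock > self.logic_clock:           # we are behind: join the newer round
            self.logic_clock = clock
            self.box.clear()
            self.vote = (self.zxid, self.myid)
        if vote > self.vote:                   # higher zxid, then higher myid
            self.vote = vote
        self.box[self.myid] = self.vote
        self.box[sender] = vote
        return (self.logic_clock, self.vote) != before

    def result(self, voters: int):
        supporters = sum(1 for v in self.box.values() if v == self.vote)
        if supporters <= voters // 2:
            return None                        # no majority yet
        return "LEADING" if self.vote[1] == self.myid else "FOLLOWING"


def run_election(servers):
    outbox = [s.start_round() for s in servers]
    while outbox:                              # rebroadcast until votes settle
        pending = []
        for msg in outbox:
            for s in servers:
                if s.myid != msg[1] and s.receive(msg):
                    pending.append((s.logic_clock, s.myid, s.vote))
        outbox = pending
    return [s.result(len(servers)) for s in servers]
```

For example, with `Server(1, 10)`, `Server(2, 12)`, and `Server(3, 12)`, servers 2 and 3 tie on zxid, so the higher myid wins and server 3 ends up LEADING while the other two are FOLLOWING.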
Failover and Data Consistency
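When a Leader fails, Followers start a new election. The newly elected Leader then brings each Follower in line with its own log: it sends TRUNC so the Follower deletes entries that the new Leader does not have, sends the committed entries the Follower is missing, and completes the handshake with the NEWLEADER and UPTODATE commands. Only entries acknowledged by a majority are committed and become visible to clients; a proposal that the failed Leader had not replicated to a majority, and that is absent from the new Leader's log, is discarded and never exposed.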
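A rough sketch of how a new Leader might decide what to send a Follower is shown below. It is a deliberate simplification: logs are plain ascending lists of zxids, `sync_plan` and the tuple layout are hypothetical, a zxid of 0 stands for "truncate everything", and DIFF is used here as the name of the message that carries missing entries. Real ZooKeeper can also fall back to a full snapshot, which this sketch leaves out.

```python
def sync_plan(leader_log, follower_log):
    """Return the ordered list of (command, payload) pairs for one Follower."""
    # Length of the prefix on which both logs agree.
    common = 0
    for mine, theirs in zip(leader_log, follower_log):
        if mine != theirs:
            break
        common += 1

    plan = []
    if common < len(follower_log):
        # The Follower holds entries the new Leader never had: roll them back
        # to the last zxid both sides share.
        last_shared = follower_log[common - 1] if common else 0
        plan.append(("TRUNC", last_shared))

    missing = leader_log[common:]
    if missing:
        # Entries the Follower still needs to catch up.
        plan.append(("DIFF", missing))

    # Handshake that ends synchronization.
    plan.append(("NEWLEADER", None))
    plan.append(("UPTODATE", None))
    return plan
```

A Follower with log `[1, 2, 3, 4]` facing a Leader with log `[1, 2, 5]` is told to truncate back to zxid 2 and then receives entry 5. A Follower that already matches the Leader only sees the NEWLEADER and UPTODATE handshake.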
Distributed Lock with ZooKeeper
Locks are implemented using Ephemeral nodes. Clients attempt to create a designated lock node (e.g., /zkroot/leader). In the non‑fair mode the first client to succeed becomes the lock holder (Leader); others watch the node and retry when it disappears. In the fair mode clients create Ephemeral Sequential nodes; the client with the smallest sequence number holds the lock, and each client watches the node with the next‑lower sequence number.
When the lock holder releases the lock (or crashes), its Ephemeral node is automatically removed, triggering the watch on the next client, which then becomes the new lock holder.
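The fair mode described above comes down to one decision per client: "am I the smallest node, and if not, which node do I watch?" The sketch below models only that decision on a plain list of child names. The `lock_state` function and the `lock-` prefix are hypothetical; the zero-padded numeric suffix mirrors the counter ZooKeeper appends to Sequential nodes. No ZooKeeper client library is used.

```python
def lock_state(children, me):
    """Return (holds_lock, node_to_watch) for the client that owns node `me`.

    children: names of all Ephemeral Sequential nodes under the lock path,
              for example 'lock-0000000003'.
    """
    # Order by the numeric sequence suffix, not by raw string.
    ordered = sorted(children, key=lambda name: int(name.rsplit("-", 1)[1]))
    index = ordered.index(me)
    if index == 0:
        return True, None                 # smallest sequence number holds the lock
    return False, ordered[index - 1]      # watch only the immediate predecessor


children = ["lock-0000000002", "lock-0000000000", "lock-0000000001"]
```

Here `lock_state(children, "lock-0000000001")` reports that the client does not hold the lock and should watch `lock-0000000000`. Once that node disappears and the client re-reads the children, it becomes the holder. Because each client watches only its predecessor, a release wakes one waiter instead of all of them.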
Key Takeaways
ZooKeeper uses a primary‑backup replication model: writes go through a single Leader, while reads can be served by any server, which makes it well suited to read‑heavy workloads.
FastLeaderElection selects a single Leader backed by a majority of voting servers, so the ensemble has no single point of failure and stays available as long as a majority of voters is running.
ZAB ensures that committed data survives failover, while proposals discarded during recovery never become visible.
Distributed locks and leader election share the same underlying ZooKeeper primitives (Ephemeral and Sequential znodes) and watch mechanisms.
dbaplus Community
