Zookeeper Leader Election Mechanism: Concepts, Algorithms, and Implementation Details
This article explains Zookeeper's leader election mechanism, covering server identifiers, election states, vote structures, the two types of leader elections, the FastLeaderElection algorithm, quorum calculation, and the underlying network and queue components that enable reliable distributed consensus.
Concepts in the Selection Mechanism
serverId (Server ID, also myid)
For example, with three servers numbered 1, 2, and 3, a larger number gives a higher weight in the selection algorithm.
zxid (latest transaction ID, also LastLoggedZxid)
The maximum data ID stored on a server; a larger ID means newer data, giving that server more weight in the election.
epoch (logical clock, also PeerEpoch)
Each server votes for itself; the logical clock value is the same within a voting round and increments after each vote. Votes from previous rounds are ignored.
Election States
LOOKING – candidate state.
FOLLOWING – follower state, synchronizing with the leader and participating in voting.
OBSERVING – observer state, synchronizing with the leader but not voting.
LEADING – leader state.
Election Algorithms
The electionAlg property in zoo.cfg selects an algorithm (1‑3). Understanding them requires basic Paxos theory.
1 – LeaderElection algorithm.
2 – AuthFastLeaderElection algorithm.
3 – FastLeaderElection (default).
Vote Content
Voter ID.
Voter data ID (zxid).
Voter election round.
Voter election state.
Proposer ID.
Proposer election round.
Leader Election
There are two kinds of leader election: (1) the election that occurs when servers are first started, and (2) the election that occurs during runtime when a leader becomes unreachable.
Server Initialization Leader Election
(1) Each server votes for itself, sending a vote of the form (myid, zxid). For the first two servers, the votes are (1,0) and (2,0).
(2) Servers receive votes from others and first check vote validity (same round, LOOKING state).
(3) Vote processing compares received votes with the server's own vote.
(1) Prioritize zxid – the larger zxid wins.
(2) If zxids are equal, the larger myid wins.
Server 1 receives Server 2’s vote (2,0), sees equal zxid and a larger myid, updates its vote to (2,0) and re‑votes. Server 2 keeps its vote.
(4) After each round, servers count votes; when a majority (quorum) agrees on the same vote, a leader is chosen.
(5) Servers update their state: FOLLOWING if they are followers, LEADING if they become the leader.
Server Runtime Leader Election
When the current leader crashes, remaining servers change to LOOKING and start a new election.
(1) Each server sends a vote containing its myid and current zxid (e.g., Server 1 votes (1,123), Server 3 votes (3,122)).
(2) Servers receive votes, validate them, and process them using the same rules as during initialization.
(3) The server with the highest zxid (or highest myid if zxids tie) becomes the new leader.
(4) Vote counting and state updates follow the same steps as the initialization phase.
Leader Election Algorithm Analysis
Since Zookeeper 3.4.0 only the TCP‑based FastLeaderElection algorithm remains. A leader election can start either because a server is initializing or because a running server loses connection to the leader.
First Vote
If no leader exists, each server initially votes for itself. For a 5‑node cluster with SIDs 1‑5 and zxids 9,9,9,8,8, the initial votes are (3,9), (4,8) and (5,8) for the three non‑leader nodes.
Vote Change
Each server compares received votes with its own using the following rules (assuming equal epochs):
If the received zxid is greater, adopt the received vote.
If the received zxid is smaller, keep the current vote.
If zxids are equal, adopt the vote with the larger SID.
If both zxid and SID are smaller, keep the current vote.
Applying these rules, Server 4 changes its vote to (3,9) after receiving a higher zxid, and Server 5 does the same.
Determine Leader
After the second round, each server counts votes. With a quorum of 3 (5/2+1), the vote (3,9) reaches a majority, so Server 3 becomes the leader.
Summary
In Zookeeper, the server with the newest data (largest zxid) is most likely to become leader; if zxids tie, the server with the larger SID wins.
Implementation Details
1. Server States
LOOKING – searching for a leader.
FOLLOWING – follower state.
LEADING – leader state.
OBSERVING – observer state.
2. Vote Data Structure
id – SID of the proposed leader.
zxid – transaction ID of the proposed leader.
electionEpoch – logical clock for the election round.
peerEpoch – epoch of the proposed leader.
state – current server state.
3. QuorumCnxManager: Network I/O
recvQueue – queue for incoming messages.
queueSendMap – per‑SID queues for outgoing messages.
senderWorkerMap – per‑SID send workers.
lastMessageSent – most recent message per SID.
Servers establish TCP connections on port 3888; the server with the larger SID initiates the connection to avoid duplicate links.
4. FastLeaderElection Core
External votes – votes received from other servers.
Internal vote – the server’s own current vote.
Election round – logical clock.
PK – comparison between internal and external votes.
The election proceeds through ten steps: increment election round, initialize vote, send initial vote, receive external votes, compare election rounds, PK votes, possibly change vote, archive received votes, count votes, and finally update server state.
These steps repeat until a leader is elected.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Architect's Tech Stack
Java backend, microservices, distributed systems, containerized programming, and more.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
