Zookeeper Leader Election Explained: Cluster Architecture & Code Walkthrough
This article provides a comprehensive overview of Zookeeper's cluster deployment, explains the four server states, details the leader election process—including initialization, voting, and decision logic—and presents key source code snippets to help developers understand and implement Zookeeper's high‑availability mechanisms.
Cluster Overview
Zookeeper is typically deployed in production as a cluster to ensure high availability. The following diagram shows a typical Zookeeper cluster deployment architecture.
Each Zookeeper server node communicates with the leader node, storing data and log backups. The cluster is considered available only when a majority of nodes are up. This article focuses on Zookeeper 3.8.0 and analyzes the election process from the source code perspective.
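The majority rule can be made concrete with a small sketch. It assumes the default majority quorum, where every voting server counts equally and observers are not counted. The class and method names below are hypothetical and are not part of Zookeeper's API.

```java
// Minimal sketch of the majority rule; names are hypothetical, not Zookeeper API.
class QuorumSketch {

    /** Smallest number of voting servers that forms a strict majority. */
    static int majoritySize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** True when strictly more than half of the voting servers are up. */
    static boolean hasQuorum(int serversUp, int ensembleSize) {
        return serversUp > ensembleSize / 2;
    }

    public static void main(String[] args) {
        System.out.println(majoritySize(5));  // 3
        System.out.println(hasQuorum(3, 5));  // true
        System.out.println(hasQuorum(2, 5));  // false
        System.out.println(hasQuorum(2, 4));  // false: exactly half is not a majority
    }
}
```

Under this rule a five-node cluster tolerates two failed nodes, while a four-node cluster tolerates only one, which is why ensembles are usually sized with an odd number of voting servers.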
Cluster Node States
The node states are defined in the QuorumPeer#ServerState enum, which includes four states: LOOKING, FOLLOWING, LEADING, and OBSERVING.
```java
public enum ServerState {
    // Looking for leader; no leader currently in the cluster.
    LOOKING,
    // Follower state; the server acts as a follower.
    FOLLOWING,
    // Leader state; the server acts as the leader.
    LEADING,
    // Observer state; the server acts as an observer.
    OBSERVING
}
```
Leader Election Process
Startup and Initialization
QuorumPeerMain is Zookeeper's entry class. Startup begins in its main method.
```java
// Simplified core code (exception handling omitted)
public static void main(String[] args) {
    QuorumPeerMain main = new QuorumPeerMain();
    main.initializeAndRun(args);
}

protected void initializeAndRun(String[] args) throws ConfigException, IOException, AdminServerException {
    // Parse the configuration file passed as the single argument
    QuorumPeerConfig config = new QuorumPeerConfig();
    if (args.length == 1) {
        config.parse(args[0]);
    }
    // Cluster mode startup
    if (args.length == 1 && config.isDistributed()) {
        runFromConfig(config);
    } else {
        // other modes
    }
}

public void runFromConfig(QuorumPeerConfig config) throws IOException, AdminServerException {
    // ... quorumPeer is created and configured from config here (omitted) ...
    // Start quorum peer
    quorumPeer.start();
}
```
The QuorumPeer thread runs QuorumPeer#run(), which continuously checks the node's current state and decides whether to run an election or synchronize data.
```java
@Override
public void run() {
    try {
        while (running) {
            switch (getPeerState()) {
            case LOOKING:
                // Run an election; the returned vote names the elected leader
                setCurrentVote(makeLEStrategy().lookForLeader());
                break;
            case OBSERVING:
                setObserver(makeObserver(logFactory));
                observer.observeLeader();
                break;
            case FOLLOWING:
                setFollower(makeFollower(logFactory));
                follower.followLeader();
                break;
            case LEADING:
                setLeader(makeLeader(logFactory));
                leader.lead();
                setLeader(null);
                break;
            }
        }
    } finally {
        // cleanup
    }
}
```
Conduct Election
The core election class is FastLeaderElection, which handles vote creation and processing.
```java
// Simplified core code
public Vote lookForLeader() throws InterruptedException {
    // Create a vote box for the current election round
    Map<Long, Vote> recvset = new HashMap<>();
    // Box for votes from an existing leader (its handling is omitted in this excerpt)
    Map<Long, Vote> outofelection = new HashMap<>();
    int notTimeout = minNotificationInterval;
    synchronized (this) {
        // Increment local election epoch
        logicalclock.incrementAndGet();
        // Vote for self
        updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
    }
    // Broadcast vote
    sendNotifications();
    SyncedLearnerTracker voteSet = null;
    while (self.getPeerState() == ServerState.LOOKING && (!stop)) {
        // Receive vote from network layer; n must be read before it is inspected
        Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);
        if (n == null) {
            // Nothing arrived in time; resend and back-off logic omitted
            continue;
        }
        if (n.electionEpoch > logicalclock.get()) {
            // The sender is in a newer round: adopt it and drop votes from the old round
            logicalclock.set(n.electionEpoch);
            recvset.clear();
            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                updateProposal(n.leader, n.zxid, n.peerEpoch);
            } else {
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }
            sendNotifications();
        } else if (n.electionEpoch < logicalclock.get()) {
            // Stale vote from an older round: ignore it
            continue;
        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
            // Same round and the received vote is better: adopt it
            updateProposal(n.leader, n.zxid, n.peerEpoch);
            sendNotifications();
        }
        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
        // Majority logic
        voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));
        if (voteSet.hasAllQuorums()) {
            // Simplified: the real code first waits briefly for a better vote
            setPeerState(proposedLeader, voteSet);
            Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
            leaveInstance(endVote);
            return endVote;
        }
    }
    return null;
}
```
The method totalOrderPredicate implements the vote comparison logic.
```java
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
    // A candidate with zero quorum weight can never win
    if (self.getQuorumVerifier().getWeight(newId) == 0) {
        return false;
    }
    return (newEpoch > curEpoch)
        || (newEpoch == curEpoch && (newZxid > curZxid || (newZxid == curZxid && newId > curId)));
}
```
Votes are compared in the following order:
Compare the peer epochs carried in the two votes; the higher epoch wins.
If epochs are equal, compare zxid; the larger zxid wins.
If both are equal, compare server IDs; the larger ID wins.
Whenever a node changes its proposal, it calls sendNotifications() to broadcast the new vote to the other nodes.
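The three comparison steps can be tried out in isolation with a self-contained sketch. It deliberately leaves out the quorum weight check from totalOrderPredicate, and the class and method names are hypothetical.

```java
// Standalone sketch of the vote ordering: epoch first, then zxid, then server ID.
// Hypothetical names; the weight check in the real method is intentionally omitted.
class VoteOrderSketch {

    /** Returns true when the new vote should replace the current one. */
    static boolean supersedes(long newId, long newZxid, long newEpoch,
                              long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;   // step 1: higher epoch wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;     // step 2: larger zxid wins
        }
        return newId > curId;             // step 3: larger server ID wins
    }

    public static void main(String[] args) {
        System.out.println(supersedes(1, 5, 2, 3, 9, 1)); // true: higher epoch beats larger zxid
        System.out.println(supersedes(1, 9, 1, 3, 8, 1)); // true: same epoch, larger zxid
        System.out.println(supersedes(3, 9, 1, 1, 9, 1)); // true: tie broken by server ID
        System.out.println(supersedes(1, 9, 1, 1, 9, 1)); // false: identical votes
    }
}
```

Because the order is strict, two identical votes never replace each other, so a node does not rebroadcast when it receives its own proposal back.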
Process Summary
This article outlined Zookeeper's startup, leader election, and vote confirmation processes. Together they let the cluster agree on a single leader as long as a majority of nodes can exchange votes.
Voting Process
During voting, a larger zxid increases the chance of becoming leader because it indicates more up‑to‑date data, reducing synchronization overhead. The diagram below shows a five‑node election scenario where server3 (zxid = 9) ultimately becomes leader.
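A five-node scenario of this kind can be approximated with a short sketch. Only server3's zxid of 9 comes from the article; the other zxid values, the shared epoch, and all class names are assumptions made for illustration. The sketch also assumes every node eventually sees all five initial votes, which a real election does not require.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch of a five-node election outcome. All values except server3's zxid are hypothetical.
class FiveNodeElectionSketch {

    static final class Candidate {
        final long sid;
        final long zxid;
        final long epoch;

        Candidate(long sid, long zxid, long epoch) {
            this.sid = sid;
            this.zxid = zxid;
            this.epoch = epoch;
        }
    }

    // Same ordering as the vote comparison: epoch, then zxid, then server ID.
    static final Comparator<Candidate> VOTE_ORDER =
        Comparator.comparingLong((Candidate c) -> c.epoch)
                  .thenComparingLong(c -> c.zxid)
                  .thenComparingLong(c -> c.sid);

    /** The vote every node converges on once all initial votes have been exchanged. */
    static Candidate winner(List<Candidate> initialVotes) {
        return Collections.max(initialVotes, VOTE_ORDER);
    }

    public static void main(String[] args) {
        List<Candidate> votes = Arrays.asList(
            new Candidate(1, 8, 1),
            new Candidate(2, 8, 1),
            new Candidate(3, 9, 1),   // most up-to-date data
            new Candidate(4, 7, 1),
            new Candidate(5, 8, 1));  // largest ID, but a smaller zxid
        System.out.println("leader = server" + winner(votes).sid); // leader = server3
    }
}
```

With these values server5 has the largest ID but still loses, because the server ID only matters when epoch and zxid are tied.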
Multi‑layer Network Architecture
Zookeeper separates transport and business layers. SendWorker and RecvWorker handle network packets, while WorkerSender and WorkerReceiver process business‑level data.
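The layering can be pictured as queues handing data between the two levels. The sketch below is a single-threaded model under stated assumptions: the message type, queue names, and method names are hypothetical, an in-memory loopback stands in for the socket, and the real classes run as separate threads over blocking queues.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Queue;

// Single-threaded model of the business/transport split. All names are hypothetical.
class LayeredElectionMessagingSketch {

    /** Business-level message: server `sid` votes for `leader`. */
    static final class VoteMessage {
        final long sid;
        final long leader;

        VoteMessage(long sid, long leader) {
            this.sid = sid;
            this.leader = leader;
        }
    }

    final Queue<VoteMessage> sendQueue = new ArrayDeque<>();    // election logic -> business sender
    final Queue<ByteBuffer> outboundBytes = new ArrayDeque<>(); // business sender -> transport
    final Queue<ByteBuffer> inboundBytes = new ArrayDeque<>();  // transport -> business receiver
    final Queue<VoteMessage> recvQueue = new ArrayDeque<>();    // business receiver -> election logic

    /** Business layer (the role WorkerSender plays): serialize a vote into bytes. */
    void businessSend() {
        VoteMessage m = sendQueue.poll();
        if (m == null) {
            return;
        }
        ByteBuffer buf = ByteBuffer.allocate(2 * Long.BYTES);
        buf.putLong(m.sid).putLong(m.leader);
        buf.flip();
        outboundBytes.add(buf);
    }

    /** Transport layer (the roles SendWorker and RecvWorker play): move raw bytes only. */
    void transportLoopback() {
        ByteBuffer buf = outboundBytes.poll();
        if (buf != null) {
            inboundBytes.add(buf);
        }
    }

    /** Business layer (the role WorkerReceiver plays): parse bytes back into a vote. */
    void businessReceive() {
        ByteBuffer buf = inboundBytes.poll();
        if (buf == null) {
            return;
        }
        long sid = buf.getLong();
        long leader = buf.getLong();
        recvQueue.add(new VoteMessage(sid, leader));
    }

    public static void main(String[] args) {
        LayeredElectionMessagingSketch s = new LayeredElectionMessagingSketch();
        s.sendQueue.add(new VoteMessage(1L, 3L));
        s.businessSend();
        s.transportLoopback();
        s.businessReceive();
        VoteMessage m = s.recvQueue.poll();
        System.out.println("server" + m.sid + " votes for server" + m.leader); // server1 votes for server3
    }
}
```

The transport step never looks inside the buffer, which mirrors the point of the split: the election logic can change its message format without touching the code that manages connections.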
Leader Election Source Flow
Combining the above analysis, a detailed flowchart of Zookeeper's startup and election process can be constructed to aid developers in understanding the source code.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.