
Zookeeper Leader Election Explained: Cluster Architecture & Code Walkthrough

This article provides a comprehensive overview of Zookeeper's cluster deployment, explains the four server states, details the leader election process—including initialization, voting, and decision logic—and presents key source code snippets to help developers understand and implement Zookeeper's high‑availability mechanisms.

Ops Development Stories

Cluster Overview

Zookeeper is typically deployed in production as a cluster to ensure high availability. The following diagram shows a typical Zookeeper cluster deployment architecture.

Zookeeper cluster diagram

Each Zookeeper server node communicates with the leader node, storing data and log backups. The cluster is considered available only when a majority of nodes are up. This article focuses on Zookeeper 3.8.0 and analyzes the election process from the source code perspective.
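The majority rule can be made concrete with a small sketch. The class and method names below (QuorumSketch, quorumSize, hasQuorum) are illustrative and not ZooKeeper APIs; the sketch assumes a plain majority quorum in which every voting member has equal weight.

```java
// Illustrative sketch only: QuorumSketch is a hypothetical helper, not part of ZooKeeper.
final class QuorumSketch {
    // Smallest number of voting nodes that forms a strict majority.
    static int quorumSize(int votingMembers) {
        return votingMembers / 2 + 1;
    }

    // True when the live nodes form a strict majority of the ensemble.
    static boolean hasQuorum(int votingMembers, int liveMembers) {
        return liveMembers > votingMembers / 2;
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 6; n++) {
            System.out.println(n + " nodes: quorum = " + quorumSize(n)
                + ", tolerated failures = " + (n - quorumSize(n)));
        }
    }
}
```

Under this assumption a five-node cluster tolerates two failures, and a six-node cluster also tolerates only two, which is why odd cluster sizes are the usual choice.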

Cluster Node States

The node states are defined in the QuorumPeer#ServerState enum, which includes four states: LOOKING, FOLLOWING, LEADING, and OBSERVING.

public enum ServerState {
    // Looking for leader; no leader currently in the cluster.
    LOOKING,
    // Follower state; the server acts as a follower.
    FOLLOWING,
    // Leader state; the server acts as the leader.
    LEADING,
    // Observer state; the server acts as an observer.
    OBSERVING
}

Leader Election Process

Startup and Initialization

QuorumPeerMain is Zookeeper's entry class. Startup begins in its main method.

// Simplified core code (exception handling omitted)
public static void main(String[] args) {
    QuorumPeerMain main = new QuorumPeerMain();
    main.initializeAndRun(args);
}

protected void initializeAndRun(String[] args) throws ConfigException, IOException, AdminServerException {
    // Parse the configuration file passed as the single argument
    QuorumPeerConfig config = new QuorumPeerConfig();
    if (args.length == 1) {
        config.parse(args[0]);
    }
    // Cluster mode startup
    if (args.length == 1 && config.isDistributed()) {
        runFromConfig(config);
    } else {
        // other modes
    }
}

public void runFromConfig(QuorumPeerConfig config) throws IOException, AdminServerException {
    // Create the quorum peer (configuration of the peer omitted)
    quorumPeer = getQuorumPeer();
    // Start quorum peer
    quorumPeer.start();
}

The QuorumPeer thread runs QuorumPeer#run(), which continuously checks the cluster state and decides whether to perform an election or synchronize data.

@Override
public void run() {
    try {
        while (running) {
            switch (getPeerState()) {
                case LOOKING:
                    // No leader known: run an election and record the resulting vote
                    setCurrentVote(makeLEStrategy().lookForLeader());
                    break;
                case OBSERVING:
                    setObserver(makeObserver(logFactory));
                    observer.observeLeader();
                    break;
                case FOLLOWING:
                    setFollower(makeFollower(logFactory));
                    follower.followLeader();
                    break;
                case LEADING:
                    setLeader(makeLeader(logFactory));
                    leader.lead();
                    setLeader(null);
                    break;
            }
        }
    } finally {
        // cleanup
    }
}

Conduct Election

The core election class is FastLeaderElection, which handles vote creation and processing.

public Vote lookForLeader() throws InterruptedException {
    // Create a vote box for the current election round
    Map<Long, Vote> recvset = new HashMap<>();
    // Box for votes from an existing leader
    Map<Long, Vote> outofelection = new HashMap<>();
    int notTimeout = minNotificationInterval;
    synchronized (this) {
        // Increment local election epoch
        logicalclock.incrementAndGet();
        // Vote for self
        updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
    }
    // Broadcast vote
    sendNotifications();
    SyncedLearnerTracker voteSet = null;
    while (self.getPeerState() == ServerState.LOOKING && (!stop)) {
        // Receive vote from network layer
        Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);
        if (n == null) {
            // Nothing arrived in time; the full source resends notifications and backs off here
            continue;
        }
        // Only the branch for a sender that is itself LOOKING is shown
        if (n.electionEpoch > logicalclock.get()) {
            // The sender is in a newer round: adopt its epoch and discard stale votes
            logicalclock.set(n.electionEpoch);
            recvset.clear();
            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                updateProposal(n.leader, n.zxid, n.peerEpoch);
            } else {
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }
            sendNotifications();
        } else if (n.electionEpoch < logicalclock.get()) {
            // Vote from an older round: ignore it
            continue;
        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
            // Same round and the received vote is better: adopt and rebroadcast it
            updateProposal(n.leader, n.zxid, n.peerEpoch);
            sendNotifications();
        }
        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
        // Majority logic
        voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));
        if (voteSet.hasAllQuorums()) {
            // Simplified: the full source first waits briefly for a better late vote
            setPeerState(proposedLeader, voteSet);
            Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
            leaveInstance(endVote);
            return endVote;
        }
    }
    return null;
}

The method totalOrderPredicate implements the vote comparison logic.

protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
    // A candidate with zero voting weight can never win
    if (self.getQuorumVerifier().getWeight(newId) == 0) {
        return false;
    }
    return (newEpoch > curEpoch)
        || (newEpoch == curEpoch && (newZxid > curZxid || (newZxid == curZxid && newId > curId)));
}

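The vote comparison proceeds as follows:

Compare the epochs (peerEpoch) of the two votes; the higher epoch wins.

If epochs are equal, compare zxid; the larger zxid wins.

If both are equal, compare server IDs; the larger ID wins.

A candidate with zero voting weight never wins the comparison. Whenever a node changes its proposal, it calls sendNotifications() to broadcast the new vote to the other nodes.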
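The ordering described above can also be expressed as a chained comparator, which makes the priority of each field explicit. This is a minimal sketch: VoteOrderSketch, Proposal, and beats are hypothetical names introduced for illustration, and the weight check from totalOrderPredicate is deliberately left out.

```java
import java.util.Comparator;

// Illustrative sketch: these types are hypothetical and only mirror the
// comparison order (epoch, then zxid, then server id). The voting-weight
// check performed by ZooKeeper is not modelled here.
final class VoteOrderSketch {
    static final class Proposal {
        final long leaderId;
        final long zxid;
        final long peerEpoch;

        Proposal(long leaderId, long zxid, long peerEpoch) {
            this.leaderId = leaderId;
            this.zxid = zxid;
            this.peerEpoch = peerEpoch;
        }
    }

    // Epoch has the highest priority, the server id the lowest.
    static final Comparator<Proposal> ORDER =
        Comparator.comparingLong((Proposal p) -> p.peerEpoch)
                  .thenComparingLong(p -> p.zxid)
                  .thenComparingLong(p -> p.leaderId);

    // True when 'candidate' should replace 'current' as the proposed leader.
    static boolean beats(Proposal candidate, Proposal current) {
        return ORDER.compare(candidate, current) > 0;
    }

    public static void main(String[] args) {
        Proposal current = new Proposal(3, 9, 1);
        // A higher epoch wins even with a smaller zxid and a smaller id.
        System.out.println(beats(new Proposal(1, 5, 2), current));
        // An identical proposal does not replace the current one.
        System.out.println(beats(new Proposal(3, 9, 1), current));
    }
}
```

Because the comparison is strict, two identical proposals never displace each other, so a node does not rebroadcast when it receives a vote equal to its own.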

Process Summary

To summarize: QuorumPeerMain starts the QuorumPeer thread, a node in the LOOKING state runs FastLeaderElection#lookForLeader, and a vote is confirmed once a majority of nodes agree on the same proposal.

Voting Process

During voting, a larger zxid increases the chance of becoming leader because it indicates more up‑to‑date data, reducing synchronization overhead. The diagram below shows a five‑node election scenario where server3 (zxid = 9) ultimately becomes leader.

Zookeeper election example
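The five-node scenario can be approximated with a small simulation. This is a sketch under stated assumptions: only "server3 has zxid 9 and wins" comes from the text; the zxid values of the other servers are invented for the example, all servers are assumed to share the same epoch, and votes are exchanged in synchronous rounds, whereas the real protocol is asynchronous. FiveNodeElectionSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative simulation, not ZooKeeper code.
final class FiveNodeElectionSketch {
    // Assumed zxid per server id; only server3's value (9) is taken from the article.
    static Map<Long, Long> exampleZxids() {
        Map<Long, Long> zxids = new HashMap<>();
        zxids.put(1L, 8L);
        zxids.put(2L, 8L);
        zxids.put(3L, 9L);
        zxids.put(4L, 7L);
        zxids.put(5L, 8L);
        return zxids;
    }

    // With equal epochs, the larger zxid wins; ties go to the larger server id.
    static boolean better(long a, long b, Map<Long, Long> zxidBySid) {
        long za = zxidBySid.get(a);
        long zb = zxidBySid.get(b);
        return za > zb || (za == zb && a > b);
    }

    // Returns the elected server id, or -1 if no candidate reaches a majority.
    static long elect(Map<Long, Long> zxidBySid) {
        Map<Long, Long> voteBySid = new HashMap<>();
        for (long sid : zxidBySid.keySet()) {
            voteBySid.put(sid, sid); // every server first votes for itself
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            // Snapshot of the votes broadcast in this round.
            List<Long> broadcast = new ArrayList<>(voteBySid.values());
            for (long voter : zxidBySid.keySet()) {
                long best = voteBySid.get(voter);
                for (long heard : broadcast) {
                    if (better(heard, best, zxidBySid)) {
                        best = heard;
                    }
                }
                if (best != voteBySid.get(voter)) {
                    voteBySid.put(voter, best); // adopt the better proposal
                    changed = true;
                }
            }
        }
        // Count the final votes and look for a strict majority.
        Map<Long, Integer> tally = new HashMap<>();
        for (long vote : voteBySid.values()) {
            tally.merge(vote, 1, Integer::sum);
        }
        for (Map.Entry<Long, Integer> e : tally.entrySet()) {
            if (e.getValue() > zxidBySid.size() / 2) {
                return e.getKey();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        System.out.println("leader = server" + elect(exampleZxids()));
    }
}
```

With the assumed values, every server switches its vote to server3 after the first exchange, so server3 collects all five votes. If all zxids were equal, the server with the largest id would win instead.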

Multi‑layer Network Architecture

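Zookeeper separates the transport layer from the election logic. SendWorker and RecvWorker (inner classes of QuorumCnxManager) move raw network packets, while WorkerSender and WorkerReceiver (inside FastLeaderElection's Messenger) convert between those packets and election-level messages. Queues connect the two layers.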

Zookeeper network architecture
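The layering can be illustrated with two queues: one carrying raw bytes and one carrying decoded votes. This is a minimal single-threaded sketch, not ZooKeeper's implementation: LayeredMessagingSketch, VoteMessage, and the three-field byte layout are all assumptions made for the example, and the real classes run as separate threads with a different wire format.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch: all names and the byte layout are hypothetical.
final class LayeredMessagingSketch {
    // Election-level message, as the voting logic would see it.
    static final class VoteMessage {
        final long leader;
        final long zxid;
        final long electionEpoch;

        VoteMessage(long leader, long zxid, long electionEpoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.electionEpoch = electionEpoch;
        }
    }

    // Transport side: raw packets, which a RecvWorker-like component would enqueue.
    final BlockingQueue<ByteBuffer> transportQueue = new LinkedBlockingQueue<>();
    // Election side: decoded votes, which the election loop would poll.
    final BlockingQueue<VoteMessage> electionQueue = new LinkedBlockingQueue<>();

    // Assumed wire format for this sketch: three consecutive longs.
    static ByteBuffer encode(VoteMessage m) {
        ByteBuffer buf = ByteBuffer.allocate(3 * Long.BYTES);
        buf.putLong(m.leader).putLong(m.zxid).putLong(m.electionEpoch);
        buf.flip(); // switch the buffer from writing to reading
        return buf;
    }

    static VoteMessage decode(ByteBuffer buf) {
        // Java evaluates arguments left to right, so fields are read in write order.
        return new VoteMessage(buf.getLong(), buf.getLong(), buf.getLong());
    }

    // Transport layer: accepts bytes without interpreting them.
    void onBytesReceived(ByteBuffer raw) {
        transportQueue.add(raw);
    }

    // Election-side worker step: decode one packet, if any, and hand it on.
    boolean pumpOnce() {
        ByteBuffer raw = transportQueue.poll();
        if (raw == null) {
            return false;
        }
        electionQueue.add(decode(raw));
        return true;
    }

    public static void main(String[] args) {
        LayeredMessagingSketch layers = new LayeredMessagingSketch();
        layers.onBytesReceived(encode(new VoteMessage(3, 9, 1)));
        layers.pumpOnce();
        VoteMessage vote = layers.electionQueue.poll();
        System.out.println("vote for server" + vote.leader + " with zxid " + vote.zxid);
    }
}
```

Keeping the queues between the layers means the election logic never touches sockets, and the transport code never needs to understand what a vote is.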

Leader Election Source Flow

Combining the above analysis, a detailed flowchart of Zookeeper's startup and election process can be constructed to aid developers in understanding the source code.

Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
