
How Kafka Chooses Its Partition Leaders: ZAB, Raft, and Controller Election Explained

This article explains the leader election mechanisms used in big-data systems: ZAB in ZooKeeper and Raft's role-based election. It then covers the drawbacks of running per-partition elections through ZooKeeper (split-brain, the herd effect, and ZooKeeper overload) and shows how Kafka's controller-based design addresses them with efficient partition leader selection.

JavaEdge

Common Leader Election Mechanisms in Big Data

ZAB (used by ZooKeeper)

ZAB proceeds through four phases:

Leader election – candidates compete for a majority.

Discovery (epoch establishment) – the elected leader establishes a new epoch.

Synchronization – the leader synchronizes its log with followers.

Broadcast – the leader accepts client writes, broadcasts them to followers as proposals, and commits each one once a quorum has acknowledged it.

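Example with three nodes (IDs 1, 2, 3) starting with empty logs: node 1 starts first and votes for itself, but one vote is not a majority of three, so it stays in the looking state. When node 2 starts, the two nodes exchange votes. Their epochs and transaction IDs are equal, so the higher server ID breaks the tie: node 1 switches its vote to node 2, which now holds two of three votes and becomes leader. Node 3 starts later, finds that nodes 1 and 2 already form a quorum, and joins the cluster as a follower of leader 2.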
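The tie-break in this example can be sketched in a few lines of Python. This is a simplified model, not ZooKeeper's actual implementation: the `Vote` class and `elect` function are hypothetical names, and the sketch assumes votes are compared by epoch, then transaction ID (zxid), then server ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    epoch: int   # election epoch of the proposed leader
    zxid: int    # last transaction ID the proposed leader has logged
    sid: int     # server ID of the proposed leader

    def key(self):
        # Compared in this order: epoch, then zxid, then server ID.
        return (self.epoch, self.zxid, self.sid)


def elect(live_votes, cluster_size):
    """Return the winning server ID, or None if no quorum is reachable."""
    quorum = cluster_size // 2 + 1
    if len(live_votes) < quorum:
        return None  # every live node stays in the looking state
    # Every live node converges on the best vote it has seen.
    return max(live_votes, key=Vote.key).sid


# Fresh cluster: all epochs and zxids are equal, so the server ID decides.
node1, node2 = Vote(1, 0, 1), Vote(1, 0, 2)

print(elect([node1], cluster_size=3))          # None: one vote is not a majority of 3
print(elect([node1, node2], cluster_size=3))   # 2: quorum reached, higher ID wins
```

The sketch does not model a late joiner such as node 3, which adopts the established leader instead of triggering a new election.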

Raft

Raft defines three server roles:

Leader – handles all client requests and replicates log entries to followers.

Follower – passive replica that receives log entries from the leader.

Candidate – a server that initiates an election when it does not hear from a valid leader; it becomes leader after obtaining votes from a majority of the cluster.

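Like ZAB, Raft belongs to the family of majority-based consensus protocols that began with Paxos, but its election works differently from the ZAB example above. Time is divided into numbered terms. A follower whose randomized election timeout expires increments its term, becomes a candidate, votes for itself, and requests votes from the other servers. Each server grants at most one vote per term, and only to a candidate whose log is at least as up to date as its own. There is no tie-break on server ID; a split vote is resolved by timing out and retrying in a new term.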
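A minimal sketch of Raft's vote-granting rule follows, assuming a single-process simulation with no timers or networking. The names `RaftNode`, `handle_request_vote`, and `run_election` are illustrative and do not come from any particular Raft library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RaftNode:
    node_id: int
    current_term: int = 0
    voted_for: Optional[int] = None
    last_log_index: int = 0
    last_log_term: int = 0

    def handle_request_vote(self, term, candidate_id, last_log_index, last_log_term):
        """Return True if this node grants its vote to the candidate."""
        if term < self.current_term:
            return False  # stale candidate
        if term > self.current_term:
            # A newer term clears any vote cast in an older term.
            self.current_term = term
            self.voted_for = None
        # The candidate's log must be at least as up to date as ours.
        log_ok = (last_log_term, last_log_index) >= (self.last_log_term, self.last_log_index)
        if log_ok and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def run_election(candidate, peers):
    """Candidate starts a new term, votes for itself, and asks peers for votes."""
    candidate.current_term += 1
    candidate.voted_for = candidate.node_id
    votes = 1
    for peer in peers:
        if peer.handle_request_vote(candidate.current_term, candidate.node_id,
                                    candidate.last_log_index, candidate.last_log_term):
            votes += 1
    # Strict majority of the whole cluster (peers plus the candidate).
    return votes > (len(peers) + 1) // 2


a, b, c = RaftNode(1), RaftNode(2), RaftNode(3)
print(run_election(a, [b, c]))   # True: a collects all three votes in term 1
```

Once `c` has voted for node 1 in term 1, a second request for the same term from another candidate is refused, which is what prevents two leaders in one term.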

Drawbacks of Traditional ZooKeeper‑Based Election

Split‑brain

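ZooKeeper delivers watch notifications in order, but it does not guarantee that every client sees the same state at the same instant. Network latency can leave replicas with different views of the cluster, so two replicas may each conclude that they should be leader. The result is multiple leaders for one partition (split-brain), which breaks consistency.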

Herd Effect

If a broker that hosts many partitions crashes, a large number of watches fire at once, generating massive re‑election traffic and possible network congestion.

ZooKeeper Overload

Each replica registers a watch for every partition it hosts, so the total number of watches grows roughly as partitions × replication factor. In clusters with thousands of partitions, this volume can overwhelm ZooKeeper and degrade its performance.
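A rough back-of-the-envelope comparison makes the scale concrete. The model below is illustrative only: the function name and the cluster figures are made up, and it ignores the other watches a controller keeps on broker and topic metadata.

```python
def watch_counts(partitions, replication_factor, brokers):
    """Rough ZooKeeper watch counts under two designs (illustrative model only)."""
    # Per-partition election: every replica watches its own partition's leader node.
    per_partition = partitions * replication_factor
    # Controller-based election: each broker keeps one watch on /controller.
    controller_based = brokers
    return per_partition, controller_based


per_partition, controller_based = watch_counts(
    partitions=10_000, replication_factor=3, brokers=50
)
print(per_partition)      # 30000
print(controller_based)   # 50
```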

Kafka’s Leader Election Design

Overall Advantage

Kafka elects a single **controller** broker that centrally decides the leader for every partition. The controller notifies the affected brokers via RPC, eliminating per‑partition ZooKeeper watches and avoiding split‑brain, herd effect, and ZooKeeper overload.

Controller Election

Every broker tries to create the *ephemeral* znode /controller when it starts. ZooKeeper guarantees that only one creation succeeds; that broker becomes the controller, and the others set a one-time watch on the node. When the controller crashes, its session expires, the ephemeral node disappears, and the watch fires. Every live broker then tries to create /controller again, and the one that succeeds becomes the new controller. The brokers that fail re-register the watch for future failures.
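The race for /controller can be simulated without a real ZooKeeper ensemble. In the sketch below, `FakeZooKeeper` is a hypothetical in-memory stand-in that models only atomic ephemeral-node creation and one-time watches; it is not a ZooKeeper client API, and `Broker` is likewise an illustrative class.

```python
class FakeZooKeeper:
    """In-memory stand-in for the two ZooKeeper features used here (hypothetical)."""

    def __init__(self):
        self.nodes = {}      # path -> ID of the broker that owns the node
        self.watches = {}    # path -> list of one-time callbacks

    def create_ephemeral(self, path, owner):
        # Creation is atomic: only the first caller succeeds.
        if path in self.nodes:
            return False
        self.nodes[path] = owner
        return True

    def watch(self, path, callback):
        self.watches.setdefault(path, []).append(callback)

    def session_expired(self, owner):
        # Ephemeral nodes vanish with their owner's session.
        for path in [p for p, o in self.nodes.items() if o == owner]:
            del self.nodes[path]
            # Watches are one-time: remove them before firing.
            for callback in self.watches.pop(path, []):
                callback()


class Broker:
    def __init__(self, broker_id, zk):
        self.broker_id = broker_id
        self.zk = zk
        self.is_controller = False

    def try_become_controller(self):
        self.is_controller = self.zk.create_ephemeral("/controller", self.broker_id)
        if not self.is_controller:
            # Lost the race: (re-)register the one-time watch.
            self.zk.watch("/controller", self.try_become_controller)


zk = FakeZooKeeper()
brokers = [Broker(i, zk) for i in (1, 2, 3)]
for broker in brokers:
    broker.try_become_controller()
print(zk.nodes["/controller"])   # 1

zk.session_expired(1)            # the controller crashes
print(zk.nodes["/controller"])   # 2
```

In this simulation the winner is simply the first callback to run. In a real cluster the outcome depends on which broker's create request reaches ZooKeeper first.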

Partition Leader Selection

The controller performs the following steps for each partition:

Read the current ISR (in‑sync replica) set from ZooKeeper.

Apply the configured leader-selection algorithm (e.g., PreferredReplica) to choose the leader from the ISR.

Send an RPC to the chosen broker announcing it as the new leader and inform the other replicas of the change.

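Kafka ships with several built-in selection algorithms (PreferredReplica, etc.). They cover different triggers, such as a leader going offline, a partition reassignment, or a preferred-replica election, and they differ in which replica they favor among the eligible candidates.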
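The selection step can be sketched as a small function. This is a simplified illustration rather than Kafka's actual code: `select_leader` is a hypothetical helper, and it assumes the first entry in a partition's assigned replica list is the preferred leader.

```python
def select_leader(assigned_replicas, isr, live_brokers):
    """Pick the first assigned replica that is both in the ISR and alive.

    Hypothetical helper: the order of assigned_replicas stands in for the
    preference order, with the preferred leader listed first.
    """
    for broker_id in assigned_replicas:
        if broker_id in isr and broker_id in live_brokers:
            return broker_id
    return None  # no eligible replica: the partition stays offline


assigned = [1, 2, 3]   # broker 1 is the preferred leader

print(select_leader(assigned, isr={1, 2, 3}, live_brokers={1, 2, 3}))  # 1
print(select_leader(assigned, isr={2, 3}, live_brokers={2, 3}))        # 2
print(select_leader(assigned, isr={1}, live_brokers={2, 3}))           # None
```

Restricting candidates to the ISR is what keeps committed messages safe: a replica outside the ISR may be missing recent writes.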

Key Operational Details

ZooKeeper watches used for controller election are **one‑time**; after firing they must be re‑registered.

Because leader election for partitions is handled by the controller, no per‑partition watches are created, eliminating the herd effect.

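The controller notifies only the brokers affected by a change, directly over RPC, which is more efficient than having every replica wait for a ZooKeeper watch notification and then re-read the state.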

Written by

JavaEdge

First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.
