How Kafka Chooses Its Partition Leaders: ZAB, Raft, and Controller Election Explained
This article explains the leader election mechanisms used in big-data systems: ZAB in ZooKeeper and Raft's role-based election. It then covers the drawbacks of electing every leader directly through ZooKeeper (split-brain, the herd effect, and ZooKeeper overload) and shows how Kafka's controller-based design avoids them when selecting partition leaders.
Common Leader Election Mechanisms in Big Data
ZAB (used by ZooKeeper)
ZAB proceeds through four phases:
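Leader election – nodes exchange votes until one candidate is backed by a quorum (a majority of the ensemble).
Discovery (epoch establishment) – the elected leader collects the latest state from its followers and establishes a new epoch.
Synchronization – the leader brings its followers' logs in line with its own.
Broadcast – the leader starts serving client requests, proposing each write to the followers and committing it once a quorum acknowledges.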
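Example with three nodes (IDs 1, 2, 3), all starting with empty logs: node 1 starts and votes for itself, but one vote out of three is not a majority, so it keeps looking for a leader. When node 2 starts, the two nodes exchange votes. Their logs are equally up to date, so the tie is broken by server ID and node 1 switches its vote to node 2. Node 2 now holds two of three votes, which is a majority, and becomes leader. Node 3 later discovers that a leader backed by a majority already exists and joins the cluster as a follower of node 2, even though its own ID is higher.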
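The vote comparison behind this example can be sketched in a few lines. This is a simplified model, not ZooKeeper's actual code: the `Vote` class and `elect` function are hypothetical names, and the sketch covers only a single election round among nodes that are all still looking for a leader. It assumes votes are ordered by epoch first, then by last transaction ID (zxid), then by server ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Vote:
    # Field order matters: the generated ordering compares epoch,
    # then zxid, then server_id, in that sequence.
    epoch: int
    zxid: int
    server_id: int


def elect(online_votes, cluster_size):
    """Return the winning server id, or None if no quorum is online."""
    quorum = cluster_size // 2 + 1
    if len(online_votes) < quorum:
        return None  # not enough voters: everyone keeps looking
    # Every online node ends up adopting the highest vote it has seen.
    return max(online_votes).server_id


# Three fresh nodes with empty logs (hypothetical values).
fresh = [Vote(epoch=0, zxid=0, server_id=i) for i in (1, 2, 3)]

print(elect(fresh[:1], cluster_size=3))  # None: node 1 alone has no majority
print(elect(fresh[:2], cluster_size=3))  # 2: equal logs, higher id wins
```

Because the log position is compared before the server ID, a node with a lower ID but a more recent log would win instead; the ID is only a tie-breaker.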
Raft
Raft defines three server roles:
Leader – handles all client requests and replicates log entries to followers.
Follower – passive replica that receives log entries from the leader.
Candidate – a server that initiates an election when it does not hear from a valid leader; it becomes leader after obtaining votes from a majority of the cluster.
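Raft is a consensus algorithm in the same family as Paxos. As in ZAB, a leader must be backed by a majority, but the election itself works differently: each follower waits for a randomized election timeout, and the first one to time out increments its term, becomes a candidate, and requests votes. A server grants at most one vote per term, and only to a candidate whose log is at least as up to date as its own.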
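A minimal sketch of how a Raft server might decide whether to grant its vote, following the rules in the Raft paper. The class and function names (`RaftNode`, `handle_request_vote`, `run_election`) are illustrative, and the sketch leaves out timers, networking, and log replication.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RaftNode:
    node_id: int
    current_term: int = 0
    voted_for: Optional[int] = None
    last_log_term: int = 0
    last_log_index: int = 0

    def handle_request_vote(self, term, candidate_id, last_log_index, last_log_term):
        """Return True if this node grants its vote to the candidate."""
        if term < self.current_term:
            return False  # stale candidate
        if term > self.current_term:
            # A newer term resets any vote cast in an older term.
            self.current_term = term
            self.voted_for = None
        # The candidate's log must be at least as up to date as ours.
        log_ok = (last_log_term, last_log_index) >= (self.last_log_term, self.last_log_index)
        if self.voted_for in (None, candidate_id) and log_ok:
            self.voted_for = candidate_id
            return True
        return False


def run_election(candidate, peers):
    """Candidate starts a new term and wins if a majority grants votes."""
    candidate.current_term += 1
    candidate.voted_for = candidate.node_id
    votes = 1  # the candidate votes for itself
    for peer in peers:
        if peer.handle_request_vote(candidate.current_term, candidate.node_id,
                                    candidate.last_log_index, candidate.last_log_term):
            votes += 1
    return votes >= (len(peers) + 1) // 2 + 1


# Node a has an empty log; b and c hold entries from term 1 (hypothetical state).
a = RaftNode(1)
b = RaftNode(2, last_log_term=1, last_log_index=3)
c = RaftNode(3, last_log_term=1, last_log_index=3)
print(run_election(a, [b, c]))  # False: a's log is behind both peers
print(run_election(b, [a, c]))  # True: b's log is at least as up to date
```

The log check is what keeps a server with missing entries from becoming leader, which is the main difference from a purely ID-based tie-break.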
Drawbacks of Traditional ZooKeeper‑Based Election
Split‑brain
ZooKeeper delivers watch events in order, but it does not guarantee that every client sees an update at the same moment. Network latency can leave replicas with different views of the same partition's state, so more than one replica may believe it is the leader (split-brain), which breaks consistency.
Herd Effect
If a broker that hosts many partitions crashes, a large number of watches fire at once, generating massive re‑election traffic and possible network congestion.
ZooKeeper Overload
Each replica registers a watch for every partition. In clusters with thousands of partitions the total number of watches can overwhelm ZooKeeper, degrading its performance.
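To get a feel for the scale, here is a rough back-of-the-envelope model of the herd effect and the watch load. Everything in it is an assumption made for illustration: the helper names, the round-robin placement, the cluster size, and the simplification that each replica holds exactly one watch per partition and that the first replica in each list is the leader.

```python
def build_assignment(num_partitions, num_brokers, replication_factor):
    """Round-robin placement; the first replica in each list is the leader."""
    return {
        p: [(p + i) % num_brokers for i in range(replication_factor)]
        for p in range(num_partitions)
    }


def watch_load(assignment, failed_broker):
    """Return (total watches held, watches fired when failed_broker dies)."""
    # One watch per replica per partition.
    total_watches = sum(len(replicas) for replicas in assignment.values())
    # Surviving replicas of every partition led by the failed broker are notified.
    fired = sum(
        len(replicas) - 1
        for replicas in assignment.values()
        if replicas[0] == failed_broker
    )
    return total_watches, fired


# Hypothetical cluster: 1000 partitions, 5 brokers, replication factor 3.
assignment = build_assignment(1000, 5, 3)
total, fired = watch_load(assignment, failed_broker=0)
print(total, fired)  # 3000 400
```

Both numbers grow linearly with the partition count, which is why the per-partition approach struggles on large clusters.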
Kafka’s Leader Election Design
Overall Advantage
Kafka elects a single **controller** broker that centrally decides the leader for every partition. The controller notifies the affected brokers via RPC, eliminating per‑partition ZooKeeper watches and avoiding split‑brain, herd effect, and ZooKeeper overload.
Controller Election
Every broker attempts to create an *ephemeral* ZNode at the path /controller and sets a one‑time watch on that node. ZooKeeper guarantees that only one creation succeeds, and the broker that succeeds becomes the controller. When the current controller crashes, its session ends and its ephemeral node disappears, the watch fires, and every live broker attempts to create a new /controller node. Again only one broker succeeds and becomes the new controller. Brokers that fail to create the node re‑register the watch for future failures.
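The election race can be simulated without a real ZooKeeper ensemble. In the sketch below, `FakeZooKeeper` is a hypothetical in-memory stand-in that models only the two behaviours the election relies on: creating a node fails if it already exists, and watches fire once when the node is deleted. It is not the ZooKeeper client API, and `Broker` is likewise an illustrative class rather than Kafka code.

```python
class FakeZooKeeper:
    """In-memory stand-in (assumption): ephemeral nodes plus one-time watches."""

    def __init__(self):
        self.nodes = {}     # path -> id of the broker that owns the node
        self.watchers = {}  # path -> list of one-time callbacks

    def create_ephemeral(self, path, owner):
        if path in self.nodes:
            return False  # node already exists: creation fails
        self.nodes[path] = owner
        return True

    def watch(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)

    def session_expired(self, owner):
        """Delete the owner's ephemeral nodes and fire their watches once."""
        for path in [p for p, o in self.nodes.items() if o == owner]:
            del self.nodes[path]
            # One-time semantics: clear the watch list before firing.
            for callback in self.watchers.pop(path, []):
                callback()


class Broker:
    def __init__(self, broker_id, zk):
        self.broker_id = broker_id
        self.zk = zk
        self.alive = True
        self.is_controller = False

    def elect(self):
        if not self.alive:
            return
        self.is_controller = self.zk.create_ephemeral("/controller", self.broker_id)
        if not self.is_controller:
            # Lost the race: (re-)register the watch for the next failure.
            self.zk.watch("/controller", self.elect)

    def crash(self):
        self.alive = False
        self.is_controller = False
        self.zk.session_expired(self.broker_id)


zk = FakeZooKeeper()
brokers = [Broker(i, zk) for i in (1, 2, 3)]
for broker in brokers:
    broker.elect()
print(zk.nodes["/controller"])  # 1: the first broker to create the node wins

brokers[0].crash()
print(zk.nodes["/controller"])  # 2: the first surviving broker to react wins
```

In this simulation the winner is decided by callback order. In a real cluster it is whichever broker's create request ZooKeeper processes first.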
Partition Leader Selection
The controller performs the following steps for each partition:
Read the current ISR (in‑sync replica) set from ZooKeeper.
Apply the configured leader‑selection algorithm (e.g., PreferredReplica) to choose the leader from that set.
Send an RPC to the chosen broker announcing it as the new leader and inform the other replicas of the change.
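Kafka ships with several built‑in selection algorithms (PreferredReplica, etc.). They share the same basic rule, which is to pick a live replica from the ISR, and differ mainly in which candidate they favor. PreferredReplica, for example, moves leadership back to the preferred replica, the first broker in the partition's replica assignment.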
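As a rough illustration of the selection step, the sketch below picks the first assigned replica that is both alive and in the ISR, then applies that rule to every partition led by a failed broker. The function names and the dictionary layout are assumptions made for this example and do not mirror Kafka's internal classes. The sketch also skips the ZooKeeper write and the RPC notifications.

```python
def select_leader(assigned_replicas, isr, live_brokers):
    """Pick the first assigned replica that is both in the ISR and alive."""
    for broker in assigned_replicas:
        if broker in isr and broker in live_brokers:
            return broker
    return None  # no eligible replica: the partition stays offline


def on_broker_failure(partitions, failed_broker, live_brokers):
    """Return {partition: new_leader} for partitions led by failed_broker."""
    live = set(live_brokers) - {failed_broker}
    changes = {}
    for name, state in partitions.items():
        if state["leader"] != failed_broker:
            continue  # leadership of this partition is unaffected
        new_isr = [b for b in state["isr"] if b in live]
        new_leader = select_leader(state["replicas"], new_isr, live)
        state["leader"], state["isr"] = new_leader, new_isr
        changes[name] = new_leader
    return changes


# Hypothetical cluster state: topic "t" with three partitions.
partitions = {
    "t-0": {"replicas": [1, 2, 3], "isr": [1, 2, 3], "leader": 1},
    "t-1": {"replicas": [2, 3, 1], "isr": [2, 3], "leader": 2},
    "t-2": {"replicas": [1, 3, 2], "isr": [1], "leader": 1},
}
print(on_broker_failure(partitions, failed_broker=1, live_brokers={1, 2, 3}))
# {'t-0': 2, 't-2': None}
```

Partition t-2 ends up without a leader because broker 1 was the only member of its ISR, which shows why the ISR, rather than the full replica list, is the starting point for the choice.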
Key Operational Details
ZooKeeper watches used for controller election are **one‑time**; after firing they must be re‑registered.
Because leader election for partitions is handled by the controller, no per‑partition watches are created, eliminating the herd effect.
The controller notifies only the brokers affected by a leadership change, directly over RPC, which is more efficient than having every replica wait on ZooKeeper watch events.
JavaEdge
First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.
