How Kafka Achieves High Availability: Replication & Leader Election Explained
This article explains why Kafka needs high availability, how data replication and leader election work together to keep partitions serving despite broker failures, and details the Zookeeper structures and broker failover process that enable Kafka's robust HA design.
Abstract
Before version 0.8 Kafka did not provide a High Availability (HA) mechanism; when one or more brokers crashed, all partitions on those brokers stopped serving, and permanent loss could cause data loss. One of Kafka's design goals is to ensure data persistence and high availability as cluster size grows.
Why Kafka Needs High Availability
Need for Replication
Without replication, a broker failure makes all its partitions unavailable, violating Kafka's persistence and delivery guarantees. In synchronous producer mode the client retries message.send.max.retries (default 3) before throwing an exception, potentially blocking data; in asynchronous mode the exception is logged and the client continues, leading to possible data loss.
Need for Leader Election
When replication is introduced, each partition has multiple replicas. One replica must be elected leader to handle all reads and writes while the others (followers) replicate data, simplifying consistency and reducing the complexity of N×N synchronization.
Kafka HA Design
Distributing Replicas Across the Cluster
Kafka strives to balance load by spreading partitions evenly across brokers and scattering replicas of the same partition across different machines. The algorithm is:
Sort all brokers (n brokers) and partitions.
Assign partition i to broker (i mod n).
Assign replica j of partition i to broker ((i + j) mod n).
Data Replication
Replication must address four problems: how to propagate messages, how many replicas must acknowledge before the producer receives an ACK, how to handle a non‑working replica, and how to recover a failed replica.
Propagate Messages
The producer discovers the partition leader via Zookeeper and sends the message only to that leader, which writes it to its local log. Each follower pulls the data from the leader, writes it to its own log, and immediately sends an ACK to the leader. Once the leader receives ACKs from all in‑sync replicas (ISR), the message is considered committed, the high‑watermark (HW) is advanced, and the leader replies to the producer. Consumers read only committed messages from the leader.
ACK Requirements
A broker is considered alive if it maintains a Zookeeper session and its follower stays within configured lag limits (either replica.lag.max.messages default 4000 or replica.lag.time.max.ms default 10000). The ISR list contains replicas that are in sync; lagging followers are removed from ISR.
Kafka uses an ISR‑based approach that is neither fully synchronous nor fully asynchronous, achieving a balance between durability and throughput. Producers can control acknowledgment behavior with request.required.acks, choosing to wait for all ISR replicas, a subset, or none.
Leader Election Algorithm
Kafka does not adopt a classic majority‑vote algorithm. Instead it uses a method similar to Microsoft’s PacificA, relying on the ISR set. When a leader fails, a new leader must contain all committed messages. The controller selects a new leader from the ISR (or any alive replica if ISR is empty) and updates the partition state via RPC, avoiding the split‑brain and herd‑effect problems of Zookeeper watches.
Handling All Replicas Down
If every replica of a partition is down, Kafka can either wait for an ISR replica to recover (favoring consistency) or immediately elect the first alive replica (favoring availability). Kafka 0.8 adopts the latter, allowing the partition to become available even if some data may be lost.
How to Elect a Leader
Followers set a watch on the leader’s Zookeeper node; when the node disappears, all followers attempt to create it, and the one that succeeds becomes the new leader. This method suffers from split‑brain, herd‑effect, and Zookeeper load issues. Kafka 0.8 therefore introduces a dedicated controller that performs leader election and partition reassignment, notifying brokers via RPC.
Zookeeper Structures for HA
Key znodes used by Kafka:
/admin/preferred_replica_election – JSON schema for preferred replica election.
/admin/reassign_partitions – JSON schema for moving partitions to new brokers.
/admin/delete_topics – JSON schema for deleting topics.
/brokers/ids/[brokerId] – stores live broker information (host, port, jmx_port).
/brokers/topics/[topic] – maps each partition to an ordered list of replica broker IDs.
/brokers/topics/[topic]/partitions/[partition]/state – contains version, ISR array, leader ID, controller_epoch, and leader_epoch.
/controller – current controller broker ID.
/controller_epoch – integer epoch of the controller.
Broker Failover Process
The controller watches broker znodes; when a broker disappears Zookeeper fires the watch.
The controller builds a set of all partitions that were hosted on the failed broker.
For each partition the controller reads the current ISR, selects a new leader (preferably from ISR), updates the ISR, leader, leader_epoch and controller_epoch in the partition state znode.
The controller sends a LeaderAndISRRequest via RPC to the affected brokers.
Next Article Preview
The next part will dive deeper into Kafka HA exception handling, covering detailed broker failover steps, follower fetch mechanisms, partition reassignment procedures, and controller failure recovery.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java Backend Technology
Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
