Big Data 20 min read

How Kafka Achieves High Availability: Replication & Leader Election Explained

This article explains why Kafka needs high availability, how data replication and leader election work together to keep partitions serving despite broker failures, and details the Zookeeper structures and broker failover process that enable Kafka's robust HA design.

Java Backend Technology
Java Backend Technology
Java Backend Technology
How Kafka Achieves High Availability: Replication & Leader Election Explained

Abstract

Before version 0.8 Kafka did not provide a High Availability (HA) mechanism; when one or more brokers crashed, all partitions on those brokers stopped serving, and permanent loss could cause data loss. One of Kafka's design goals is to ensure data persistence and high availability as cluster size grows.

Why Kafka Needs High Availability

Need for Replication

Without replication, a broker failure makes all its partitions unavailable, violating Kafka's persistence and delivery guarantees. In synchronous producer mode the client retries message.send.max.retries (default 3) before throwing an exception, potentially blocking data; in asynchronous mode the exception is logged and the client continues, leading to possible data loss.

Need for Leader Election

When replication is introduced, each partition has multiple replicas. One replica must be elected leader to handle all reads and writes while the others (followers) replicate data, simplifying consistency and reducing the complexity of N×N synchronization.

Kafka HA Design

Distributing Replicas Across the Cluster

Kafka strives to balance load by spreading partitions evenly across brokers and scattering replicas of the same partition across different machines. The algorithm is:

Sort all brokers (n brokers) and partitions.

Assign partition i to broker (i mod n).

Assign replica j of partition i to broker ((i + j) mod n).

Data Replication

Replication must address four problems: how to propagate messages, how many replicas must acknowledge before the producer receives an ACK, how to handle a non‑working replica, and how to recover a failed replica.

Propagate Messages

The producer discovers the partition leader via Zookeeper and sends the message only to that leader, which writes it to its local log. Each follower pulls the data from the leader, writes it to its own log, and immediately sends an ACK to the leader. Once the leader receives ACKs from all in‑sync replicas (ISR), the message is considered committed, the high‑watermark (HW) is advanced, and the leader replies to the producer. Consumers read only committed messages from the leader.

Kafka replication flow diagram
Kafka replication flow diagram

ACK Requirements

A broker is considered alive if it maintains a Zookeeper session and its follower stays within configured lag limits (either replica.lag.max.messages default 4000 or replica.lag.time.max.ms default 10000). The ISR list contains replicas that are in sync; lagging followers are removed from ISR.

Kafka uses an ISR‑based approach that is neither fully synchronous nor fully asynchronous, achieving a balance between durability and throughput. Producers can control acknowledgment behavior with request.required.acks, choosing to wait for all ISR replicas, a subset, or none.

Leader Election Algorithm

Kafka does not adopt a classic majority‑vote algorithm. Instead it uses a method similar to Microsoft’s PacificA, relying on the ISR set. When a leader fails, a new leader must contain all committed messages. The controller selects a new leader from the ISR (or any alive replica if ISR is empty) and updates the partition state via RPC, avoiding the split‑brain and herd‑effect problems of Zookeeper watches.

Handling All Replicas Down

If every replica of a partition is down, Kafka can either wait for an ISR replica to recover (favoring consistency) or immediately elect the first alive replica (favoring availability). Kafka 0.8 adopts the latter, allowing the partition to become available even if some data may be lost.

How to Elect a Leader

Followers set a watch on the leader’s Zookeeper node; when the node disappears, all followers attempt to create it, and the one that succeeds becomes the new leader. This method suffers from split‑brain, herd‑effect, and Zookeeper load issues. Kafka 0.8 therefore introduces a dedicated controller that performs leader election and partition reassignment, notifying brokers via RPC.

Zookeeper Structures for HA

Key znodes used by Kafka:

/admin/preferred_replica_election – JSON schema for preferred replica election.

/admin/reassign_partitions – JSON schema for moving partitions to new brokers.

/admin/delete_topics – JSON schema for deleting topics.

/brokers/ids/[brokerId] – stores live broker information (host, port, jmx_port).

/brokers/topics/[topic] – maps each partition to an ordered list of replica broker IDs.

/brokers/topics/[topic]/partitions/[partition]/state – contains version, ISR array, leader ID, controller_epoch, and leader_epoch.

/controller – current controller broker ID.

/controller_epoch – integer epoch of the controller.

Broker structure diagram
Broker structure diagram
Controller failover diagram
Controller failover diagram

Broker Failover Process

The controller watches broker znodes; when a broker disappears Zookeeper fires the watch.

The controller builds a set of all partitions that were hosted on the failed broker.

For each partition the controller reads the current ISR, selects a new leader (preferably from ISR), updates the ISR, leader, leader_epoch and controller_epoch in the partition state znode.

The controller sends a LeaderAndISRRequest via RPC to the affected brokers.

Broker failover flow diagram
Broker failover flow diagram

Next Article Preview

The next part will dive deeper into Kafka HA exception handling, covering detailed broker failover steps, follower fetch mechanisms, partition reassignment procedures, and controller failure recovery.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

ZooKeeperReplication
Java Backend Technology
Written by

Java Backend Technology

Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.