Big Data 19 min read

Understanding Kafka High Availability: Data Replication and Leader Election

The article explains why Kafka introduced high availability starting with version 0.8, detailing the need for data replication and leader election, describing replica distribution algorithms, replication mechanics, ISR handling, ZooKeeper structures, and the broker failover process to ensure fault‑tolerant streaming.

Art of Distributed System Architecture Design
Art of Distributed System Architecture Design
Art of Distributed System Architecture Design
Understanding Kafka High Availability: Data Replication and Leader Election

Why Kafka Needs High Availability

Before version 0.8 Kafka had no high‑availability mechanism; if a broker failed, all its partitions became unavailable and data could be lost, contradicting Kafka’s goal of durable, reliable messaging.

Why Replication Is Required

Without replication, a broker crash makes all partitions on that broker unreadable and prevents producers from writing, leading to data loss and reduced system availability, especially as cluster size grows.

In synchronous producer mode, after message.send.max.retries (default 3) attempts, an exception is thrown, forcing the user to stop or continue sending, potentially causing data blockage or loss.

In asynchronous mode, the producer retries the same number of times, logs the exception, and continues, which also leads to data loss without a callback interface.

Why Leader Election Is Needed

After replication, each partition has multiple replicas; one replica must be elected as the leader that handles all reads and writes, while followers replicate from the leader, ensuring consistency and simplifying coordination.

Kafka HA Design Analysis

Evenly Distributing Replicas Across the Cluster

Kafka aims to balance load by spreading partitions and their replicas across brokers. A typical deployment has more partitions than brokers, and replicas of the same partition are placed on different machines to avoid a single point of failure.

The replica assignment algorithm is:

Sort all brokers and partitions.

Assign partition i to broker (i mod n).

Assign replica j of partition i to broker ((i + j) mod n).

Data Replication

Replication must answer how messages are propagated, how many replicas must acknowledge before an ACK is sent to the producer, how to handle failed replicas, and how to recover failed replicas.

Propagating Messages

Producers send messages to the partition leader, which writes to its local log. Followers pull data from the leader, write to their logs, and acknowledge. Once all in‑sync replicas (ISR) have ACKed, the leader commits the message, updates the high‑water mark (HW), and replies to the producer.

Followers ACK immediately after receiving data, improving throughput but meaning committed messages may reside only in memory, not yet on disk.

Consumers read only from the leader and see only committed messages (offset < HW).

Ensuring Sufficient Acknowledgments

A broker is considered alive if it maintains a ZooKeeper session and its followers stay within configured lag limits ( replica.lag.max.messages or replica.lag.time.max.ms).

Kafka uses an ISR list; followers that fall behind are removed from ISR. Kafka’s hybrid approach (neither fully synchronous nor fully asynchronous) balances durability and throughput.

Leader Election Algorithm

When a leader fails, Kafka must elect a new leader that contains all committed messages. Kafka prefers an ISR‑based election rather than a majority‑vote scheme, allowing the system to tolerate f failed replicas with only 2f+1 total replicas.

Kafka’s election resembles Microsoft’s PacificA algorithm and relies on ZooKeeper watches and a controller that coordinates leader changes via RPC.

Handling All Replicas Down

If all replicas of a partition are down, Kafka can either wait for an ISR replica to recover or elect the first alive replica (even if not in ISR). The latter sacrifices consistency for availability and was used in Kafka 0.8.*.

Leader Election Process

Followers set a ZooKeeper watch on the leader’s ephemeral node; when the leader disappears, all followers attempt to create the node, and the one that succeeds becomes the new leader.

Kafka 0.8.* introduced a controller that centralizes leader election, topic creation, and replica reassignment, reducing ZooKeeper load and avoiding split‑brain, herd effect, and overload issues.

HA‑Related ZooKeeper Structures

Key ZooKeeper znodes include /admin/preferred_replica_election, /admin/reassign_partitions, /admin/delete_topics, /brokers/ids/[brokerId], /brokers/topics/[topic], and /controller. Their JSON schemas define version, broker information, partition‑to‑replica mappings, ISR lists, leader IDs, and controller epochs.

{
  "fields":[
    {"name":"version","type":"int","doc":"version id"},
    {"name":"partitions","type":{"type":"array","items":{"fields":[
      {"name":"topic","type":"string","doc":"topic of the partition"},
      {"name":"partition","type":"int","doc":"partition number"}
    ]}}]
}

Broker Failover Process Overview

Controller watches ZooKeeper; when a broker node disappears, the watch fires and the controller reads the list of surviving brokers.

The controller builds a set of all partitions on the failed broker(s).

For each partition, it reads the current ISR, selects a new leader (preferring a surviving ISR replica), updates ISR, leader_epoch, and controller_epoch in the partition state znode, and sends a LeaderAndISRRequest via RPC.

The failover sequence is illustrated in the accompanying diagram.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

high availabilityZooKeeperKafkaleader election
Art of Distributed System Architecture Design
Written by

Art of Distributed System Architecture Design

Introductions to large-scale distributed system architectures; insights and knowledge sharing on large-scale internet system architecture; front-end web architecture overviews; practical tips and experiences with PHP, JavaScript, Erlang, C/C++ and other languages in large-scale internet system development.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.