Big Data 19 min read

Kafka High Availability Design: Data Replication and Leader Election

This article explains why Kafka introduced high‑availability features after version 0.8, detailing the necessity of data replication and leader election, describing Kafka’s replica distribution algorithm, replication mechanics, acknowledgment requirements, leader‑election strategies, Zookeeper structures, and the broker failover process.

Architect
Architect
Architect
Kafka High Availability Design: Data Replication and Leader Election

Summary

Kafka did not provide a High Availability (HA) mechanism before version 0.8; a broker failure would make all its partitions unavailable and could cause data loss. Starting with 0.8, Kafka introduced HA through data replication and leader election to ensure durability and fault tolerance in large clusters.

Why Kafka Needs High Availability

Why Replication Is Needed

Without replication, a broker crash renders all its partitions unreadable and prevents producers from writing new data, violating Kafka’s delivery guarantees. Replication ensures that the failure of one or more machines does not reduce overall system availability, which becomes increasingly important as cluster size grows.

Why Leader Election Is Needed

When a partition has multiple replicas, one replica must act as the leader that handles all reads and writes, while the others follow. This design simplifies consistency management, reduces the number of synchronization paths, and improves performance compared to a fully peer‑to‑peer approach.

Kafka HA Design Analysis

Even Distribution of Replicas Across the Cluster

Kafka aims to spread partitions and their replicas evenly across brokers. A typical deployment has more partitions than brokers, and replicas of the same partition are placed on different machines to avoid a single point of failure. The allocation algorithm sorts brokers and partitions, then assigns the i‑th partition to broker (i mod n) and its j‑th replica to broker ((i + j) mod n).

Data Replication

Replication must answer four questions: how messages are propagated, how many replicas must acknowledge before an ACK is sent to the producer, how to handle non‑working replicas, and how to recover failed replicas.

Propagating Messages

Producers send messages only to the leader of a partition (found via Zookeeper). The leader writes to its local log; followers pull from the leader, keeping the same order. After a follower writes the message, it sends an ACK to the leader. When all in‑sync replicas (ISR) have ACKed, the leader marks the message as committed and replies to the producer.

ACK Requirements

Kafka tracks the ISR list; a follower that falls too far behind (configurable via replica.lag.max.messages or replica.lag.time.max.ms) is removed from ISR. Only messages acknowledged by all ISR members are considered committed, balancing durability and throughput. Producers can control the required number of acknowledgments with request.required.acks.

Leader Election Algorithm

When a leader fails, Kafka must elect a new leader from the remaining replicas. Kafka uses an ISR‑based approach similar to Microsoft’s PacificA algorithm rather than a strict majority‑vote. Any replica in ISR can become the new leader, allowing the system to tolerate up to f failed replicas while still guaranteeing that committed messages are not lost.

Handling All Replicas Down

If every replica of a partition is down, Kafka can either wait for any replica to recover and become leader, or immediately promote the first alive replica (even if not in ISR). The former improves availability but may increase downtime; the latter favors faster recovery at the possible cost of data loss.

How to Elect a Leader

Followers set a watch on Zookeeper; when the leader’s ephemeral znode disappears, all followers attempt to create it, and the one that succeeds becomes the new leader. To avoid split‑brain, herd effect, and Zookeeper overload, Kafka elects a single controller broker that performs leader election and notifies brokers via RPC.

HA‑Related Zookeeper Structure

The Zookeeper hierarchy stores admin commands, broker registrations, topic metadata, partition state, and controller information. Important znodes include /admin/preferred_replica_election, /admin/reassign_partitions, /admin/delete_topics, /brokers/ids/[brokerId], /brokers/topics/[topic], and /brokers/topics/[topic]/partitions/[partitionId]/state. Schemas for each znode are defined in JSON format and examples are provided in the original article.

Broker Failover Process Overview

When a broker dies, its znode is removed, triggering the controller’s watch.

The controller builds a set of all partitions that were hosted on the failed broker.

For each partition, the controller reads the current ISR, selects a new leader (preferring a surviving ISR member), updates the ISR, and writes the new leader and epoch information back to Zookeeper.

The controller sends a LeaderAndISRRequest to the affected brokers via RPC, completing the failover.

Next Part Preview

The next article will cover detailed handling of Kafka HA exceptions, including broker failover procedures, follower fetch mechanisms, replica reassignment, and controller failure recovery.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

ZooKeeperKafkaReplicationleader election
Architect
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.