
Why Kafka Needs High Availability: Deep Dive into Replication and Leader Election

This article explains why Kafka introduced High Availability in version 0.8, covering the necessity of data replication and leader election, the internal replication and ACK mechanisms, Zookeeper metadata structures, broker failover procedures, and the command‑line tools that help manage and rebalance a Kafka cluster.

dbaplus Community

Why Kafka Needs High Availability

Before version 0.8 Kafka had no HA mechanism; a single broker failure made all partitions on that broker unavailable, and permanent loss of a broker could cause data loss. As clusters grow, the probability of multiple broker failures rises, so Kafka added HA to guarantee data durability and service continuity.

Why Replication Is Required

Without replication, a broker crash blocks both producers and consumers: producers either stop after a few retry attempts (synchronous mode) or silently drop messages (asynchronous mode). Replication ensures that at least one replica remains alive, keeping the system available even when individual brokers fail.

In synchronous producer mode, the producer retries message.send.max.retries (default 3) before throwing an exception.

In asynchronous mode, the exception is logged and the producer continues, leading to potential data loss.
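The difference between the two modes can be sketched as follows. This is a minimal illustration, not the producer's actual code: `send_once`, `BrokerUnavailable`, `send_sync`, and `send_async` are hypothetical names, and the sketch assumes the first attempt does not count as a retry.

```python
import logging


class BrokerUnavailable(Exception):
    """Raised by the hypothetical transport when the partition's broker is down."""


def send_sync(send_once, message, max_retries=3):
    """Synchronous mode: retry, then surface the failure to the caller.

    max_retries plays the role of message.send.max.retries (default 3).
    """
    last_error = None
    for _ in range(1 + max_retries):  # first attempt plus the retries
        try:
            return send_once(message)
        except BrokerUnavailable as err:
            last_error = err
    raise last_error


def send_async(send_once, message, max_retries=3):
    """Asynchronous mode: log the failure and carry on, so the message is lost."""
    try:
        send_sync(send_once, message, max_retries)
        return True
    except BrokerUnavailable as err:
        logging.error("dropping message %r: %s", message, err)
        return False
```

In the synchronous case the caller sees the exception and can react; in the asynchronous case the only trace of the lost message is a log line.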

Why Leader Election Is Required

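When a partition has multiple replicas, only one replica (the Leader) handles all reads and writes; the others (Followers) replicate from the Leader. Without a Leader, every replica would have to synchronize with every other replica, which makes consistency and ordering hard to guarantee. With a Leader, each Follower has a single synchronization path, which simplifies consistency and improves performance.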

Kafka HA Design Overview

Data Replication

Kafka must answer four questions: how to propagate messages, how many replicas must acknowledge before an ACK is sent to the producer, how to handle a failed replica, and how to recover a replica that comes back online.

Replication data flow diagram

Propagating Messages

A producer first finds the partition's Leader from the cluster metadata held in Zookeeper and sends the record to that Leader, which writes it to its local log. Each Follower pulls the record from the Leader, writes it to its own log, and acknowledges back to the Leader. Once every replica in the in-sync replica set (ISR) has ACKed, the record is considered committed: the Leader advances the high watermark (HW) and replies to the producer.
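The relationship between follower fetches and the high watermark can be sketched with a small in-memory model. The class and method names here (`Replica`, `Partition`, `follower_fetch`) are illustrative assumptions, not Kafka classes, and the model ignores networking, batching, and ISR changes.

```python
class Replica:
    """A replica with an in-memory log; the log end offset is its length."""

    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.log = []

    @property
    def log_end_offset(self):
        return len(self.log)


class Partition:
    """Toy model of one partition: a Leader plus Followers, all in the ISR."""

    def __init__(self, leader, followers):
        self.leader = leader
        self.isr = [leader] + list(followers)
        self.high_watermark = 0

    def append(self, record):
        # The Leader writes to its local log first.
        self.leader.log.append(record)
        self._advance_high_watermark()

    def follower_fetch(self, follower):
        # The Follower pulls whatever it is missing from the Leader.
        missing = self.leader.log[follower.log_end_offset:]
        follower.log.extend(missing)
        self._advance_high_watermark()

    def _advance_high_watermark(self):
        # A record is committed once every ISR member has it, so the
        # HW is the smallest log end offset across the ISR.
        self.high_watermark = min(r.log_end_offset for r in self.isr)

    def is_committed(self, offset):
        return offset < self.high_watermark
```

In this model the HW stays at 0 after `append` until the slowest ISR member has fetched the record, which is the point at which the Leader would reply to the producer.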

ACK Requirements

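Kafka treats a Follower as alive only if it keeps its Zookeeper session and keeps up with the Leader. A Follower that falls more than replica.lag.max.messages (default 4000) messages behind, or that has not fetched from the Leader within replica.lag.time.max.ms (default 10000 ms), is removed from the ISR. A message is committed once every ISR member has it, so the Leader waits only for ISR members before sending the ACK. This balances durability against throughput: a slow replica cannot stall the partition, yet every committed message exists on all in-sync replicas.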
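The two eviction rules can be expressed as a small predicate. This is a sketch under stated assumptions: `FollowerState` and `shrink_isr` are hypothetical names, and the real broker tracks these values internally rather than through a function like this.

```python
from dataclasses import dataclass


@dataclass
class FollowerState:
    broker_id: int
    log_end_offset: int   # last offset the follower has replicated
    last_fetch_ms: int    # timestamp of the follower's most recent fetch


def shrink_isr(leader_log_end_offset, followers, now_ms,
               lag_max_messages=4000, lag_time_max_ms=10000):
    """Return the broker ids of followers that stay in the ISR.

    lag_max_messages stands in for replica.lag.max.messages and
    lag_time_max_ms for replica.lag.time.max.ms.
    """
    kept = []
    for f in followers:
        too_far_behind = leader_log_end_offset - f.log_end_offset > lag_max_messages
        stuck = now_ms - f.last_fetch_ms > lag_time_max_ms
        if not (too_far_behind or stuck):
            kept.append(f.broker_id)
    return kept
```

A follower is dropped if either condition holds, so a replica that is fully caught up but has stopped fetching is removed just like one that is fetching but hopelessly behind.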

Leader Election Algorithm

Kafka does not use a classic majority vote; instead it relies on the ISR set. If a Leader fails, any ISR member can become the new Leader, because every ISR member holds all committed messages. With f+1 replicas, a partition can therefore tolerate up to f failed replicas without losing committed messages, whereas a majority vote needs 2f+1 replicas to tolerate the same f failures. Kafka’s algorithm resembles Microsoft’s PacificA and is more efficient for large-scale log replication.
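The replica-count trade-off is simple arithmetic. The helper names below are illustrative only; the sketch just compares how many replicas each scheme needs to survive f failures without losing committed messages.

```python
def replicas_needed_majority_vote(f):
    """A majority quorum must still exist after f failures: 2f + 1 replicas."""
    return 2 * f + 1


def replicas_needed_isr(f):
    """With ISR-based commit, one surviving in-sync replica is enough: f + 1."""
    return f + 1


def compare(f):
    """Return (majority-vote replicas, ISR replicas) for f tolerated failures."""
    return replicas_needed_majority_vote(f), replicas_needed_isr(f)
```

For example, tolerating two failures takes five replicas under a majority vote but three under the ISR scheme, which matters when every replica is a full copy of a large log.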

Broker Failover Process

1. The controller watches /brokers/ids in Zookeeper. When a broker disappears, the watch fires and the controller learns the current live broker list.

2. The controller builds a set set_p containing every partition that was hosted on the dead broker(s).

3. For each partition in set_p:

   3.1 Read the current ISR from /brokers/topics/[topic]/partitions/[partition]/state.

   3.2 Select a new Leader: if any ISR replica is alive, pick one; otherwise pick any surviving replica (potential data loss). If no replica survives, set Leader to -1.

   3.3 Write the new Leader, ISR, leader_epoch and controller_epoch back to the same Zookeeper node.

4. Send a LeaderAndIsrRequest via RPC to the affected brokers (the controller can batch multiple requests).
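The leader-selection rule in the failover procedure above can be sketched as a pure function. `select_leader` is a hypothetical helper, not controller code; choosing the first eligible replica and the ISR values returned in the fallback cases are assumptions made for illustration.

```python
NO_LEADER = -1


def select_leader(assigned_replicas, isr, live_brokers):
    """Pick a new Leader for one partition after a broker failure.

    Returns (leader, new_isr, possible_data_loss).
    """
    # Preferred case: a live ISR member holds every committed message.
    live_isr = [b for b in isr if b in live_brokers]
    if live_isr:
        return live_isr[0], live_isr, False

    # Fallback: any surviving replica, which may be missing committed messages.
    live_replicas = [b for b in assigned_replicas if b in live_brokers]
    if live_replicas:
        return live_replicas[0], [live_replicas[0]], True

    # No replica survives: the partition has no Leader.
    # Returning an empty ISR here is a simplification of this sketch.
    return NO_LEADER, [], False
```

For instance, `select_leader([1, 2, 3], [1, 2], {2, 3})` picks broker 2 because it is the only live ISR member, while `select_leader([1, 2, 3], [1], {3})` falls back to broker 3 and flags possible data loss.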

Broker failover sequence diagram

Zookeeper Metadata Structures

Kafka stores cluster state in a hierarchy of Zookeeper znodes. Key paths include:

/admin/preferred_replica_election – JSON schema describing the preferred replica election request.

/admin/reassign_partitions – JSON schema for partition reassignment plans.

/admin/delete_topics – JSON schema for topics pending deletion.

/brokers/ids/[brokerId] – Information about live brokers (host, port, JMX port).

/brokers/topics/[topic] – Mapping of partitions to their replica lists; the first replica is the preferred replica.

/brokers/topics/[topic]/partitions/[partition]/state – Current leader, ISR, and epoch information.

/controller – The current controller broker ID.

Schema of the /admin/preferred_replica_election node:

{
  "fields": [
    {"name": "version", "type": "int", "doc": "version id"},
    {"name": "partitions",
     "type": {"type": "array", "items": {
       "fields": [
         {"name": "topic", "type": "string", "doc": "topic of the partition"},
         {"name": "partition", "type": "int", "doc": "partition id"}
       ]
     }},
     "doc": "array of partitions for which preferred replica election should be triggered"}
  ]
}
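A concrete document matching the schema above can be built as follows. This is a minimal sketch: the function name is hypothetical, the topic names in the usage note are made up, and the default `version=1` is an assumption rather than something the schema states.

```python
import json


def preferred_replica_election_plan(topic_partitions, version=1):
    """Build a JSON document with a version id and a list of partitions.

    topic_partitions is an iterable of (topic, partition_id) pairs.
    version=1 is an assumed value for illustration.
    """
    plan = {
        "version": version,
        "partitions": [
            {"topic": topic, "partition": partition}
            for topic, partition in topic_partitions
        ],
    }
    return json.dumps(plan, indent=2)
```

Calling `preferred_replica_election_plan([("orders", 0), ("orders", 1)])` yields a document with one entry per partition, each carrying the `topic` and `partition` fields the schema describes.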

Operational Tools

Kafka ships with several command-line utilities to manage HA:

$KAFKA_HOME/bin/kafka-topics.sh – Create, delete, describe topics and modify configuration parameters such as unclean.leader.election.enable, retention.ms, etc.

$KAFKA_HOME/bin/kafka-replica-verification.sh – Verify that all replicas of selected topics are in sync.

$KAFKA_HOME/bin/kafka-preferred-replica-election.sh – Trigger preferred-replica leader election for partitions whose preferred replica is not the current leader.

$KAFKA_HOME/bin/kafka-reassign-partitions.sh – Generate, execute, and verify partition reassignment plans (including changing replication factor).

bin/kafka-run-class.sh kafka.tools.StateChangeLogMerger – Merge state-change logs from all brokers to help diagnose leader-election or failover issues.

Several of these tools work through the Zookeeper paths described above: the preferred-replica election and partition reassignment tools write their JSON plans to /admin/* nodes, and the controller performs the actual state changes.

Balancing Leader Distribution

Kafka can automatically rebalance leaders when auto.leader.rebalance.enable is true. The controller periodically checks leader distribution (interval controlled by leader.imbalance.check.interval.seconds) and moves leaders to their preferred replicas if the imbalance exceeds leader.imbalance.per.broker.percentage. Manual rebalancing can also be performed with the preferred‑replica election tool.
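The imbalance check can be sketched as a ratio per broker. This is an approximation under stated assumptions: the function names are hypothetical, the ratio is taken as the share of partitions for which a broker is the preferred replica but not the current leader, and the default threshold of 10 percent is assumed rather than taken from the text.

```python
def imbalance_ratio(broker_id, assignments, leaders):
    """Share of a broker's preferred partitions that it does not currently lead.

    assignments maps (topic, partition) -> replica list (first is preferred).
    leaders maps (topic, partition) -> current leader broker id.
    """
    preferred = [tp for tp, replicas in assignments.items()
                 if replicas and replicas[0] == broker_id]
    if not preferred:
        return 0.0
    not_led = [tp for tp in preferred if leaders.get(tp) != broker_id]
    return len(not_led) / len(preferred)


def needs_rebalance(broker_id, assignments, leaders, per_broker_percentage=10):
    """True if the imbalance exceeds leader.imbalance.per.broker.percentage.

    The default of 10 is an assumption of this sketch.
    """
    ratio = imbalance_ratio(broker_id, assignments, leaders)
    return ratio * 100 > per_broker_percentage
```

If broker 1 is the preferred replica for three partitions but leads only two of them, its ratio is one third, which exceeds a 10 percent threshold and would qualify it for a preferred-replica election.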

Overall, Kafka’s HA design combines data replication, ISR‑based commit semantics, lightweight Zookeeper coordination, and a set of admin utilities to provide fault‑tolerant, high‑throughput messaging for large‑scale data pipelines.
