Why a Single Kafka Broker Failure Can Halt All Consumers – Deep Dive into HA
This article explains Kafka's multi-replica design, the ISR mechanism, leader election rules, and producer acknowledgment settings, then shows how the built-in __consumer_offsets topic, when left with a single replica, can make an entire cluster's consumers unavailable after one broker crashes, and offers practical fixes.
Kafka's Multi‑Replica Redundancy Design
High availability in distributed systems like Kafka is achieved through redundancy. Kafka uses brokers (nodes), topics, partitions, and offsets to store and deliver messages.
Broker: a Kafka server instance.
Topic: a logical channel for messages.
Partition: a subdivision of a topic; each partition resides on a broker.
Offset: the position of a message within a partition.
When a broker fails, its partitions (assuming a replication factor greater than 1) have replicas on other brokers. If the failed broker was a partition's leader, a follower is elected as the new leader, allowing producers and consumers to continue operating.
Typical replication factor is 3, balancing fault tolerance and resource usage.
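The redundancy argument can be sketched as a toy model in Python (this is an illustration, not Kafka's actual assignment code): with a replication factor of 3, any single broker failure still leaves two live replicas of every partition.

```python
# Toy model of partition replica placement (not real Kafka code).
# With replication factor 3 across 3 brokers, losing any one broker
# still leaves surviving replicas for every partition.

def assign_replicas(num_partitions, brokers, replication_factor):
    """Round-robin replica assignment, similar in spirit to Kafka's."""
    return {
        p: [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        for p in range(num_partitions)
    }

def survivors(assignment, failed_broker):
    """Replicas of each partition that remain after one broker fails."""
    return {p: [b for b in replicas if b != failed_broker]
            for p, replicas in assignment.items()}

assignment = assign_replicas(num_partitions=6, brokers=[0, 1, 2],
                             replication_factor=3)
after_crash = survivors(assignment, failed_broker=1)
# Every partition still has 2 live replicas after broker 1 dies:
assert all(len(replicas) == 2 for replicas in after_crash.values())
```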
ISR (In‑Sync Replica) Mechanism
Followers are considered in-sync only if they have recently caught up with the leader (by default, within the replica.lag.time.max.ms window). The ISR list contains the leader and these followers; a follower that falls too far behind is removed from the list.
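ISR membership can be sketched as a lag check. This is a simplified model: real brokers track each follower's fetch progress, but the governing setting, replica.lag.time.max.ms, is a genuine broker config (30 seconds by default).

```python
# Simplified ISR model: a follower stays in the ISR only if it has
# caught up with the leader within the allowed lag window.
REPLICA_LAG_MAX_MS = 30_000  # mirrors Kafka's replica.lag.time.max.ms default

def compute_isr(leader, followers, last_caught_up_ms, now_ms):
    """ISR = leader + followers that caught up recently enough."""
    isr = [leader]
    for f in followers:
        if now_ms - last_caught_up_ms[f] <= REPLICA_LAG_MAX_MS:
            isr.append(f)
    return isr

isr = compute_isr(leader=0, followers=[1, 2],
                  last_caught_up_ms={1: 95_000, 2: 40_000},
                  now_ms=100_000)
# Broker 2 is 60 s behind, so it fell out of the ISR:
assert isr == [0, 1]
```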
Leader Election After a Broker Crash
Kafka uses a controller broker to ensure each partition has exactly one leader. When a leader disappears, the controller selects a new leader from the ISR list. If the ISR is empty, the outcome depends on unclean.leader.election.enable: when enabled, the controller picks any surviving replica at the risk of losing messages; when disabled, the partition stays offline until an ISR member comes back.
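The controller's decision can be sketched as follows (a toy model of the election rule, not Kafka's source; unclean.leader.election.enable is the real broker setting being mimicked):

```python
# Simplified controller-side leader election.
# Prefer a live ISR member; fall back to any live replica only when
# unclean leader election is allowed (risking data loss).

def elect_leader(replicas, isr, live_brokers, unclean_allowed=False):
    for r in replicas:                 # replica list order is the preference order
        if r in isr and r in live_brokers:
            return r                   # clean election: no data loss
    if unclean_allowed:
        for r in replicas:
            if r in live_brokers:
                return r               # unclean election: messages may be lost
    return None                        # partition stays offline

# Leader 0 died; follower 2 is still in the ISR, so it takes over.
assert elect_leader([0, 1, 2], isr={2}, live_brokers={1, 2}) == 2
# Empty ISR: the partition goes offline unless unclean election is enabled.
assert elect_leader([0, 1, 2], isr=set(), live_brokers={1}) is None
assert elect_leader([0, 1, 2], isr=set(), live_brokers={1},
                    unclean_allowed=True) == 1
```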
Producer Acknowledgment Settings (acks)
The acks producer setting (request.required.acks in the legacy producer API) controls reliability:
0: fire-and-forget; possible data loss.
1: wait for the leader's acknowledgment only (the historical default; since Kafka 3.0 the Java producer defaults to acks=all).
all (or -1): wait for every ISR replica to acknowledge, providing the strongest durability guarantee.
Even with acks=all, if the ISR has shrunk to just the leader, durability degrades to the behavior of acks=1; setting min.insync.replicas (for example, to 2) makes the broker reject such writes instead of silently accepting them.
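The degradation described above can be made concrete with a toy model of the acknowledgment rule (an illustration of the semantics, not client code):

```python
# Toy model of producer acknowledgment semantics: how many replicas
# must confirm a write before the produce request is acknowledged.

def required_confirmations(acks, isr_size):
    if acks == 0:
        return 0              # fire-and-forget
    if acks == 1:
        return 1              # leader only
    if acks in ("all", -1):
        return isr_size       # every *current* ISR member
    raise ValueError(f"unknown acks value: {acks!r}")

# With a healthy ISR of 3, acks=all waits for 3 confirmations.
assert required_confirmations("all", isr_size=3) == 3
# If the ISR has shrunk to just the leader, acks=all behaves like acks=1.
assert required_confirmations("all", isr_size=1) == \
       required_confirmations(1, isr_size=3)
```

This is why min.insync.replicas matters: acks=all only waits for whoever is currently in the ISR, and the ISR can legally shrink to the leader alone.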
Root Cause: The __consumer_offsets Topic
The internal __consumer_offsets topic stores consumer offsets. By default it has 50 partitions, and in older Kafka versions (before 0.11) it could be created with a replication factor of 1 when only one broker was alive at creation time, leaving every partition on a single broker: a single point of failure. When that broker goes down, consumer groups can no longer commit or fetch offsets, so consumption across the whole cluster halts.
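The failure mode is easy to see in the same toy placement model used earlier (an illustration, not Kafka code): 50 single-replica partitions on one broker all go offline together, while replicated partitions survive.

```python
# Toy model of the single point of failure: offset partitions with
# no surviving replica go offline when their only broker dies.

def offline_partitions(assignment, failed_broker):
    """Partitions with no surviving replica after the broker fails."""
    return [p for p, replicas in assignment.items()
            if all(b == failed_broker for b in replicas)]

# All 50 offset partitions on broker 0 with replication factor 1:
single_replica = {p: [0] for p in range(50)}
assert len(offline_partitions(single_replica, failed_broker=0)) == 50

# With replication factor 3 spread over brokers 0-2, nothing goes offline:
replicated = {p: [p % 3, (p + 1) % 3, (p + 2) % 3] for p in range(50)}
assert offline_partitions(replicated, failed_broker=0) == []
```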
Solutions
Increase the replication factor of __consumer_offsets by setting offsets.topic.replication.factor to 3 (this setting takes effect when the topic is created).
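In server.properties on every broker, the relevant settings look like this (both are real broker configs; 3 is the recommended value from this article, and 50 partitions is already the default):

```properties
# server.properties
offsets.topic.replication.factor=3
offsets.topic.num.partitions=50
```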
Optionally delete the existing __consumer_offsets topic (by removing its log files on every broker) and let Kafka recreate it with the new replication factor. Note that this discards all committed offsets, so consumer groups will resume according to their auto.offset.reset policy.
With the offset topic replicated across brokers, a single broker failure no longer blocks consumer progress.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.