
Why a Single Kafka Broker Failure Can Halt All Consumers – Deep Dive into HA

This article explains Kafka's multi‑replica design, the ISR mechanism, leader election rules, and producer acknowledgment settings, then shows how an under‑replicated built‑in __consumer_offsets topic can stall every consumer in the cluster when one broker crashes, and offers practical fixes.


Kafka's Multi‑Replica Redundancy Design

High availability in distributed systems like Kafka is achieved through redundancy. Kafka uses brokers (nodes), topics, partitions, and offsets to store and deliver messages.

Broker: a Kafka server instance.

Topic: a logical channel for messages.

Partition: a subdivision of a topic; each partition resides on a broker.

Offset: the position of a message within a partition.

When a broker fails, its partitions have replicas on other brokers. If the failed broker was the leader, a follower is elected as the new leader, allowing producers and consumers to continue operating.

Typical replication factor is 3, balancing fault tolerance and resource usage.
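The replication factor is fixed when a topic is created. A sketch with the stock CLI (the topic name orders, the partition count, and the localhost address are placeholders):

```shell
# Create a topic whose 6 partitions each get 3 replicas, so any
# single broker failure still leaves 2 copies of every partition.
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic orders \
  --partitions 6 --replication-factor 3
```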

ISR (In‑Sync Replica) Mechanism

Followers are considered in‑sync only if they have caught up with the leader. The ISR list contains these followers; out‑of‑sync followers are removed from the list.
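The catch-up check is governed by the broker setting replica.lag.time.max.ms: a follower that has neither reached the leader's log end offset nor caught up within that window is dropped from the ISR. A minimal sketch of the bookkeeping (illustrative only, not broker code; the data shapes are assumptions):

```python
# Simplified model of ISR membership, following Kafka's
# replica.lag.time.max.ms rule: a follower stays in the ISR only if it
# is at the leader's log end offset (LEO) or caught up recently enough.

LAG_TIME_MAX_MS = 10_000  # default replica.lag.time.max.ms

def update_isr(leader_leo, followers, now_ms):
    """followers: dict replica_id -> (last_caught_up_ms, follower_leo).
    Returns the set of replica ids considered in-sync."""
    isr = set()
    for replica_id, (last_caught_up_ms, leo) in followers.items():
        caught_up = leo >= leader_leo
        recently_caught_up = (now_ms - last_caught_up_ms) <= LAG_TIME_MAX_MS
        if caught_up or recently_caught_up:
            isr.add(replica_id)
    return isr

followers = {
    2: (95_000, 100),  # at the leader's offset, caught up 5s ago
    3: (80_000, 40),   # 60 messages behind, last caught up 20s ago
}
print(update_isr(leader_leo=100, followers=followers, now_ms=100_000))
# replica 3 is dropped from the ISR
```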

Leader Election After a Broker Crash

Kafka uses a controller to ensure a single leader per partition. When a leader disappears, the controller selects a new leader from the ISR list. If the ISR is empty, the partition goes offline, unless unclean.leader.election.enable is set, in which case any surviving replica can be elected at the risk of losing committed data.
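The controller's choice for one partition can be sketched as follows (a simplified model, not controller code; broker ids and the function shape are illustrative):

```python
# Sketch of leader election for a single partition. With unclean leader
# election disabled (the default), only an ISR member may become leader;
# enabling it trades possible data loss for availability.

def elect_leader(replicas, isr, live_brokers, unclean_ok=False):
    """replicas: replica assignment order; isr: in-sync replica ids;
    live_brokers: brokers currently alive.
    Returns the new leader id, or None if the partition goes offline."""
    for r in replicas:                      # prefer assignment order
        if r in isr and r in live_brokers:  # clean election: ISR only
            return r
    if unclean_ok:
        for r in replicas:                  # fall back to any live replica
            if r in live_brokers:
                return r                    # may lose committed messages
    return None                             # partition offline

# Broker 1 (old leader) died; broker 2 is in the ISR -> clean new leader.
print(elect_leader([1, 2, 3], isr={1, 2}, live_brokers={2, 3}))  # 2
# ISR contained only the dead leader: the partition stays offline
# unless unclean election is allowed.
print(elect_leader([1, 2, 3], isr={1}, live_brokers={3}))  # None
print(elect_leader([1, 2, 3], isr={1}, live_brokers={3}, unclean_ok=True))  # 3
```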

Producer Acknowledgment Settings (acks)

The acks producer setting (request.required.acks in the legacy producer) controls reliability:

acks=0: fire-and-forget; possible data loss.

acks=1: wait for the leader's acknowledgment only (the default in older clients; since Kafka 3.0 the Java producer defaults to all).

acks=all (or -1): wait for every replica in the ISR to acknowledge, the strongest durability guarantee.

Even with acks=all, if the ISR has shrunk to the leader alone, durability degrades to the behavior of acks=1.
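A common guard against that degraded case is to pair acks=all with the broker/topic setting min.insync.replicas, which makes brokers reject writes whenever the ISR shrinks below the threshold instead of silently accepting them. Illustrative values for a replication factor of 3:

```properties
# broker or topic level: with acks=all, writes fail with
# NotEnoughReplicas once fewer than 2 replicas are in sync
min.insync.replicas=2

# producer side
acks=all
```

This trades a little availability (writes error out during a double failure) for a guarantee that acknowledged messages exist on at least two brokers.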

Root Cause: The __consumer_offsets Topic

The internal __consumer_offsets topic stores committed consumer offsets. It is created with 50 partitions by default (offsets.topic.num.partitions), and in older deployments, typically clusters first started with a single broker, it often ended up with a replication factor of 1, all 50 partitions on one broker (before Kafka 0.11, auto-creation silently capped the replication factor at the number of live brokers). That broker is a single point of failure: when it goes down, consumer groups can no longer fetch or commit offsets, and consumption across the whole cluster stalls.
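To check whether a cluster is exposed, describe the internal topic and inspect the ReplicationFactor and replica placement in the output (the localhost address is a placeholder):

```shell
# Show the offsets topic's replication factor and where its
# partitions live; a replication factor of 1 is the danger sign.
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic __consumer_offsets
```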

Solutions

Set offsets.topic.replication.factor=3 in server.properties before the offsets topic is first created, so new clusters get a replicated __consumer_offsets topic from the start.

For an existing cluster, changing the setting does not alter a topic that has already been created. If losing committed offsets is acceptable, you can delete the existing __consumer_offsets topic (for example, by removing its log directories while the brokers are stopped) and let Kafka recreate it with the new replication factor; otherwise, raise the replication factor in place with a partition reassignment.
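A safer alternative to deleting log files is an in-place reassignment with the stock kafka-reassign-partitions.sh tool, fed a JSON file listing the desired replica set per partition. A sketch (broker ids 1-3, the localhost address, and the two-partition excerpt are illustrative; the default topic has 50 partitions):

```shell
# increase-rf.json lists three replicas for each partition
# (excerpt; repeat for all partitions, rotating the broker order
# so leadership spreads across the cluster)
cat > increase-rf.json <<'EOF'
{"version": 1,
 "partitions": [
   {"topic": "__consumer_offsets", "partition": 0, "replicas": [1, 2, 3]},
   {"topic": "__consumer_offsets", "partition": 1, "replicas": [2, 3, 1]}
 ]}
EOF

kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file increase-rf.json --execute
```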

With the offset topic replicated across brokers, a single broker failure no longer blocks consumer progress.

Tags: High Availability, Kafka, Replication, Leader Election, ISR, Consumer Offsets
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together.
