
Why a Single Kafka Broker Failure Stops All Consumers – Understanding HA

This article explains Kafka's high-availability mechanisms, covering multi-replica design, ISR synchronization, leader election, the impact of the request.required.acks setting, and how the internal __consumer_offsets topic can become a single point of failure, with concrete steps to fix it.

Efficient Ops

The story starts with a Kafka outage at a fintech company that uses Kafka instead of RabbitMQ, prompting an investigation into why a single broker failure caused all consumers to stop receiving messages.

Kafka's Multi‑Replica Redundancy Design

High availability in distributed systems like Zookeeper, Redis, Kafka, and HDFS is typically achieved through redundancy.

Key Kafka concepts:

Broker (node) : a Kafka server instance.

Topic : a logical category for messages; producers send to a topic name, consumers read from it.

Partition : each topic is split into one or more partitions; each partition has a leader hosted on a single broker, with replicas optionally placed on other brokers.

Offset : the position of a message within a partition, used by consumers to track progress.
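How these concepts fit together can be sketched in a few lines of Python. This is a toy in-memory model for illustration only, not the Kafka client API:

```python
# Toy model: a topic is a set of partitions, each partition is an
# append-only log, and a consumer tracks its position per partition
# with an offset.

class Partition:
    def __init__(self):
        self.log = []                   # append-only message log

    def append(self, msg):
        self.log.append(msg)
        return len(self.log) - 1        # offset of the new message

topic = {0: Partition(), 1: Partition()}    # a topic with 2 partitions

# Producer side: route the message to a partition (here by key hash).
offset = topic[hash("order-42") % 2].append("payment accepted")

# Consumer side: read from the same partition starting at a saved offset.
consumer_offset = 0
partition = topic[hash("order-42") % 2]
messages = partition.log[consumer_offset:]
consumer_offset += len(messages)            # commit progress
```

After the read, `messages` holds the single produced record and `consumer_offset` advances to 1, which is exactly the bookkeeping a real consumer commits back to Kafka.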

When a broker goes down, its partitions have replicas on other brokers. If the leader of a partition fails, a follower from the ISR (In‑Sync Replica) list is elected as the new leader, allowing producers and consumers to continue operating.

A replication factor of 3 is typical, balancing fault tolerance against storage and network overhead.

Followers and leaders are not fully synchronous; the leader maintains an ISR list of replicas that are sufficiently up-to-date. A follower that falls too far behind is removed from the ISR.

Leader election after a broker failure chooses a replica from the ISR list; if the ISR is empty, an out-of-sync surviving replica can become leader only when unclean leader election is enabled, at the risk of losing messages.
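The failover rule described above can be sketched as a small decision function. This is not Kafka's actual controller code, just the logic the article describes, assuming an unclean-election flag:

```python
# Sketch of the failover decision for one partition: prefer an ISR
# member; fall back to any live replica only if unclean leader
# election is allowed (risking data loss).

def elect_leader(replicas, isr, live_brokers, unclean_allowed=False):
    # Prefer the first in-sync replica that is still alive.
    for broker in isr:
        if broker in live_brokers:
            return broker, False          # (new leader, data-loss risk)
    # ISR is empty, or all of its members are down.
    if unclean_allowed:
        for broker in replicas:
            if broker in live_brokers:
                return broker, True       # may lag behind the old leader
    return None, False                    # partition goes offline

# Replicas on brokers 1, 2, 3; the leader (broker 1) dies.
leader, risky = elect_leader(replicas=[1, 2, 3], isr=[1, 2],
                             live_brokers={2, 3})
# Broker 2 takes over safely: it was in the ISR and is still alive.
```

With an empty ISR and unclean election disabled, the function returns no leader and the partition simply becomes unavailable, which is Kafka's safe default.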

Ack Parameter Determines Reliability

The producer configuration request.required.acks (simply acks in newer clients) controls how many acknowledgments are required before a send is considered successful:

0 : fire-and-forget; the producer does not wait for any acknowledgment, so messages may be lost.

1 (the historical default): only the leader must acknowledge; if the leader crashes before any follower replicates the message, it is lost.

all (or -1) : the leader and all current ISR followers must acknowledge. This gives the strongest durability guarantee, but only when the ISR contains at least two replicas; Kafka enforces that floor with min.insync.replicas.
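The three acknowledgment levels can be modeled as a single predicate. This is a toy model of the rules above, not the real broker protocol; the parameter names are this sketch's own:

```python
# Toy model of when a producer send is acknowledged under each acks
# setting (0, 1, "all"/-1). isr_acked counts ISR followers that have
# replicated the message; isr_size is the current ISR follower count.

def send_succeeds(acks, leader_wrote, isr_acked, isr_size):
    if acks == 0:
        return True                   # fire-and-forget: never waits
    if acks == 1:
        return leader_wrote           # only the leader must persist it
    if acks in ("all", -1):
        # Leader plus every current ISR follower must acknowledge.
        return leader_wrote and isr_acked == isr_size
    raise ValueError(f"unknown acks setting: {acks}")

# acks=1 succeeds once the leader has the message, even though no
# follower has replicated it yet -- a leader crash now loses the data.
# acks="all" in the same situation does not succeed until the ISR
# followers catch up.
```

Note the edge case the model exposes: if the ISR has shrunk to the leader alone (isr_size of 0), acks=all passes trivially, which is why real deployments pair it with min.insync.replicas >= 2.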

Solving the Problem

The root cause was the internal __consumer_offsets topic, which in this cluster had been created with a replication factor of 1 and 50 partitions, all hosted on a single broker. When that broker failed, consumers could no longer read or commit their offsets, halting consumption across the cluster.

Fixes:

Delete the existing __consumer_offsets topic. Kafka does not allow removing it through the admin tooling directly, so its log files must be deleted from disk.

Set offsets.topic.replication.factor to 3 in the broker configuration, so the recreated offsets topic is replicated across three brokers.

After increasing the replication factor, the offsets topic is no longer a single point of failure, and the cluster remains operational even when one broker is down.
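Concretely, the broker-side part of the fix is a one-line setting in server.properties. The values below reflect this article's recommendation; note the setting only takes effect when the offsets topic is (re)created, which is why the old single-replica topic is removed first:

```properties
# server.properties -- make the internal offsets topic fault tolerant.
offsets.topic.replication.factor=3
# Partition count for the offsets topic (50 is the default, shown for completeness).
offsets.topic.num.partitions=50
```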


Tags: distributed systems, High Availability, Kafka, Replication, Leader Election, consumer offset
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations and aim to accompany you throughout your operations career.
