
Why a Single Kafka Broker Failure Can Halt the Entire Cluster

This article explains Kafka's high‑availability architecture, covering multi‑replica redundancy, ISR synchronization, producer acks settings, and the critical role of the internal __consumer_offsets topic, and shows how to configure replication factors so that a single‑node outage does not stop consumption.


1. Kafka Outage Triggers High‑Availability Issues

The problem began with a Kafka outage at a fintech company that had chosen Kafka over RabbitMQ. Although the cluster runs stably most of the time, the entire consumer group occasionally stops receiving messages when one of the three broker nodes goes down.

2. Kafka's Multi‑Replica Redundancy Design

High availability in distributed systems such as ZooKeeper, Redis, Kafka, and HDFS is typically achieved through redundancy. Key Kafka concepts include:

Broker (Node) : a Kafka server, i.e., a physical node.

Topic : a logical category for messages; producers send to a topic name, consumers read from it.

Partition : each topic is split into one or more partitions; each replica of a partition resides on a single broker, and partitions are spread across the cluster.

Offset : the position of a message within a partition, used by consumers to track progress.
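The relationship between these concepts can be sketched as a minimal in‑memory model (illustrative only; the class and variable names are invented for this sketch, and real Kafka persists partitions as on‑disk logs):

```python
# Minimal in-memory model of topics, partitions, and offsets.
# Illustrative only -- real Kafka persists partitions as append-only
# on-disk logs and replicates them across brokers.

class Partition:
    def __init__(self):
        self.log = []               # append-only list of messages

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1    # offset of the new message

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]

    def produce(self, key, message):
        # Keyed messages always land in the same partition.
        p = hash(key) % len(self.partitions)
        return p, self.partitions[p].append(message)

topic = Topic("payments", num_partitions=6)
p, offset = topic.produce("user-42", "order created")
# A consumer tracks (topic, partition, offset) to know its position.
```

A consumer that remembers the last offset it processed per partition can resume exactly where it left off after a restart.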

Before version 0.8, Kafka had no replication; a broker failure meant loss of all partitions on that broker. Since 0.8, each partition has a leader and one or more followers. Producers and consumers interact only with the leader; followers replicate data from the leader.

When a broker crashes, new leaders for its partitions are elected from the ISR (in‑sync replica) list. If the ISR is empty, a new leader can be chosen from any surviving replica only when unclean leader election is enabled (unclean.leader.election.enable=true), at the risk of data loss; otherwise the partition stays offline until an ISR member returns.
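The election rule described above can be sketched as a small function (a simplified illustration, not the controller's actual implementation; the allow_unclean flag models unclean.leader.election.enable):

```python
# Sketch of leader election after a broker failure (simplified).
# Prefer an ISR member on a live broker; fall back to any surviving
# replica only if unclean election is allowed.

def elect_leader(replicas, isr, live_brokers, allow_unclean=False):
    # Prefer an in-sync replica on a surviving broker.
    for broker in replicas:
        if broker in isr and broker in live_brokers:
            return broker
    # Otherwise, optionally pick any live replica (may lose data).
    if allow_unclean:
        for broker in replicas:
            if broker in live_brokers:
                return broker
    return None  # partition stays offline

# Broker 1 (the old leader) dies; broker 2 is still in the ISR.
print(elect_leader(replicas=[1, 2, 3], isr={1, 2}, live_brokers={2, 3}))
```

Note the trade-off the flag encodes: with allow_unclean off, availability is sacrificed to avoid losing acknowledged writes; with it on, the partition comes back sooner but may truncate data the failed leader never replicated.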

How Many Replicas Are Sufficient?

Three replicas are generally enough to guarantee high availability; more replicas increase resource consumption and may degrade performance.

What If Followers Are Not Fully Synchronized with the Leader?

Kafka uses the ISR mechanism. The leader maintains an ISR list of followers that are sufficiently up‑to‑date. Followers that fall behind are removed from the ISR, ensuring only synchronized replicas are considered for leader election.
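The pruning rule can be sketched as follows (simplified; modeled on the broker setting replica.lag.time.max.ms, which bounds how long a follower may go without catching up to the leader's log end):

```python
# Sketch: the leader drops followers from the ISR when they have not
# caught up within a lag window (modeled on replica.lag.time.max.ms).

REPLICA_LAG_TIME_MAX_MS = 30_000

def shrink_isr(isr, last_caught_up_ms, now_ms):
    """Keep only replicas that caught up to the leader recently."""
    return {
        follower for follower in isr
        if now_ms - last_caught_up_ms[follower] <= REPLICA_LAG_TIME_MAX_MS
    }

# Follower 3 last caught up 45s ago -> it is removed from the ISR.
isr = shrink_isr(
    isr={1, 2, 3},
    last_caught_up_ms={1: 100_000, 2: 95_000, 3: 55_000},
    now_ms=100_000,
)
print(sorted(isr))  # [1, 2]
```

Once follower 3 catches back up, the leader adds it to the ISR again, so membership is dynamic rather than fixed at topic creation.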

Leader Election After a Broker Failure

Kafka’s controller (a single broker elected to manage cluster metadata) selects the new leader from the ISR list. Because exactly one controller performs the election, two replicas cannot simultaneously believe they are the leader, which prevents split‑brain scenarios.

3. ACK Settings Determine Reliability

The producer configuration request.required.acks (named acks in modern clients) controls how many replicas must acknowledge a write before it is considered successful:

0 : The producer does not wait for any acknowledgment; messages may be lost.

1 : Only the leader’s acknowledgment is required; if the leader fails before followers replicate the write, data can be lost. This was the default in producer clients before Kafka 3.0.

all (or -1 ): All in‑sync replicas must acknowledge the write, providing the strongest durability guarantee. However, if the ISR has shrunk to only the leader, all behaves exactly like 1; the broker setting min.insync.replicas exists to reject writes in that situation.
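The degradation described above can be sketched as a tiny decision function (illustrative only, not the actual producer implementation):

```python
# Sketch: how many replica acknowledgments does the producer wait for
# under each acks setting, given the current ISR size?

def acks_required(acks, isr_size):
    if acks == "0":
        return 0            # fire and forget
    if acks == "1":
        return 1            # leader only
    if acks in ("all", "-1"):
        return isr_size     # every in-sync replica
    raise ValueError(f"unknown acks setting: {acks}")

# With a healthy ISR of 3, acks=all waits for 3 acknowledgments...
print(acks_required("all", isr_size=3))  # 3
# ...but if the ISR has shrunk to just the leader, it degrades to acks=1.
print(acks_required("all", isr_size=1))  # 1
```

This is why durable deployments pair acks=all with min.insync.replicas=2: rather than silently degrading, the broker fails the write when too few replicas are in sync.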

4. Solving the Consumer‑Offset Problem

In the test environment, the cluster has three brokers, a topic with replication factor 3, six partitions, and acks=1. When one broker fails, the cluster re‑elects leaders for the data topic, but the internal __consumer_offsets topic had been created with a replication factor of 1, making it a single point of failure. If the broker holding its partitions dies, every consumer group whose offsets live on that broker stops.

To fix this:

Delete the existing __consumer_offsets topic (it cannot be removed through the standard tooling, so its log directories were cleared instead).

Set offsets.topic.replication.factor=3 in the broker configuration so the topic is recreated with three replicas.
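The broker‑side settings involved might look like the following server.properties fragment (values assume a three‑broker cluster; the min.insync.replicas line is an optional hardening step beyond the fix described above, and the setting only affects internal topics created after it is applied):

```properties
# server.properties -- applies when the internal topic is (re)created
offsets.topic.replication.factor=3

# Optional hardening: with acks=all, reject writes when fewer than
# two replicas are in sync.
# min.insync.replicas=2
```

Because offsets.topic.replication.factor is read only at topic creation time, the existing single‑replica __consumer_offsets topic has to be removed first, which is exactly why the fix above deletes it before restarting the brokers.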

After replicating __consumer_offsets, consumer groups continue working even when a broker goes down.

One question remains: why were the __consumer_offsets partitions initially placed on a single broker instead of being distributed across the cluster?
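Part of the answer is that each consumer group's offsets live in exactly one partition of __consumer_offsets: Kafka maps a group id to a partition by hashing it with Java's String.hashCode, masked to a non‑negative value, modulo the partition count (50 by default). A Python re‑implementation of that mapping, for illustration (the group name below is invented):

```python
# Which __consumer_offsets partition stores a given group's offsets?
# Kafka computes: (groupId.hashCode & 0x7FFFFFFF) % num_partitions,
# with 50 partitions by default. Re-implemented here in Python.

def java_string_hashcode(s):
    """Java's String.hashCode(), as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def offsets_partition_for(group_id, num_partitions=50):
    return (java_string_hashcode(group_id) & 0x7FFFFFFF) % num_partitions

print(offsets_partition_for("payment-consumers"))
```

So with a replication factor of 1, the single broker hosting a group's offsets partition is a single point of failure for that group, even though the 50 partitions as a whole are spread across the cluster.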

Tags: distributed systems, High Availability, kafka, Replication, consumer offset
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.