
Why a Single Kafka Broker Crash Can Halt All Consumers – The HA Explained

An in‑depth look at Kafka’s high‑availability architecture: how multi‑replica redundancy, the ISR mechanism, and the configuration of the internal __consumer_offsets topic interact, why a single broker failure can render an entire cluster unusable, and how to configure replication and acks settings to prevent it.

Efficient Ops

1. High‑Availability Issues Triggered by a Kafka Outage

The story starts with a Kafka node failure. At a fintech company that had chosen Kafka over RabbitMQ, the cluster ran stably until one of its three broker nodes went down; consumers abruptly stopped receiving messages, and the entire consumer group lost access.

2. Kafka’s Multi‑Replica Redundancy Design

High availability in distributed systems such as ZooKeeper, Redis, Kafka, and HDFS is typically achieved through redundancy. Key Kafka concepts are:

Broker (node): a Kafka server, i.e., a physical node.

Topic: a logical category for messages; producers send to a topic by name and consumers read from it.

Partition: each topic is split into one or more partitions; each partition is an ordered, append-only log of messages whose leader resides on a single broker.

Offset: the position of a message within a partition; consumers track offsets to know which messages they have already read.

Before Kafka 0.8 there was no replica mechanism, so a broker crash meant data loss for its partitions. Since 0.8, each partition can have a leader replica and one or more follower replicas. Producers and consumers interact only with the leader; followers replicate data by asynchronously fetching from the leader.

When a broker fails, leadership for its partitions is re‑elected from the ISR (In‑Sync Replica) set. If the ISR is empty, a new leader can only be chosen from out‑of‑sync surviving replicas — an “unclean” election, controlled by unclean.leader.election.enable — which risks data loss.
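The failover logic above can be sketched as a toy simulation. This is illustrative only — real Kafka runs this election inside the controller, and the function and variable names here are invented for the example:

```python
# Toy model of leader failover for a single Kafka partition.
# Illustrative sketch, not Kafka's actual controller code.

def elect_leader(replicas, isr, alive, unclean_election=False):
    """Pick a new leader after the current one dies.

    replicas: all replica broker ids, in preference order
    isr: broker ids that were in sync with the old leader
    alive: broker ids still running
    """
    # Preferred path: first surviving in-sync replica becomes leader.
    for broker in replicas:
        if broker in isr and broker in alive:
            return broker, False  # no data loss expected
    # ISR exhausted: only an "unclean" election can proceed, promoting
    # an out-of-sync replica and potentially losing messages.
    if unclean_election:
        for broker in replicas:
            if broker in alive:
                return broker, True  # possible data loss
    return None, False  # partition stays offline

# Broker 1 (the leader) crashes; broker 2 is in sync, broker 3 lags.
leader, lossy = elect_leader(replicas=[1, 2, 3], isr=[1, 2], alive=[2, 3])
print(leader, lossy)  # 2 False
```

Note the trade-off the last branch encodes: with unclean election disabled (the safe default in modern Kafka), an empty ISR leaves the partition offline rather than serving stale data.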

3. The acks Parameter Determines Reliability

Setting acks=0

The producer sends the message without waiting for any acknowledgment, so the message is lost if the broker crashes before persisting it.

Setting acks=1

The producer considers the send successful once the leader acknowledges the write; followers may not have replicated it yet, so a leader crash can still lose the message. This was the client default before Kafka 3.0 (newer clients default to acks=all).

Setting acks=all (or -1)

The send succeeds only once the leader and all ISR followers have the message. However, if the ISR has shrunk to just the leader, acks=all degenerates to acks=1; pairing it with min.insync.replicas (e.g., 2) makes the broker reject writes in that state instead of silently weakening the guarantee.
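The three acks settings can be captured in a minimal model. This is a toy sketch of the semantics, not the real replication protocol; the function and its parameters are invented for illustration:

```python
# Toy model of producer acks semantics. Illustrative only.

def send_ok(leader_ok, synced_followers, isr_size, acks):
    """Return True if the producer would treat the send as successful.

    leader_ok: whether the leader persisted the message
    synced_followers: ISR followers that replicated it
    isr_size: current ISR size, including the leader
    acks: 0, 1, or "all"
    """
    if acks == 0:
        return True                      # fire and forget
    if acks == 1:
        return leader_ok                 # leader write only
    if acks == "all":
        # Wait for every current ISR member. If the ISR has shrunk
        # to the leader alone, this degenerates to acks=1.
        return leader_ok and synced_followers >= isr_size - 1
    raise ValueError(acks)

# ISR shrunk to just the leader: acks=all no longer implies replication.
print(send_ok(leader_ok=True, synced_followers=0, isr_size=1, acks="all"))  # True
```

The last line is exactly the degenerate case the article warns about, and the reason min.insync.replicas exists: it turns that silent True into a rejected write.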

4. Solving the Problem

In the test environment there were 3 brokers, the business topic had a replication factor of 3 and 6 partitions, and producers used acks=1. When one broker went down, the cluster elected new partition leaders from the ISR as expected — but the internal __consumer_offsets topic had been auto-created with a replication factor of 1, making it a single point of failure: its partitions hosted on the dead broker simply vanished, so consumers could neither commit nor fetch their offsets. The topic cannot be deleted through the normal admin path; the author removed its log files on disk to force re-creation, then set

offsets.topic.replication.factor=3

to give it three replicas.

After replicating __consumer_offsets, a broker crash no longer stops consumer consumption.

In short, the fix was:

1. Delete the __consumer_offsets topic (by removing its log files from disk).
2. Set offsets.topic.replication.factor=3 so the topic is re-created with three replicas.
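On the broker side, the relevant settings look roughly like this (a server.properties sketch; min.insync.replicas is an addition beyond the article's fix, worth pairing with acks=all producers):

```properties
# server.properties (broker-side)

# Replicate the internal offsets topic across three brokers.
# Note: this only takes effect when the topic is (re)created.
offsets.topic.replication.factor=3

# Sensible default replication for auto-created business topics.
default.replication.factor=3

# With acks=all producers, reject writes once fewer than 2 replicas
# are in sync, instead of silently degrading to leader-only acks.
min.insync.replicas=2
```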

Tags: High Availability, Kafka, Replication, ISR, acks, consumer offset
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
