Why Does a Single Kafka Broker Failure Break the Whole Cluster?

This article explains why a single Kafka broker failure can render the whole cluster unusable, detailing Kafka’s multi‑replica design, leader election, ISR mechanism, acknowledgment settings, and how to configure replication factors and consumer offset topics to achieve true high availability.

Java High-Performance Architecture

Kafka Crash Triggers High Availability Issues

The problem started with a Kafka outage.

The author works at a fintech company that uses Kafka (instead of RabbitMQ) for log processing. The cluster normally ran stably, but whenever one of the three broker nodes went down, the entire consumer group stopped receiving messages.

Kafka's Multi‑Replica Redundancy Design

Physical Model

Logical Model

Broker (node): a Kafka server; each broker is a physical node.

Topic: messages are grouped by topic name; producers send to a topic, consumers read from it.

Partition: each topic is split into one or more partitions; each replica of a partition is stored on a single broker.

Offset: the position of a message within a partition; consumers use offsets to track what they have read.
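The terms above can be sketched as a toy in-memory model. This is only an illustration: the class names, the `app-logs` topic, and the byte-sum partitioner are assumptions made for the example (Kafka's default partitioner hashes keys with murmur2), and brokers and replication are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """An append-only list of messages; the list index plays the role of the offset."""
    messages: list = field(default_factory=list)

    def append(self, message: str) -> int:
        self.messages.append(message)
        return len(self.messages) - 1  # offset assigned to the new message

    def read_from(self, offset: int) -> list:
        # A consumer resumes from the offset it last recorded.
        return self.messages[offset:]


@dataclass
class Topic:
    name: str
    partitions: list

    def send(self, key: str, message: str):
        # Toy partitioner (hypothetical): byte sum of the key modulo partition count.
        index = sum(key.encode("utf-8")) % len(self.partitions)
        return index, self.partitions[index].append(message)


logs = Topic("app-logs", [Partition() for _ in range(3)])
first = logs.send("svc-a", "started")
second = logs.send("svc-a", "ready")
```

Messages with the same key land in the same partition, and offsets grow by one per message within that partition, which is why ordering is only guaranteed per partition.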

Before Kafka 0.8 there was no replication; a broker failure made all its partitions unavailable. Since 0.8, each partition has a leader replica and, when the replication factor is greater than 1, one or more follower replicas. By default, producers and consumers interact only with the leader; followers replicate data from the leader.

If a broker crashes, its partitions still have replicas on the remaining brokers. If the crashed broker was the leader, a new leader is elected from the in‑sync replica (ISR) list, and the cluster continues operating.

Key questions:

How many replicas are enough? Typically three replicas provide high availability; the replication-factor can be increased if needed.

What if followers are not fully synchronized? Kafka uses the ISR mechanism; only followers that have caught up with the leader within replica.lag.time.max.ms stay in the ISR list.

How is a new leader elected after a crash? The controller, a broker that coordinates the cluster, picks the first replica in the partition's replica list that is both alive and in the ISR, which ensures each partition has only one leader.
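The ISR and election rules above can be sketched in a few lines. This is a simplified model, not Kafka's controller code: the function names and the lag figures are hypothetical, and the 30000 ms window is just an illustrative value for replica.lag.time.max.ms.

```python
def shrink_isr(isr, lag_ms, max_lag_ms):
    """Keep only replicas whose lag behind the leader is within the allowed window."""
    return [broker for broker in isr if lag_ms.get(broker, float("inf")) <= max_lag_ms]


def elect_leader(assigned_replicas, isr, live_brokers):
    """Pick the first assigned replica that is both alive and in the ISR."""
    for broker in assigned_replicas:
        if broker in isr and broker in live_brokers:
            return broker
    return None  # no eligible replica: the partition is offline


assigned = [1, 2, 3]                 # broker ids holding this partition
lag = {1: 0, 2: 500, 3: 45_000}      # hypothetical lag per replica, in ms
isr = shrink_isr(assigned, lag, max_lag_ms=30_000)

# Broker 1 (the current leader) crashes; brokers 2 and 3 are still up.
new_leader = elect_leader(assigned, isr, live_brokers={2, 3})
```

In this run broker 3 has fallen out of the ISR, so broker 2 is the only eligible successor. If the ISR had contained only the crashed broker, `elect_leader` would return `None`.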

Ack Parameter Determines Reliability

The producer configuration acks (request.required.acks in the legacy Scala producer) controls how many acknowledgments are required before a send is considered successful. It has three possible values:

0: fire‑and‑forget; the producer does not wait for any acknowledgment, so messages may be lost.

1 (the default before Kafka 3.0): the leader must acknowledge; if the leader crashes before followers replicate, the message can be lost.

all (or -1; the default since Kafka 3.0): every replica currently in the ISR must have the message before the leader acknowledges; this provides the strongest durability guarantee, provided the ISR holds at least two replicas, which min.insync.replicas=2 enforces.
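To make the difference between the three settings concrete, the sketch below counts how many replicas are guaranteed to hold a message at the moment the producer treats the send as successful. It is a deliberately simplified model with hypothetical function names; it ignores retries, timing, and idempotence.

```python
def guaranteed_copies(acks, isr_size, min_insync_replicas=1):
    """Replicas guaranteed to hold the message when the send counts as successful.

    Returns None when the broker would reject the write.
    """
    if acks == "0":
        return 0            # producer does not wait; nothing is guaranteed
    if acks == "1":
        return 1            # only the leader's copy is guaranteed
    if acks in ("all", "-1"):
        if isr_size < min_insync_replicas:
            return None     # too few in-sync replicas: write is rejected
        return isr_size     # every ISR member has the message
    raise ValueError(f"unknown acks value: {acks!r}")


def survives_leader_crash(copies):
    """The message outlives the leader only if another replica also holds it."""
    return copies is not None and copies >= 2
```

With three in-sync replicas, `acks="all"` guarantees three copies. With the ISR shrunk to the leader alone and min.insync.replicas left at 1, `acks="all"` guarantees no more than `acks="1"` does, which is the caveat about needing at least two ISR replicas.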

Problem Solving

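In the author's test environment there are 3 brokers, a topic with replication factor 3 and 6 partitions, and acks=1. When one broker fails, the controller elects a new leader from the ISR for each partition that broker led. If the ISR is empty, a leader is chosen from the remaining out-of-sync replicas only when unclean.leader.election.enable=true, and data loss is then possible; otherwise the partition stays offline until an ISR member returns.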
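The test setup can be modeled roughly as follows. The round-robin placement is an assumption for illustration (Kafka's real assignment also randomizes the starting broker), broker ids 1 to 3 are made up, and every replica is assumed to be in the ISR.

```python
def assign_replicas(brokers, num_partitions, replication_factor):
    """Simplified round-robin placement; the first replica in each list is the leader."""
    return {
        p: [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        for p in range(num_partitions)
    }


def leaders_after_failure(assignment, failed_broker):
    """New leader per partition: the first surviving replica, or None if none is left."""
    leaders = {}
    for partition, replicas in assignment.items():
        survivors = [b for b in replicas if b != failed_broker]
        leaders[partition] = survivors[0] if survivors else None
    return leaders


assignment = assign_replicas([1, 2, 3], num_partitions=6, replication_factor=3)
after = leaders_after_failure(assignment, failed_broker=1)
```

With replication factor 3 on 3 brokers, every partition keeps two replicas after a single failure, so every partition still gets a leader. The user topic was therefore not the cause of the outage.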

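The root cause of the observed total consumer outage was the internal topic __consumer_offsets, where Kafka stores each consumer group's committed offsets. In the author's cluster it had 50 partitions (the default) but a replication factor of only 1, for example because offsets.topic.replication.factor was left at 1, the value in Kafka's sample server.properties. With all of its partitions placed on a single broker, that broker became a single point of failure. When it went down, consumer offsets were unavailable, halting consumption.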
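Each consumer group depends on exactly one partition of the offsets topic. The sketch below assumes the mapping is the absolute value of the Java `hashCode` of the group id, modulo the offsets topic's partition count; treat that formula as an assumption to check against your Kafka version. The helper names are hypothetical.

```python
def java_string_hashcode(s):
    """Emulate Java's String.hashCode() (valid for characters in the Basic Multilingual Plane)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF       # wrap to 32 bits like a Java int
    return h - 0x100000000 if h >= 0x80000000 else h  # reinterpret as signed


def offsets_partition_for(group_id, num_partitions=50):
    """Partition of the offsets topic assumed to hold this group's committed offsets."""
    h = java_string_hashcode(group_id)
    positive = 0 if h == -2**31 else abs(h)       # abs() is undefined for Integer.MIN_VALUE in Java
    return positive % num_partitions


partition = offsets_partition_for("hello")
```

If the broker hosting that one partition is down and the partition has no other replica, the group can neither commit nor fetch offsets, even when every user topic is fully replicated.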

Solutions:

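Delete the __consumer_offsets topic so that it can be recreated (it cannot be deleted with the normal topic tooling, so its log files have to be removed instead). Note that this discards all committed consumer offsets.

Set offsets.topic.replication.factor=3 so that the recreated offsets topic has three replicas, eliminating the single‑point‑of‑failure. The setting is applied only when the topic is created, which is why the existing topic has to be removed first.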
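A quick way to catch this misconfiguration is to scan a broker's properties file before deployment. The following is a minimal sketch: the function names and the sample text are hypothetical, and the fallback values (3 for offsets.topic.replication.factor, 1 for default.replication.factor) are the broker defaults as I understand them, so verify them for your version.

```python
# Assumed broker defaults used when a key is absent from the file.
ASSUMED_DEFAULTS = {
    "offsets.topic.replication.factor": 3,
    "default.replication.factor": 1,
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def under_replicated_settings(props, wanted=3):
    """Return the replication settings whose effective value is below `wanted`."""
    return [
        key for key, default in ASSUMED_DEFAULTS.items()
        if int(props.get(key, default)) < wanted
    ]


sample = """
broker.id=0
# quick-start style value
offsets.topic.replication.factor=1
"""
flagged = under_replicated_settings(parse_properties(sample))
```

Here both settings are flagged: the offsets topic is explicitly set to one replica, and automatically created topics fall back to a single replica as well.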

After applying these changes, the cluster remains functional even when one broker is down, confirming that proper replication of both user topics and internal offset topics is essential for true Kafka high availability.


Written by

Java High-Performance Architecture
