
Why Does a Single Kafka Broker Crash Bring Down All Consumers?

An in-depth look at Kafka's high-availability mechanisms: how multi-replica design, ISR leader election, and the producer acks setting (request.required.acks in the legacy producer) interact, why the failure of a single broker can halt all consumption when that broker holds the only replicas of the __consumer_offsets topic, and how to configure replication factors to prevent such outages.

dbaplus Community

1. Kafka Outage and High‑Availability Issue

The article begins with a real incident where one of three Kafka brokers went down, causing all consumers to stop receiving messages despite the remaining two brokers being up. This prompts an investigation into Kafka’s high‑availability design.

2. Kafka’s Multi‑Replica Redundancy Design

Kafka achieves HA through replication of each partition across multiple brokers. Key concepts are introduced:

Broker (Node): a Kafka server process; in a typical deployment each broker runs on its own physical or virtual machine.

Topic: a logical category for messages; producers write to a topic name, consumers read from it.

Partition: a topic is split into one or more ordered partitions; each partition replica is stored on a broker.

Offset: the position of a message within a partition, used by consumers to track progress.
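The relationship between these concepts can be sketched with a toy in-memory model. This is only an illustration, not Kafka's implementation: the `Topic` and `Partition` classes and the byte-sum partitioner are hypothetical (Kafka's default partitioner hashes the key bytes differently), and the topic name `orders` is made up.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """An append-only log; a record's offset is its index in the log."""
    records: list = field(default_factory=list)

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset assigned to the new record

    def read_from(self, offset):
        return self.records[offset:]


@dataclass
class Topic:
    name: str
    partitions: list

    def send(self, key, value):
        # Hypothetical partitioner for illustration only.
        index = sum(key.encode()) % len(self.partitions)
        return index, self.partitions[index].append(value)


topic = Topic("orders", [Partition() for _ in range(3)])
p, first = topic.send("user-1", "created")
_, second = topic.send("user-1", "paid")

# A consumer that processed the first record commits the next offset
# and later resumes reading from that position.
committed = first + 1
remaining = topic.partitions[p].read_from(committed)
```

Records with the same key land in the same partition, so their offsets are ordered relative to each other; ordering across partitions is not defined.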

Before version 0.8 Kafka had no replication; a broker failure made all of its partitions unavailable. Since 0.8, each partition has a Leader replica and one or more Follower replicas. Producers always write to the leader, and consumers read from it by default; followers replicate data from the leader.

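When a broker fails, the partitions it led need a new leader, which is elected from the in-sync replica (ISR) list. If no ISR member survives, the partition stays offline unless unclean.leader.election.enable is set to true (it is disabled by default in current versions), in which case an out-of-sync surviving replica can become leader at the risk of data loss.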

The ISR mechanism ensures that only replicas that are sufficiently up-to-date remain in the list. Followers that have not caught up with the leader within replica.lag.time.max.ms are removed, which prevents them from being elected as leader.
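A minimal sketch of ISR membership and ISR-based election follows. It makes two simplifying assumptions that differ from a real broker: lag is measured in records rather than time, and the new leader is simply the first live ISR member in assignment order. The function names `compute_isr` and `elect_leader` are hypothetical.

```python
def compute_isr(leader_end_offset, replica_end_offsets, max_lag):
    """Return the brokers whose log is within max_lag records of the leader."""
    return [
        broker
        for broker, end in replica_end_offsets.items()
        if leader_end_offset - end <= max_lag
    ]


def elect_leader(assigned_replicas, isr, live_brokers, allow_unclean=False):
    """Prefer a live ISR member; optionally fall back to any live replica."""
    for broker in assigned_replicas:
        if broker in isr and broker in live_brokers:
            return broker
    if allow_unclean:
        # Unclean election: the chosen replica may be missing records.
        for broker in assigned_replicas:
            if broker in live_brokers:
                return broker
    return None  # no eligible replica: the partition stays offline


end_offsets = {1: 100, 2: 98, 3: 40}  # broker 1 is the current leader
isr = compute_isr(100, end_offsets, max_lag=10)

after_one_failure = elect_leader([1, 2, 3], isr, live_brokers={2, 3})
after_two_failures = elect_leader([1, 2, 3], isr, live_brokers={3})
unclean = elect_leader([1, 2, 3], isr, live_brokers={3}, allow_unclean=True)
```

In this example broker 3 has fallen out of the ISR, so it is only ever chosen when the unclean fallback is allowed, and choosing it would discard the records it never replicated.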

Leader election in Kafka relies on a controller that guarantees a single leader at any time, avoiding split‑brain scenarios.

3. Acknowledgment (acks) Parameter Determines Reliability

The producer's acks setting (named request.required.acks in the legacy Scala producer) controls when a send is considered successful:

0: fire-and-forget; the producer does not wait for any acknowledgment, risking message loss.

1: the producer waits only for the leader's acknowledgment; if the leader crashes before followers replicate, the message can be lost. This was the Java producer's default before Kafka 3.0; from 3.0 onward the default is all.

all (or -1): the producer waits until every replica in the ISR has the record, providing the strongest durability guarantee. However, if the ISR has shrunk to only the leader, this behaves like acks=1.

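Choosing acks=all together with a replication factor of at least three gives high durability at a moderate throughput cost. Setting min.insync.replicas=2 on the topic additionally makes acks=all writes fail, rather than silently succeed, once the ISR shrinks to the leader alone.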
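The difference between the three acks levels can be illustrated with a small simulation of a leader crash. This is a toy model under stated assumptions, not the producer protocol: the `PartitionReplicas` class is hypothetical, replication is reduced to a boolean flag, and a real producer with acks=all would wait or time out instead of receiving an immediate negative result.

```python
class PartitionReplicas:
    """Toy model of one partition with a leader and ISR followers."""

    def __init__(self, leader, followers):
        self.leader = leader
        self.logs = {b: [] for b in [leader, *followers]}
        self.isr = set(self.logs)

    def produce(self, record, acks, followers_caught_up):
        """Append to the leader; return True if the producer sees success."""
        self.logs[self.leader].append(record)
        if followers_caught_up:
            for broker in self.isr - {self.leader}:
                self.logs[broker].append(record)
        if acks in (0, 1):
            # 0: never waits; 1: the leader's own write is enough.
            return True
        # "all": every ISR member must hold the record.
        return all(record in self.logs[b] for b in self.isr)

    def fail_leader(self):
        self.isr.discard(self.leader)
        del self.logs[self.leader]
        self.leader = min(self.isr)  # simplistic election among the ISR


def acked_but_lost(acks):
    """Leader crashes before followers replicate; was an acked record lost?"""
    p = PartitionReplicas(leader=1, followers=[2, 3])
    acked = p.produce("m1", acks, followers_caught_up=False)
    p.fail_leader()
    return acked and "m1" not in p.logs[p.leader]


results = {acks: acked_but_lost(acks) for acks in (0, 1, "all")}

# With the ISR reduced to the leader, acks="all" succeeds immediately.
solo = PartitionReplicas(leader=1, followers=[])
solo_acked = solo.produce("m2", "all", followers_caught_up=False)
```

In this model only acks=0 and acks=1 can report success for a record that the new leader does not have. The `solo` case shows why acks=all alone is not enough when the ISR contains just the leader.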

4. Solving the Consumer Offset Problem

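In the test environment the author used three brokers, a topic replication factor of three, six partitions, and acks=1. When one broker failed, consumption stopped because the internal __consumer_offsets topic, which stores committed consumer offsets, was configured with a replication factor of 1 and all of its partitions were on the failed broker. With the only replica of every offsets partition gone, consumer groups could neither find their group coordinator nor read or commit offsets, even though the data topic itself still had live replicas.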
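Why one broker can strand every consumer group is easier to see with the mapping from group ID to offsets partition. The sketch below assumes the commonly described scheme, in which the group ID's Java string hash is taken modulo the number of offsets partitions (50 by default, per offsets.topic.num.partitions) and the leader of that partition acts as the group's coordinator. The helper names, group IDs, and broker IDs are hypothetical, and the availability check ignores ISR details.

```python
def java_string_hash(s):
    """Emulate Java's String.hashCode() for characters in the BMP."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def offsets_partition_for(group_id, num_partitions=50):
    """Assumed mapping: abs(hash(group_id)) % number of offsets partitions."""
    h = java_string_hash(group_id)
    h = 0 if h == -(1 << 31) else abs(h)
    return h % num_partitions


def groups_without_coordinator(group_ids, replicas_by_partition, live_brokers):
    """Groups whose offsets partition has no replica on a live broker."""
    stranded = []
    for group in group_ids:
        partition = offsets_partition_for(group, len(replicas_by_partition))
        if not any(b in live_brokers for b in replicas_by_partition[partition]):
            stranded.append(group)
    return stranded


groups = ["billing", "search-indexer", "audit"]
live = {1, 2}  # broker 0 has crashed

# Replication factor 1 with every offsets partition on broker 0.
single_replica = {p: [0] for p in range(50)}
stranded_rf1 = groups_without_coordinator(groups, single_replica, live)

# Replication factor 3 across brokers 0, 1, and 2.
three_replicas = {p: [0, 1, 2] for p in range(50)}
stranded_rf3 = groups_without_coordinator(groups, three_replicas, live)
```

With a single replica on the failed broker, every group is stranded regardless of which partition it hashes to; with three replicas, none is.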

To fix this:

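Increase the replication factor of __consumer_offsets by setting offsets.topic.replication.factor=3 in the broker configuration (server.properties).

This setting only takes effect when the topic is first created. If __consumer_offsets already exists with a single replica, either reassign its partitions to add replicas or, in a test environment where losing committed offsets is acceptable, delete the existing __consumer_offsets data (e.g., by removing its log directories) so that the topic is recreated with the new replication factor.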
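As an alternative to deleting offset data, the replica count of an existing topic can be raised with a partition reassignment. The sketch below only generates the JSON plan in the format accepted by the kafka-reassign-partitions tool; the broker IDs 0, 1, and 2, the 50-partition count, and the `build_reassignment` helper are assumptions to adapt to the actual cluster.

```python
import json


def build_reassignment(topic, num_partitions, broker_ids):
    """Build a reassignment plan that places every partition on all brokers."""
    partitions = []
    for p in range(num_partitions):
        # Rotate the broker list so preferred leaders are spread evenly.
        start = p % len(broker_ids)
        replicas = broker_ids[start:] + broker_ids[:start]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


plan = build_reassignment("__consumer_offsets", 50, [0, 1, 2])
plan_json = json.dumps(plan, indent=2)
```

The resulting JSON would be saved to a file and passed to kafka-reassign-partitions with --reassignment-json-file and --execute. It is worth checking the current assignment first so that the existing replica of each partition stays in its new replica list.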

After the change, the offsets topic is replicated across all three brokers, which removes the single point of failure: consumers keep making progress when one broker crashes.

In summary, Kafka's HA relies on multi-replica partitions, ISR-based leader election, and appropriate acks settings; ensuring that critical internal topics like __consumer_offsets also have sufficient replication is essential to avoid complete consumption outages.
