Why Does One Kafka Broker Failure Halt All Consumers? HA & Replication Explained

This article examines Kafka’s high‑availability mechanisms: its multi‑replica design, ISR synchronization, leader election, and the critical role of the __consumer_offsets topic. It explains why a single broker outage can stop consumers across the cluster unless replication factors are properly configured.

MaGe Linux Operations

1. Kafka Outage and HA Issue

The discussion starts with a Kafka broker crash that caused consumers to stop receiving messages, highlighting a puzzling high‑availability problem despite having three broker nodes.

2. Kafka Multi‑Replica Redundancy Design

Key concepts of Kafka are introduced:

Broker (node): a Kafka server instance.

Topic: a logical grouping of messages; producers and consumers use the topic name to write and read.

Partition: a topic is split into one or more partitions; each partition replica is stored on a single broker, and the partitions of a topic are spread across the brokers.

Offset: the position of a message within a partition, used by consumers to track progress.
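The relationship between topics, partitions, and offsets can be sketched as a toy in-memory model. This is only an illustration, not Kafka's storage code; the class names `Partition` and `Topic` and the topic name `orders` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """An append-only log; a record's offset is its index in the log."""
    records: list = field(default_factory=list)

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset assigned to the new record

    def read_from(self, offset):
        return self.records[offset:]


@dataclass
class Topic:
    name: str
    partitions: list


# A topic with two partitions; offsets are counted separately per partition.
topic = Topic("orders", [Partition(), Partition()])
first = topic.partitions[0].append("order-1")
second = topic.partitions[0].append("order-2")
other = topic.partitions[1].append("order-3")

# A consumer that has processed offset 0 of partition 0 resumes at offset 1.
committed = {0: 1, 1: 0}  # partition index -> next offset to read
pending = topic.partitions[0].read_from(committed[0])
```

The `committed` mapping is the piece of state that a consumer group must store somewhere durable, which is why the offsets topic discussed later matters so much.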

3. Replication Mechanism

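Kafka 0.8 introduced replication: each partition has a replica set consisting of one Leader and zero or more Followers, placed on different brokers. Producers and consumers interact only with the Leader; Followers pull data from the Leader to stay synchronized. Kafka maintains an In‑Sync Replica (ISR) list containing the Leader and the Followers that are sufficiently up to date. A Follower that falls too far behind is removed from the ISR and rejoins once it has caught up. If the Leader’s broker fails, a new Leader is elected from the remaining ISR members, so the partition stays available as long as at least one in‑sync replica survives.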
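A minimal sketch of ISR maintenance and leader election follows. It simplifies heavily: lag is measured here as a record count (`max_lag`, a hypothetical parameter), whereas real brokers judge lag by time via `replica.lag.time.max.ms`. The names `Replica`, `in_sync_replicas`, and `elect_leader` are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    broker_id: int
    log_end_offset: int
    alive: bool = True


def in_sync_replicas(leader, followers, max_lag):
    """Return broker ids of the leader plus followers within max_lag records."""
    isr = [leader.broker_id]
    for f in followers:
        if f.alive and leader.log_end_offset - f.log_end_offset <= max_lag:
            isr.append(f.broker_id)
    return isr


def elect_leader(isr, replicas):
    """Pick the first live replica in the ISR as the new leader, or None."""
    alive = {r.broker_id for r in replicas if r.alive}
    for broker_id in isr:
        if broker_id in alive:
            return broker_id
    return None


leader = Replica(broker_id=0, log_end_offset=100)
followers = [Replica(1, 100), Replica(2, 60)]

isr = in_sync_replicas(leader, followers, max_lag=10)  # broker 2 lags by 40
leader.alive = False                                   # the leader's broker crashes
new_leader = elect_leader(isr, [leader] + followers)
```

With a single replica the ISR contains only the leader, so `elect_leader` has no candidate left after a crash. That is the failure mode the rest of the article turns on.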

4. Ack Parameter and Reliability

The producer’s acks setting (request.required.acks in the legacy producer) determines how many acknowledgments are required for a send to be considered successful:

0 – fire‑and‑forget; the producer does not wait for any acknowledgment, so messages may be lost.

1 – only the Leader’s receipt is required; if the Leader fails before Followers sync, data can be lost.

all (or -1) – the Leader waits until every replica in the ISR has the message; this provides the strongest durability. Because the ISR can shrink to the Leader alone, min.insync.replicas should be set to at least 2 so that an acknowledged message always exists on more than one broker.
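The three settings can be summarized as a small decision function. This is a sketch of the semantics described above, not producer or broker code; `send_outcome` and its return strings are hypothetical, and `min_insync_replicas` stands in for the `min.insync.replicas` topic or broker setting.

```python
def send_outcome(acks, isr_size, min_insync_replicas=1):
    """Describe what a send guarantees under each acks setting.

    acks: 0, 1, or "all" (equivalent to -1).
    isr_size: number of in-sync replicas, leader included.
    """
    if acks == 0:
        return "not acknowledged; may be lost silently"
    if acks == 1:
        return "acknowledged by leader only; lost if leader fails before followers copy it"
    if acks in ("all", -1):
        if isr_size < min_insync_replicas:
            return "rejected: not enough in-sync replicas"
        if isr_size == 1:
            return "acknowledged, but stored on one replica only"
        return f"acknowledged by {isr_size} in-sync replicas"
    raise ValueError(f"unsupported acks value: {acks!r}")


# With acks=all but only the leader in sync, durability depends on min.insync.replicas.
print(send_outcome("all", isr_size=1))
print(send_outcome("all", isr_size=1, min_insync_replicas=2))
print(send_outcome(-1, isr_size=3, min_insync_replicas=2))
```

The second call shows the trade-off: requiring two in-sync replicas turns a silent durability gap into an explicit write failure.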

5. Root Cause: __consumer_offsets Topic

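The built‑in __consumer_offsets topic stores the offsets that consumer groups commit. By default it has 50 partitions. In the affected cluster its replication factor was 1, which made it a single point of failure. (The broker default for offsets.topic.replication.factor is 3, but the sample server.properties shipped with Kafka sets it to 1, and brokers older than 0.11 would create the topic with fewer replicas if fewer brokers were running when it was first created.) When the broker holding these partitions crashes, the affected consumer groups can no longer fetch or commit offsets, so they stop consuming; if every partition lives on that one broker, all consumers stop.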
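To see why one broker can take out every consumer group, the sketch below maps group ids to offsets partitions and checks which groups lose all replicas after a failure. The mapping (absolute value of Java's `String.hashCode()` modulo the partition count) is believed to mirror what the broker does, but treat it as an assumption and verify it against your Kafka version. The group names, broker ids, and both placement layouts are hypothetical.

```python
def java_string_hash(s):
    """Reproduce Java's String.hashCode() as a signed 32-bit integer.

    Valid for characters in the Basic Multilingual Plane (e.g. ASCII group ids).
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def offsets_partition(group_id, num_partitions=50):
    """Partition of the offsets topic assumed to hold a group's offsets."""
    h = java_string_hash(group_id)
    positive = 0 if h == -(1 << 31) else abs(h)
    return positive % num_partitions


def unavailable_groups(groups, partition_replicas, failed_broker):
    """Groups whose offsets partition has no replica left on a live broker."""
    lost = []
    for g in groups:
        replicas = partition_replicas[offsets_partition(g)]
        if all(b == failed_broker for b in replicas):
            lost.append(g)
    return lost


brokers = [0, 1, 2]
# Assumption: every partition was placed on broker 0 with a single replica.
rf1 = {p: [0] for p in range(50)}
# Assumption: round-robin placement with three replicas per partition.
rf3 = {p: [brokers[(p + i) % 3] for i in range(3)] for p in range(50)}

groups = ["billing", "search-indexer", "audit"]
print(unavailable_groups(groups, rf1, failed_broker=0))  # all three groups
print(unavailable_groups(groups, rf3, failed_broker=0))  # []
```

If the single-replica partitions were spread over several brokers instead, only the groups hashing to partitions on the failed broker would be affected.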

6. Solution

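Delete the existing __consumer_offsets topic (e.g., by removing its log files) and let Kafka recreate it. This discards all committed offsets, so consumer groups restart from the position given by auto.offset.reset.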

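Set offsets.topic.replication.factor to 3 before the topic is recreated, so that the offsets topic has three replicas, one per broker. The setting applies only when the topic is created, which is why changing the configuration alone does not fix an existing topic.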
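An alternative to deleting the topic, not described in the original write-up, is to raise the replication factor of the existing topic with a partition reassignment. The sketch below generates a plan in the JSON format accepted by the `kafka-reassign-partitions.sh` tool through its `--reassignment-json-file` option. The broker ids 0, 1, 2 and the round-robin layout are assumptions; adjust them to the actual cluster.

```python
import json


def reassignment_plan(topic, num_partitions, broker_ids, replication_factor):
    """Build a partition reassignment plan that spreads replicas round-robin."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor exceeds number of brokers")
    partitions = []
    for p in range(num_partitions):
        # The first broker in each list is the preferred leader, so rotating
        # the starting broker also spreads leadership across the cluster.
        replicas = [broker_ids[(p + i) % len(broker_ids)]
                    for i in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


plan = reassignment_plan("__consumer_offsets", 50, [0, 1, 2], 3)
plan_json = json.dumps(plan, indent=2)  # save this text to a file for the tool
```

This approach keeps the committed offsets, at the cost of replication traffic while the new replicas catch up.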

7. Conclusion

When both regular topics and the __consumer_offsets topic have a replication factor greater than 1 (3 in this three‑broker cluster, one replica per broker), Kafka’s multi‑replica design provides high availability: the cluster keeps serving producers and consumers after a single broker failure.
