Why a Single Kafka Broker Crash Can Stop All Consumers – High‑Availability Explained

An in‑depth look at Kafka’s multi‑replica architecture, ISR mechanism, and ack settings shows why a single broker failure can halt consumer reads when it takes the __consumer_offsets topic offline, and how appropriate replication factors and configuration restore high availability.

MaGe Linux Operations

The story starts with a Kafka outage at a fintech company that uses Kafka, rather than RabbitMQ, for log processing.

The cluster normally ran stably, but consumers occasionally failed: whenever one of the three broker nodes went down, the entire consumer group stopped receiving messages.

Kafka’s Multi‑Replica Redundancy Design

High availability in distributed systems such as ZooKeeper, Redis, Kafka, and HDFS is typically achieved through redundancy. The key Kafka concepts are:

Broker (node): a Kafka server instance.

Topic: a logical category for messages; producers send to a topic by name, and consumers read from it.

Partition: a topic is split into one or more partitions; each partition is stored on a broker, and its replicas are stored on other brokers.

Offset: the position of a message within a partition, used by consumers to track how far they have read.
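The relationship between a partition and an offset can be sketched with a toy append-only log. This is only an illustration of the concepts above, not Kafka's implementation; the `Partition` and `Consumer` classes and the sample messages are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy append-only log; an offset is simply an index into it."""
    records: list = field(default_factory=list)

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset assigned to the new record

    def read(self, offset, max_records=10):
        return self.records[offset:offset + max_records]


@dataclass
class Consumer:
    """Remembers the next offset to read (hypothetical stand-in for a committed offset)."""
    position: int = 0

    def poll(self, partition, max_records=10):
        batch = partition.read(self.position, max_records)
        self.position += len(batch)
        return batch


p = Partition()
for msg in ["login", "click", "logout"]:
    p.append(msg)

c = Consumer()
first = c.poll(p, max_records=2)
second = c.poll(p, max_records=2)
print(first, second, c.position)  # ['login', 'click'] ['logout'] 3
```

If the consumer's position is lost, it no longer knows where to resume, which is why the storage of offsets matters later in this article.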

Every partition keeps replicas on other brokers, so a single broker failure does not make its data unavailable. If the failed broker hosted a partition's leader replica, one of the followers is elected as the new leader, allowing producers and consumers to continue operating.

Common questions include how many replicas are sufficient, what happens if followers are not fully synchronized, and the leader election rules after a node crash.

How many replicas are enough? More replicas increase availability but consume more resources; a replication factor of 3 is generally sufficient.

What if followers and the leader are not fully synchronized? Kafka uses an In‑Sync Replica (ISR) list. Only replicas in the ISR are considered up‑to‑date; out‑of‑sync followers are removed from the ISR.

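Leader election after a node failure? One broker acts as the controller and is responsible for leader election, which ensures that each partition has a single leader. When a leader goes down, the controller selects the first replica in the partition's replica list that is alive and in the ISR as the new leader.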
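The ISR and election rules described above can be sketched as a small simulation. It is a simplified model under stated assumptions: the `PartitionState` class and both functions are hypothetical names, lag detection is reduced to an explicit call, and the real controller handles many more cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartitionState:
    replicas: list                 # assigned brokers, in preference order
    isr: list                      # subset currently in sync with the leader
    leader: Optional[int] = None


def shrink_isr(state, lagging_broker):
    """Drop a follower that has fallen too far behind the leader."""
    if lagging_broker != state.leader and lagging_broker in state.isr:
        state.isr.remove(lagging_broker)


def elect_leader(state, live_brokers):
    """Pick the first assigned replica that is both alive and in the ISR."""
    for broker in state.replicas:
        if broker in live_brokers and broker in state.isr:
            state.leader = broker
            state.isr = [b for b in state.isr if b in live_brokers]
            return broker
    state.leader = None            # no eligible replica: partition is offline
    return None


state = PartitionState(replicas=[1, 2, 3], isr=[1, 2, 3], leader=1)
shrink_isr(state, 2)                                    # broker 2 falls behind
new_leader = elect_leader(state, live_brokers={2, 3})   # broker 1 crashes
print(new_leader, state.isr)                            # 3 [3]
```

Broker 2 is alive but is skipped because it is no longer in the ISR, so broker 3 becomes the leader. With a single replica there is nothing to elect, and the partition goes offline.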

Ack Parameter Determines Reliability

The producer’s acks setting (request.required.acks in the legacy Scala producer) controls how many acknowledgments are required for a send to be considered successful:

0 – fire‑and‑forget; messages may be lost.

1 – only the leader’s acknowledgment is required; if the leader crashes before followers sync, messages can be lost.

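all (or -1) – every replica in the ISR must acknowledge; this provides the highest durability, but the guarantee is only meaningful when the ISR contains at least two replicas.

Even with acks=all, if the ISR has shrunk to only the leader, the guarantee degrades to that of acks=1. The broker-side min.insync.replicas setting guards against this: when it is set to 2 or more, such writes are rejected instead of being acknowledged by the leader alone.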
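The three ack levels can be compared with a worst-case model: which replicas are guaranteed to hold a record at the moment the producer sees success, and does any of them outlive the leader? This is a sketch of the reasoning above, not of the Kafka client; both function names are hypothetical.

```python
def replicas_holding_record_at_ack(acks, leader, isr):
    """Replicas guaranteed to hold a record when the producer sees success."""
    if acks == 0:
        return set()          # no broker response is awaited
    if acks == 1:
        return {leader}       # the leader has written it; followers may lag
    if acks in (-1, "all"):
        return set(isr)       # every current ISR member has it
    raise ValueError(f"unsupported acks value: {acks!r}")


def survives_leader_crash(acks, leader, isr):
    """True if an acknowledged record is guaranteed to outlive the leader."""
    holders = replicas_holding_record_at_ack(acks, leader, isr)
    return bool(holders - {leader})


cases = [(0, {1, 2, 3}), (1, {1, 2, 3}), ("all", {1, 2, 3}), ("all", {1})]
for acks, isr in cases:
    print(acks, sorted(isr), survives_leader_crash(acks, leader=1, isr=isr))
# 0 [1, 2, 3] False
# 1 [1, 2, 3] False
# all [1, 2, 3] True
# all [1] False
```

The last case shows the degradation: with only the leader left in the ISR, acks=all offers no more protection than acks=1.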

Resolving the Consumer Outage

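The test environment has 3 brokers and a topic with replication factor 3, 6 partitions, and acks=1. When one broker fails, the cluster elects new leaders from the ISR, so that topic stays available. The problem lies in the built‑in __consumer_offsets topic, which stores committed consumer offsets. In this cluster it had a replication factor of 1 (the value set in the sample server.properties shipped with Kafka; the broker's own default for offsets.topic.replication.factor is 3) and the default 50 partitions. Each consumer group's offsets live in one of those partitions, so when the broker holding the only copy goes down, the group can no longer fetch or commit offsets and stops consuming. That makes __consumer_offsets a single point of failure.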
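To see why one failed broker can stall a whole consumer group, it helps to look at how a group is tied to a single offsets partition. The sketch below assumes the broker computes the absolute value of the Java string hash of the group id, modulo the partition count; treat the exact formula as an assumption for your Kafka version. The group names are hypothetical.

```python
def java_string_hash(s):
    """Reproduce Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    data = s.encode("utf-16-be")           # Java hashes UTF-16 code units
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def offsets_partition_for(group_id, num_partitions=50):
    """Assumed mapping from a consumer group to its offsets partition."""
    h = java_string_hash(group_id)
    h = 0 if h == -(1 << 31) else abs(h)   # avoid abs(Integer.MIN_VALUE)
    return h % num_partitions


for group in ["log-etl", "billing-audit"]:  # hypothetical group ids
    print(group, "->", offsets_partition_for(group))
```

Every consumer in a group resolves to the same partition, so if that partition has no surviving replica, the whole group is affected at once.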

To fix the issue:

Delete the existing __consumer_offsets topic. It is a built‑in topic that cannot be deleted with the topic command, so its log data has to be cleared instead.

Set offsets.topic.replication.factor=3 so that __consumer_offsets is recreated with three replicas and the offset data is replicated across all brokers.

Once the offsets topic is replicated three ways, consumer offsets are no longer a single point of failure, and consumers keep working even when one broker crashes.
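The effect of the change can be checked with a small model of replica placement. It assumes a simple round-robin assignment, which is a simplification of Kafka's real placement, and it assumes every replica is in sync; `assign_replicas` and `offline_partitions` are hypothetical helpers.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Round-robin placement; a simplification of Kafka's real assignment."""
    n = len(brokers)
    return {
        p: [brokers[(p + i) % n] for i in range(replication_factor)]
        for p in range(num_partitions)
    }


def offline_partitions(assignment, failed_broker):
    """Partitions whose every replica lived on the failed broker."""
    return [
        p for p, replicas in assignment.items()
        if all(b == failed_broker for b in replicas)
    ]


brokers = [0, 1, 2]
rf1 = assign_replicas(50, brokers, replication_factor=1)
rf3 = assign_replicas(50, brokers, replication_factor=3)

print(len(offline_partitions(rf1, failed_broker=0)))  # 17
print(len(offline_partitions(rf3, failed_broker=0)))  # 0
```

With a replication factor of 1, roughly a third of the 50 offsets partitions disappear with one broker, and every consumer group mapped to one of them stops. With a replication factor of 3, no partition is lost.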

Author: JanusWoo Source: https://juejin.im/post/6874957625998606344

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
