
Why a Kafka Broker Crash Can Halt All Consumers – The Hidden Offset Pitfall

This article walks through Kafka's high-availability design, covering multi-replica redundancy, ISR synchronization, leader election, and the critical role of the __consumer_offsets internal topic. It shows why a single broker failure can stall every consumer in the cluster, and how to configure replication factors and ack settings to prevent it.

Efficient Ops

The story begins with a Kafka outage: one of three broker nodes went down, and despite the remaining two nodes being up, the entire consumer group stopped receiving messages.

Kafka’s Multi‑Replica Redundancy Design

High availability in distributed systems such as Kafka is achieved through redundancy. Key concepts include:

Broker: a Kafka server node.

Topic: a logical category for messages.

Partition: a subdivision of a topic; each partition resides on a broker.

Offset: the position of a message within a partition, used by consumers to track progress.

When a broker fails, its partitions have replicas on other brokers. If the leader replica fails, a follower from the ISR (In‑Sync Replica) list is elected as the new leader, allowing producers and consumers to continue operating.

However, questions arise: how many replicas are enough, what if followers are not fully synchronized, and what are the leader election rules after a node failure?

A replication factor of three is generally sufficient for high availability; more replicas increase resource consumption and may reduce write performance, since every message must be copied to every replica.
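As a concrete sketch (the bootstrap address and topic name are placeholder values), a topic with three replicas per partition can be created like this:

```shell
# Create a topic whose every partition has 3 replicas
# (localhost:9092 and demo-topic are illustrative placeholders).
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic demo-topic \
  --partitions 3 \
  --replication-factor 3
```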

Followers and leaders are kept in sync via the ISR mechanism. Each leader maintains an ISR list of followers that are sufficiently up‑to‑date. Followers that fall behind are removed from the ISR.

Leader election serves the same goal as consensus protocols in other distributed systems (Zab, Raft, Viewstamped Replication, PacificA), but Kafka's approach is simpler: the controller selects a new leader from the ISR list. If the ISR is empty, Kafka may fall back to any surviving replica (unclean leader election, governed by unclean.leader.election.enable), which risks data loss. The controller ensures only one leader exists at a time.
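The current leader and ISR membership for each partition can be inspected directly (bootstrap address and topic name are placeholders):

```shell
# For each partition, the output lists Leader, Replicas, and Isr,
# so you can see which replicas are eligible for leader election.
kafka-topics.sh --describe \
  --bootstrap-server localhost:9092 \
  --topic demo-topic
```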

The acks Parameter Determines Reliability

The producer configuration request.required.acks (named acks in the modern Java producer) controls how many acknowledgments are required before a send is considered successful. It has three possible values:

0: The producer does not wait for any acknowledgment; messages may be lost.

1 (the historical default; newer Java producers default to all): The producer waits only for the leader to acknowledge. If the leader crashes before followers replicate the message, it can be lost.

all (or -1): The producer waits for all in-sync replicas to acknowledge, providing the strongest durability guarantee. This is only a real guarantee when the ISR contains at least two replicas, which is what the broker setting min.insync.replicas enforces.
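The settings above can be sketched as configuration fragments (values are illustrative, not prescriptive):

```properties
# producer side – wait for all in-sync replicas
acks=all                  # legacy name: request.required.acks=-1

# broker side (server.properties) – acks=all only guarantees durability
# if at least this many ISR members must acknowledge each write
min.insync.replicas=2
```

With min.insync.replicas=2 and acks=all, a write is rejected (rather than silently under-replicated) when the ISR shrinks to a single replica.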

Solution

The root cause was the internal topic __consumer_offsets, which stores consumer offsets. In older Kafka versions this topic could be auto-created with a replication factor of 1 and 50 partitions, often all placed on a single broker, creating a single point of failure: if that broker dies, no consumer group can commit or fetch offsets.
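Which of the 50 partitions holds a given group's offsets is determined by a hash of the group id, mirroring Kafka's Utils.abs(groupId.hashCode) % numPartitions. A minimal sketch of that mapping, reimplemented in Python for illustration:

```python
def java_string_hashcode(s: str) -> int:
    """Replicate Java's String.hashCode with 32-bit signed overflow."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def offsets_partition_for(group_id: str, num_partitions: int = 50) -> int:
    """Partition of __consumer_offsets that stores this group's commits."""
    return (java_string_hashcode(group_id) & 0x7FFFFFFF) % num_partitions

print(offsets_partition_for("abc"))  # hashCode("abc") = 96354 -> partition 4
```

Because all commits for one group map to one partition, a group whose partition has no surviving replica cannot commit or fetch offsets at all, which matches the observed outage.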

To fix the issue:

1. Delete the existing __consumer_offsets topic. (It cannot be removed through ordinary Kafka commands, so its log directories must be cleared on each broker.)

2. Set offsets.topic.replication.factor to 3, so that the offsets topic is recreated with three replicas.

With the offset topic replicated across all brokers, a single broker failure no longer prevents consumers from reading offsets, and the cluster remains operational.
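Assuming a three-broker cluster, the fix and its verification can be sketched as follows (the bootstrap address is a placeholder):

```shell
# 1. In each broker's server.properties, before the offsets topic is recreated:
#    offsets.topic.replication.factor=3

# 2. Verify: every partition of __consumer_offsets should now list
#    three replicas and, once caught up, three ISR members.
kafka-topics.sh --describe \
  --bootstrap-server localhost:9092 \
  --topic __consumer_offsets
```

On newer clusters, the replication factor of an existing __consumer_offsets topic can also be raised with kafka-reassign-partitions.sh, avoiding the need to clear log directories.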

high availability · Kafka · replication · ISR · consumer offset · acks
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
