Why a Single Kafka Broker Failure Stops All Consumers – Understanding HA

This article explains Kafka's high‑availability mechanisms, covering multi‑replica design, ISR synchronization, leader election, the impact of the request.required.acks setting, and how the default __consumer_offset topic can become a single point of failure, with concrete steps to fix it.

Efficient Ops
Efficient Ops
Efficient Ops
Why a Single Kafka Broker Failure Stops All Consumers – Understanding HA

The story starts with a Kafka outage at a fintech company that uses Kafka instead of RabbitMQ, prompting an investigation into why a single broker failure caused all consumers to stop receiving messages.

Kafka's Multi‑Replica Redundancy Design

High availability in distributed systems like Zookeeper, Redis, Kafka, and HDFS is typically achieved through redundancy.

Key Kafka concepts:

Broker (node) : a Kafka server instance.

Topic : a logical category for messages; producers send to a topic name, consumers read from it.

Partition : each topic is split into one or more partitions, each belonging to a single broker.

Offset : the position of a message within a partition, used by consumers to track progress.

When a broker goes down, its partitions have replicas on other brokers. If the leader of a partition fails, a follower from the ISR (In‑Sync Replica) list is elected as the new leader, allowing producers and consumers to continue operating.

Typical replication factor of 3 provides a good balance between fault tolerance and resource usage.

Followers and leaders are not fully synchronous; they maintain an ISR list of replicas that are sufficiently up‑to‑date. If a follower falls behind, it is removed from the ISR.

Leader election after a broker failure follows the ISR list; if the ISR is empty, any surviving replica can become leader, though this may risk data loss.

Ack Parameter Determines Reliability

The producer configuration request.required.acks controls how many acknowledgments are required for a send to be considered successful: 0: fire‑and‑forget; messages may be lost. 1 (default): only the leader must acknowledge; if the leader crashes before followers replicate, data can be lost. all (or -1): the leader and all ISR followers must acknowledge, providing the strongest durability guarantee, but only when at least two ISR replicas exist.

Solving the Problem

The root cause was the internal __consumer_offset topic, which by default has a replication factor of 1 and 50 partitions, often all residing on a single broker. When that broker failed, consumers could no longer read offsets, halting consumption.

Fixes:

Delete the existing __consumer_offset topic (it cannot be removed directly, so delete its log files).

Set offsets.topic.replication.factor to 3 in the broker configuration, ensuring the offset topic is replicated across three brokers.

After increasing the replication factor, the offset topic is no longer a single point of failure, and the cluster remains operational even when one broker is down.

Images illustrating the architecture:

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

KafkaReplicationdistributed-systemsleader electionhigh-availabilityconsumer-offset
Efficient Ops
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.