
Why Does a Single Kafka Broker Failure Break All Consumers?

A single Kafka broker outage can stop consumers even though the remaining brokers are healthy. This article explains why, covering replication settings, ISR mechanics, and the replication factor of the internal __consumer_offsets topic, and resolves the problem with practical configuration steps.

Senior Brother's Insights

Mar 29, 2021


1. High Availability Reflections Triggered by Kafka Outage

In a fintech company that uses Kafka for log processing, a broker unexpectedly went down. Although two brokers were still running, all consumers stopped receiving messages, prompting an investigation into Kafka’s high‑availability mechanisms.

2. Kafka’s Multi‑Replica Redundancy Design

Redundancy is the common strategy for achieving high availability in distributed systems such as Zookeeper, Redis, Kafka, and HDFS. The article first clarifies four core Kafka concepts:

Broker : a Kafka server, i.e., a physical node.

Topic : a logical category for messages; producers send to a topic name and consumers read from the same name.

Partition : a topic can be split into one or more partitions; each partition is an ordered log whose leader replica lives on a single broker, and message order is preserved within a partition but not across partitions.

Offset : the position of a message within a partition, used by consumers to track consumption.

Before version 0.8 Kafka had no replication; a broker failure made its partitions unavailable and could lose their data. Starting with 0.8, each partition is replicated across multiple brokers, forming one leader replica and one or more follower replicas. By default, producers and consumers interact only with the leader (since Kafka 2.4, consumers can optionally be configured to fetch from followers); followers pull data from the leader to stay synchronized.

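If a broker fails, the cluster detects the loss of the partition's leader and elects a new leader from the ISR (In-Sync Replica) list. If the ISR list becomes empty, the partition stays offline until an in-sync replica comes back, unless unclean.leader.election.enable is set to true, in which case an out-of-sync replica may be chosen at the cost of possible data loss.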
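The election rule can be sketched as a small Python function. This is a simplified model for illustration, not Kafka's controller code: the function name, its parameters, and the broker ids are hypothetical, and only the unclean-election switch corresponds to a real broker setting (unclean.leader.election.enable).

```python
from typing import List, Optional, Set


def elect_leader(
    assigned_replicas: List[int],
    isr: List[int],
    live_brokers: Set[int],
    unclean_election_enabled: bool = False,
) -> Optional[int]:
    """Return the broker id of the new leader, or None if the partition goes offline."""
    # Preferred path: the first assigned replica that is both alive and in the ISR.
    for broker in assigned_replicas:
        if broker in live_brokers and broker in isr:
            return broker
    # Fallback: any live replica, but only if unclean election is allowed.
    # The chosen replica may be missing recent messages.
    if unclean_election_enabled:
        for broker in assigned_replicas:
            if broker in live_brokers:
                return broker
    return None


replicas = [1, 2, 3]
# Broker 1 (the old leader) is down and the followers were in sync.
print(elect_leader(replicas, isr=[1, 2, 3], live_brokers={2, 3}))  # 2
# Only the dead leader was in sync: the partition stays offline.
print(elect_leader(replicas, isr=[1], live_brokers={2, 3}))  # None
# Same situation with unclean election enabled: broker 2 takes over.
print(elect_leader(replicas, isr=[1], live_brokers={2, 3},
                   unclean_election_enabled=True))  # 2
```

The second call is the case that trades availability for safety: the partition is unavailable, but no acknowledged data is silently dropped.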

Key practical questions addressed:

How many replicas are sufficient? Typically three replicas provide a good balance of availability and resource consumption.

What if followers are not fully synchronized? Kafka maintains an ISR list; only replicas that have caught up with the leader within replica.lag.time.max.ms remain in the ISR, so a leader elected from it is in sync.

What are the leader election rules after a failure? The controller picks the first replica in the partition's assigned replica list that is alive and in the ISR, guaranteeing a single leader at any time.
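How the ISR shrinks when a follower lags can be illustrated with a minimal sketch. It assumes a simplified model in which the leader tracks, per follower, the last time that follower was fully caught up; the function name, the timestamps, and the 30-second threshold are illustrative assumptions, with the threshold standing in for the broker setting replica.lag.time.max.ms.

```python
from typing import Dict, List


def compute_isr(
    leader: int,
    last_caught_up_ms: Dict[int, int],
    now_ms: int,
    replica_lag_time_max_ms: int = 30_000,  # assumed value for illustration
) -> List[int]:
    """Return the in-sync replica list: the leader plus followers that caught up recently."""
    isr = [leader]
    for broker, caught_up_at in sorted(last_caught_up_ms.items()):
        if broker == leader:
            continue
        # A follower stays in the ISR only if it was fully caught up
        # within the allowed lag window.
        if now_ms - caught_up_at <= replica_lag_time_max_ms:
            isr.append(broker)
    return isr


# Follower 2 caught up 5 s ago, follower 3 has been behind for 50 s.
print(compute_isr(leader=1, last_caught_up_ms={2: 95_000, 3: 50_000}, now_ms=100_000))
# [1, 2]
```

Follower 3 is dropped from the ISR and is therefore not a candidate in a clean leader election until it catches up again.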

3. Ack Parameter Determines Reliability

The producer's acks setting (named request.required.acks in the legacy Scala producer) controls durability:

0 : fire‑and‑forget; the producer does not wait for any acknowledgment, risking message loss.

1 : the producer waits only for the leader’s acknowledgment; if the leader crashes before followers sync, the message may be lost.

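all (or -1): the producer waits for acknowledgment from all in-sync replicas, providing the strongest durability guarantee. Because the ISR can shrink to the leader alone, this only guarantees a second copy when the broker or topic setting min.insync.replicas is 2 or higher.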

When this article was written, Kafka's default was acks=1, a compromise between throughput and safety. Since Kafka 3.0 the producer default is acks=all.
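The difference between the acks levels can be made concrete with a toy model. This is a sketch under simplifying assumptions, not Kafka's replication protocol: logs are plain Python lists, follower fetching is collapsed into a loop, and the names produce and NotEnoughReplicas are hypothetical (the exception is only loosely modelled on the error Kafka returns when the ISR is smaller than min.insync.replicas).

```python
from typing import Dict, List


class NotEnoughReplicas(Exception):
    """Raised when acks=all is requested but the ISR is smaller than min.insync.replicas."""


def produce(
    acks: str,
    leader: int,
    isr: List[int],
    logs: Dict[int, List[str]],
    message: str,
    min_insync_replicas: int = 1,
) -> bool:
    """Append `message` and return True if the producer receives an acknowledgment."""
    if acks == "all" and len(isr) < min_insync_replicas:
        raise NotEnoughReplicas(f"ISR size {len(isr)} < {min_insync_replicas}")
    logs[leader].append(message)
    if acks == "all":
        # The acknowledgment is sent only after every in-sync follower has the message.
        for broker in isr:
            if broker != leader:
                logs[broker].append(message)
    # acks="0": the producer does not wait, so it never gets a confirmation.
    return acks != "0"


logs = {1: [], 2: [], 3: []}

# acks=1: acknowledged, but the followers do not have the message yet.
produce("1", leader=1, isr=[1, 2, 3], logs=logs, message="m1")
print("m1" in logs[2])  # False: lost if broker 1 crashes now

# acks=all: acknowledged only once the followers hold a copy.
produce("all", leader=1, isr=[1, 2, 3], logs=logs, message="m2")
print("m2" in logs[2])  # True: survives a crash of broker 1
```

With min_insync_replicas=2 and an ISR containing only the leader, the acks=all call raises instead of accepting a write that exists on a single broker.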

4. Solving the Consumer Outage

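In the test environment there are three brokers, a topic with replication factor 3 and six partitions, and producers using acks=1. When one broker goes down, consumers stop because the internal __consumer_offsets topic, which stores committed consumer offsets, has a replication factor of 1 in this cluster (the sample server.properties shipped with Kafka sets offsets.topic.replication.factor=1). Each of its partitions is therefore a single point of failure: a consumer group whose offsets partition lived on the failed broker loses its group coordinator and can no longer fetch or commit offsets.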
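Which __consumer_offsets partition a consumer group depends on can be estimated with a short sketch. It assumes the broker maps a group to a partition as abs(groupId.hashCode()) % offsets.topic.num.partitions, with Java's String.hashCode() and the default of 50 partitions; the function names and the group ids "abc" and "a" are hypothetical examples.

```python
def java_string_hashcode(s: str) -> int:
    """Reproduce Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    data = s.encode("utf-16-be")  # Java hashes UTF-16 code units
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF
    # Convert from unsigned to signed 32-bit.
    return h - 0x100000000 if h >= 0x80000000 else h


def offsets_partition_for(group_id: str, num_partitions: int = 50) -> int:
    """Partition of __consumer_offsets assumed to hold this group's offsets."""
    h = java_string_hashcode(group_id)
    # Integer.MIN_VALUE has no positive counterpart in 32 bits; map it to 0.
    non_negative = 0 if h == -0x80000000 else abs(h)
    return non_negative % num_partitions


print(java_string_hashcode("abc"))   # 96354
print(offsets_partition_for("abc"))  # 4
```

The broker leading that partition acts as the group's coordinator. With a replication factor of 1, no other broker can take over when it fails, so the group stalls even though the data topic itself is still fully available.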

Resolution steps:

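Delete the existing __consumer_offsets topic. Since it cannot be removed via Kafka commands, its log files are manually deleted. This discards every committed offset, so consumer groups fall back to their auto.offset.reset policy afterwards.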

Set offsets.topic.replication.factor=3 in the broker configuration and restart the cluster, causing Kafka to recreate __consumer_offsets with three replicas.

After increasing the replication factor, each __consumer_offsets partition stays available as long as at least one of its in-sync replicas is on a live broker. Kafka does not require a majority of brokers for this, so losing a single broker no longer stops consumers.
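The effect of the replication factor on availability can be checked with a small sketch. The partition layouts below are hypothetical (six partitions over brokers 1 to 3, rather than the 50 partitions __consumer_offsets has by default), and the model ignores ISR state: it only asks whether any replica of a partition is on a live broker.

```python
from typing import Dict, List, Set


def offline_partitions(assignment: Dict[int, List[int]], dead_brokers: Set[int]) -> List[int]:
    """Return the partitions whose replicas are all on dead brokers."""
    return sorted(
        partition
        for partition, replicas in assignment.items()
        if all(broker in dead_brokers for broker in replicas)
    )


# Replication factor 1: each partition lives on exactly one broker.
rf1 = {p: [p % 3 + 1] for p in range(6)}
# Replication factor 3: every broker holds a replica of every partition.
rf3 = {p: [1, 2, 3] for p in range(6)}

print(offline_partitions(rf1, dead_brokers={1}))     # [0, 3]
print(offline_partitions(rf3, dead_brokers={1}))     # []
print(offline_partitions(rf3, dead_brokers={1, 2}))  # []
```

With a single replica, every partition hosted on the failed broker goes offline together with the consumer groups that depend on it. With three replicas, the partitions survive the loss of one or even two brokers, provided a surviving replica is in sync.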


