Why a Single Kafka Broker Failure Can Halt All Consumers: Kafka HA Explained

This article explains Kafka's high‑availability mechanisms: multi‑replica design, ISR synchronization, leader election, and acknowledgment settings. It also covers a hidden risk: when the internal __consumer_offsets topic has only a single replica, the failure of one broker can stop consumers across the whole cluster.

Open Source Linux

Jul 18, 2021

Kafka High‑Availability Overview

A fintech company uses Kafka, which was originally designed for log processing, as its core messaging system. The cluster normally runs stably, but whenever one of its three brokers crashes, all consumers stop receiving messages.

Multi‑Replica Redundancy Design

Kafka achieves high availability through replication of partitions across multiple brokers. When a broker goes down, its partitions still have replicas on other brokers, and a new leader is elected from the in‑sync replicas (ISR).

Key concepts:

Broker: a Kafka server node.

Topic: a logical category for messages; producers and consumers use the topic name to send and receive data.

Partition: each topic is split into one or more ordered partitions, each hosted on a broker.

Offset: the position of a message within a partition, used by consumers to track consumption.

Replication Factor and ISR

More replicas increase fault tolerance but consume more resources. A replication factor of three is typically sufficient for high availability. Kafka uses an ISR list to track followers that are sufficiently synchronized with the leader; only ISR members are eligible for leader election.
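The effect of the ISR rule on a broker failure can be sketched with a toy model. This is an illustration only, not Kafka's controller code: the Partition class and its method are hypothetical names, and the election rule (first surviving replica, in assignment order, that is still in the ISR) is a simplified assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    """Toy model of one partition's replica state (hypothetical, not Kafka's own types)."""
    replicas: List[int]           # broker ids holding a copy, in assignment order
    isr: List[int]                # subset of replicas currently in sync with the leader
    leader: Optional[int] = None  # broker id serving reads and writes, None if offline

    def handle_broker_failure(self, broker_id: int) -> None:
        # A dead broker drops out of the ISR.
        self.isr = [b for b in self.isr if b != broker_id]
        if self.leader == broker_id:
            # Only replicas that are still in the ISR may become leader.
            candidates = [b for b in self.replicas if b in self.isr]
            self.leader = candidates[0] if candidates else None


replicated = Partition(replicas=[0, 1, 2], isr=[0, 1, 2], leader=0)
single = Partition(replicas=[0], isr=[0], leader=0)

for partition in (replicated, single):
    partition.handle_broker_failure(0)

print(replicated.leader)  # 1: a follower in the ISR took over
print(single.leader)      # None: no replica left, the partition is offline
```

The second partition shows why a replication factor of 1 gives no protection: once its only broker is gone there is no ISR member left to elect.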

Acknowledgment (acks) Settings

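The producer's acks parameter (request.required.acks in the legacy Scala producer) controls durability:

0: fire‑and‑forget; the producer does not wait for any acknowledgment, so messages may be lost.

1: only the leader must acknowledge; if the leader fails before followers sync, messages can be lost. This was the Java producer's default before Kafka 3.0; from 3.0 on the default is all.

all (‑1 in the legacy setting): all ISR replicas must acknowledge. This is the strongest durability guarantee, but it only protects against a leader failure while at least one follower remains in the ISR. Setting min.insync.replicas to 2 or more makes the broker reject writes when that is not the case.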
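The three settings can be compared with a small decision function. This is a simplified model rather than broker logic: ack_outcome and its return labels are hypothetical, isr_size counts the leader plus its in-sync followers, and the min_insync_replicas argument stands in for Kafka's min.insync.replicas setting.

```python
def ack_outcome(acks: str, isr_size: int, min_insync_replicas: int = 1) -> str:
    """Classify an acknowledged write if the leader crashes right after the ack.

    Toy model: isr_size counts the leader itself plus in-sync followers.
    """
    if acks not in {"0", "1", "all"}:
        raise ValueError(f"unsupported acks value: {acks!r}")
    if isr_size < 1:
        raise ValueError("the ISR must contain at least the leader")
    if acks == "all":
        if isr_size < min_insync_replicas:
            # The broker refuses the write instead of acknowledging it.
            return "rejected"
        if isr_size >= 2:
            # At least one follower already holds the message.
            return "durable"
    # acks=0, acks=1, or acks=all with a leader-only ISR:
    # the message may exist on the leader alone.
    return "at risk"


for acks in ("0", "1", "all"):
    for isr_size in (1, 3):
        print(f"acks={acks} isr_size={isr_size} -> {ack_outcome(acks, isr_size)}")
```

Calling ack_outcome("all", 1, min_insync_replicas=2) returns "rejected", which is the trade-off of that setting: the producer sees an error rather than an acknowledgment that only the leader can back up.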

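Hidden Single‑Point Failure: __consumer_offsets Topic

The internal __consumer_offsets topic stores the offsets committed by consumer groups. It has 50 partitions by default, and its replication factor comes from offsets.topic.replication.factor. The broker default for that setting is 3, but the sample server.properties shipped with Kafka sets it to 1, and Kafka versions before 0.11 would create the topic with fewer replicas if fewer brokers were running when it was first needed. With a replication factor of 1, each partition lives on exactly one broker, which makes the topic a single point of failure. If a broker holding its partitions crashes, the consumer groups whose offsets live on those partitions can no longer fetch or commit offsets and stop consuming; if all 50 partitions sit on one broker, that means every consumer.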
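To see which consumer groups depend on which broker, the group‑to‑partition mapping can be reproduced offline. The sketch below assumes that Kafka picks the offsets partition as the absolute value of the Java hashCode of the group id, modulo the partition count; treat that as an assumption to verify against your Kafka version. The function names, the group ids, and the round‑robin partition layout are all hypothetical.

```python
def java_string_hash(s: str) -> int:
    """Replicate Java's String.hashCode() as a signed 32-bit integer.

    Assumes characters in the Basic Multilingual Plane (e.g. ASCII group ids).
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def offsets_partition_for(group_id: str, num_partitions: int = 50) -> int:
    """Assumed mapping from a consumer group id to a __consumer_offsets partition."""
    h = java_string_hash(group_id)
    h = 0 if h == -2**31 else abs(h)  # Integer.MIN_VALUE has no positive counterpart
    return h % num_partitions


def affected_groups(groups, failed_broker, broker_of_partition):
    """Groups whose offsets partition lived only on the failed broker (replication factor 1)."""
    return [g for g in groups if broker_of_partition[offsets_partition_for(g)] == failed_broker]


# Hypothetical layout: 50 single-replica partitions spread round-robin over brokers 0, 1, 2.
broker_of_partition = {p: p % 3 for p in range(50)}
groups = ["payments", "audit-log", "hello"]  # hypothetical group ids

for broker in (0, 1, 2):
    print(broker, affected_groups(groups, broker, broker_of_partition))
```

In this layout a broker failure takes out only the groups mapped to its partitions. If the topic was created while a single broker was running, every entry in broker_of_partition points at that broker and all groups are affected at once.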

Solution

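1. Delete the existing __consumer_offsets topic. It cannot be removed with the topic command, so delete its log files on the brokers. This discards the committed offsets stored in the topic.

2. Set offsets.topic.replication.factor=3 in the broker configuration so that the topic is recreated with three replicas, ensuring offset data remains available after a broker failure.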
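An alternative to deleting the topic, not described in the original write‑up, is to raise the replication factor of the existing partitions with Kafka's partition reassignment tool (kafka-reassign-partitions), which takes a JSON plan listing the replicas for each partition. The sketch below only generates such a plan. The broker ids 0, 1 and 2 and the round‑robin replica ordering are assumptions, and build_reassignment is a hypothetical helper.

```python
import json


def build_reassignment(topic, num_partitions, broker_ids, replication_factor):
    """Build a reassignment plan that gives every partition `replication_factor` replicas."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    partitions = []
    for p in range(num_partitions):
        # Rotate the starting broker so preferred leaders are spread evenly.
        start = p % len(broker_ids)
        replicas = [broker_ids[(start + i) % len(broker_ids)] for i in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


# Assumed cluster: three brokers with ids 0, 1, 2 and the default 50 offsets partitions.
plan = build_reassignment("__consumer_offsets", 50, [0, 1, 2], 3)

with open("increase-offsets-rf.json", "w") as f:
    json.dump(plan, f, indent=2)

print(json.dumps(plan["partitions"][:2]))
```

The resulting file would be passed to the reassignment tool with its --reassignment-json-file and --execute options. Check the real broker ids and the tool's connection flags for your Kafka version before running it.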

Conclusion

Kafka's multi‑replica architecture provides high availability only when every topic that consumers depend on is replicated across brokers. Internal topics such as __consumer_offsets need special attention because they may end up with a single replica. Raising the replication factor to 3 eliminates the outage caused by a single broker failure.
