
Why a Single Kafka Broker Failure Can Halt All Consumers – Deep Dive into HA

This article explains Kafka's multi‑replica design, the ISR mechanism, leader election rules, and producer acknowledgment settings. It then shows how the built‑in __consumer_offsets topic, when it has only a single replica, can stop every consumer in the cluster as soon as one broker crashes, and offers practical fixes.

Efficient Ops

Kafka's Multi‑Replica Redundancy Design

High availability in distributed systems like Kafka is achieved through redundancy. Kafka uses brokers (nodes), topics, partitions, and offsets to store and deliver messages.

Broker : a Kafka server instance.

Topic : a logical channel for messages.

Partition : a subdivision of a topic; each partition resides on a broker.

Offset : the position of a message within a partition.

Each partition can have replicas on other brokers: one replica acts as the leader and the rest are followers. If a failed broker hosted the leader, a follower is elected as the new leader, so producers and consumers can continue operating.

A typical replication factor is 3, which balances fault tolerance against resource usage.

ISR (In‑Sync Replica) Mechanism

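A follower is considered in‑sync only while it keeps up with the leader, that is, while it has caught up within the time allowed by replica.lag.time.max.ms. The ISR list contains the leader plus these in‑sync followers; a follower that falls behind is removed from the list and is added back once it catches up again.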
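The ISR rule can be illustrated with a small model. This is a minimal sketch, not broker code: the Replica class, the compute_isr function, and the 30-second threshold are assumptions chosen for illustration (the real threshold is whatever replica.lag.time.max.ms is set to on the brokers).

```python
from dataclasses import dataclass


@dataclass
class Replica:
    broker_id: int
    last_caught_up_ms: int  # last time this replica was fully caught up with the leader


def compute_isr(leader_id, replicas, now_ms, max_lag_ms=30_000):
    """Return the broker ids treated as in-sync in this simplified model.

    The leader is always in the ISR; a follower stays in it only if it
    caught up with the leader within the last max_lag_ms milliseconds.
    """
    isr = []
    for r in replicas:
        if r.broker_id == leader_id or now_ms - r.last_caught_up_ms <= max_lag_ms:
            isr.append(r.broker_id)
    return isr


# Hypothetical partition: broker 1 leads, broker 2 is 5 s behind, broker 3 is 60 s behind.
replicas = [Replica(1, 100_000), Replica(2, 95_000), Replica(3, 40_000)]
print(compute_isr(leader_id=1, replicas=replicas, now_ms=100_000))  # [1, 2]
```

Broker 3 drops out because it has been behind for longer than the threshold; it would rejoin once its last_caught_up_ms moved back inside the window.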

Leader Election After a Broker Crash

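Kafka relies on a controller to ensure that each partition has exactly one leader. When a leader disappears, the controller selects a new leader from the surviving replicas in the ISR. If no in‑sync replica survives, the outcome depends on unclean.leader.election.enable: when it is false (the default in current releases), the partition stays offline until an in‑sync replica returns; when it is true, the controller picks any surviving replica, which risks losing messages that replica never received.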
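The election rule can be sketched as a small function. This is a simplified model under stated assumptions, not the controller's actual implementation: elect_leader and its parameters are hypothetical names, and allow_unclean stands in for the broker setting unclean.leader.election.enable.

```python
def elect_leader(assigned_replicas, isr, live_brokers, allow_unclean=False):
    """Pick a new leader for one partition, or return None if it must stay offline."""
    # Prefer the first assigned replica that is both alive and in the ISR.
    for broker in assigned_replicas:
        if broker in live_brokers and broker in isr:
            return broker
    # No in-sync replica is alive: fall back to any live replica only when
    # unclean election is allowed. Such a replica may be missing recent messages.
    if allow_unclean:
        for broker in assigned_replicas:
            if broker in live_brokers:
                return broker
    return None


# Broker 1 (the old leader) has crashed; brokers 2 and 3 are alive.
print(elect_leader([1, 2, 3], isr={1, 2}, live_brokers={2, 3}))                   # 2
print(elect_leader([1, 2, 3], isr={1}, live_brokers={2, 3}))                      # None
print(elect_leader([1, 2, 3], isr={1}, live_brokers={2, 3}, allow_unclean=True))  # 2
```

The second call shows the availability cost of a shrunken ISR: with only the crashed leader in sync, the partition has no eligible leader unless unclean election is permitted.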

Producer Acknowledgment Settings (acks)

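The producer's acks setting (request.required.acks in the legacy producer) controls how many acknowledgments a write must receive before it counts as successful:

0 : fire‑and‑forget; the producer does not wait for any acknowledgment, so data can be lost.

1 : wait for the leader's acknowledgment only (the default before Kafka 3.0).

all (or -1) : wait for every replica in the ISR to acknowledge, which gives the strongest durability guarantee (the default since Kafka 3.0).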

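Even with acks=all, if the ISR has shrunk to the leader alone, durability degrades to the behavior of acks=1. Setting min.insync.replicas to 2 or more makes the broker reject such writes instead of accepting them with a single copy.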
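The interaction between acks and the ISR size can be modeled in a few lines. This is an illustrative sketch only: replicas_guaranteed is a hypothetical helper, and min_insync_replicas mirrors the min.insync.replicas setting, which the broker enforces only for acks=all writes.

```python
def replicas_guaranteed(acks, isr_size, min_insync_replicas=1):
    """Return how many replicas are known to hold a record when the producer is acked.

    Raises RuntimeError when the write would be rejected because the ISR
    is smaller than min_insync_replicas (only checked for acks=all).
    """
    if acks == "0":
        return 0  # the producer does not wait, so nothing is guaranteed
    if acks == "1":
        return 1  # only the leader has confirmed the write
    if acks in ("all", "-1"):
        if isr_size < min_insync_replicas:
            raise RuntimeError("not enough in-sync replicas for this write")
        return isr_size  # every current ISR member has the record
    raise ValueError(f"unknown acks value: {acks!r}")


print(replicas_guaranteed("all", isr_size=3))  # 3
print(replicas_guaranteed("all", isr_size=1))  # 1, no better than acks=1
try:
    replicas_guaranteed("all", isr_size=1, min_insync_replicas=2)
except RuntimeError as exc:
    print("rejected:", exc)
```

With a minimum of 2 in-sync replicas the degraded case becomes an explicit producer error, trading availability of writes for durability.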

Root Cause: The __consumer_offsets Topic

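The internal __consumer_offsets topic stores the offsets that consumer groups commit. It has 50 partitions by default, and in many deployments its replication factor is 1 (the sample server.properties shipped with Kafka sets offsets.topic.replication.factor=1). With one replica per partition, and often all of them on a single broker, the topic is a single point of failure: when that broker goes down, consumer groups can no longer commit or fetch their offsets, so consumption stops even though the data topics themselves are fully replicated.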
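To see why one broker can block every group, it helps to look at how a group is tied to one offsets partition. The sketch below assumes the broker picks that partition as the absolute value of the Java hash of the group id, modulo the partition count; treat that formula as an assumption about broker internals rather than a documented contract. The group names and the single-broker layout are hypothetical, and the hash emulation is only exact for group ids without characters outside the Basic Multilingual Plane.

```python
def java_string_hash(s):
    """Emulate Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def offsets_partition(group_id, num_partitions=50):
    """Assumed mapping from a consumer group id to its offsets partition."""
    h = java_string_hash(group_id)
    h = 0 if h == -(2 ** 31) else abs(h)  # guard the one value abs() cannot flip in Java
    return h % num_partitions


# Hypothetical layout: all 50 offsets partitions have their only replica on broker 1.
leader_of = {p: 1 for p in range(50)}
failed_broker = 1

for group in ["orders", "billing", "audit"]:
    p = offsets_partition(group)
    status = "blocked" if leader_of[p] == failed_broker else "ok"
    print(f"group={group} offsets_partition={p} status={status}")
```

Whatever partition a group hashes to, it lives on the failed broker in this layout, so every group is blocked at once.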

Solutions

Increase the replication factor of __consumer_offsets by setting offsets.topic.replication.factor to 3 on the brokers.

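Because offsets.topic.replication.factor is applied only when the topic is created, an existing __consumer_offsets topic keeps its old replication factor. One option is to delete the existing topic (by removing its log files) and let Kafka recreate it with the new setting, but this discards every committed offset. A less destructive option is to add replicas in place with the partition reassignment tool (kafka-reassign-partitions.sh).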
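An alternative to deleting the topic is to raise its replication factor in place with a partition reassignment. The sketch below builds the kind of JSON plan that kafka-reassign-partitions.sh reads through its --reassignment-json-file option. The broker ids 1, 2, 3, the round-robin placement, and the build_reassignment helper are assumptions for illustration; a real plan should use the cluster's actual broker ids and be reviewed before it is executed.

```python
import json


def build_reassignment(topic, num_partitions, brokers, rf):
    """Build a reassignment plan that gives every partition rf replicas."""
    if rf > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    n = len(brokers)
    partitions = []
    for p in range(num_partitions):
        # Rotate the starting broker so that leadership is spread across brokers.
        replicas = [brokers[(p + i) % n] for i in range(rf)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


plan = build_reassignment("__consumer_offsets", num_partitions=50, brokers=[1, 2, 3], rf=3)
print(json.dumps(plan["partitions"][:2]))
```

Writing the full plan to a file with json.dump and passing that file to the tool with --execute would apply it; the first broker in each replicas list becomes the preferred leader.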

With the offset topic replicated across brokers, a single broker failure no longer blocks consumer progress.
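The effect of the fix can be checked with a toy model that counts how many offsets partitions lose every replica when one broker fails. This is a sketch under simplifying assumptions: assign uses a plain round-robin placement rather than Kafka's real assignment algorithm, and the broker ids are hypothetical.

```python
def assign(num_partitions, brokers, rf):
    """Simplified placement: partition p gets rf consecutive brokers, round-robin."""
    n = len(brokers)
    return {p: [brokers[(p + i) % n] for i in range(rf)] for p in range(num_partitions)}


def offline_after_failure(assignment, failed_broker):
    """Return the partitions whose replicas all sit on the failed broker."""
    return [p for p, reps in assignment.items() if all(b == failed_broker for b in reps)]


single_broker = assign(50, brokers=[1], rf=1)      # everything on broker 1
spread_rf1 = assign(50, brokers=[1, 2, 3], rf=1)   # one replica each, spread out
spread_rf3 = assign(50, brokers=[1, 2, 3], rf=3)   # three replicas each

print(len(offline_after_failure(single_broker, 1)))  # 50
print(len(offline_after_failure(spread_rf1, 1)))     # 17
print(len(offline_after_failure(spread_rf3, 1)))     # 0
```

Spreading single replicas across brokers only limits the damage to the groups that map to the failed broker's partitions; only replication removes the outage entirely.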

