Understanding Kafka High Availability: Causes of Outage and Practical Solutions

This article explains Kafka's multi-replica architecture, the role of leader and follower replicas, ISR mechanism, ack settings, and how misconfiguration of the __consumer_offsets topic can cause a full cluster outage, offering practical steps to restore high availability.

Top Architect

Dec 23, 2020

Kafka Outage and High Availability Issues

The discussion starts with a real incident where one of three Kafka broker nodes crashed, causing all consumers to stop receiving messages despite the remaining two nodes being up.

Kafka's Multi‑Replica Redundancy Design

Kafka achieves high availability by replicating each partition across multiple brokers. A replicated partition has one Leader replica and one or more Follower replicas; by default only the leader serves producer writes and consumer reads, while followers continuously fetch data from the leader to stay in sync.

Physical Model

Logical Model

Broker (node): a single Kafka server.

Topic: a logical category for messages, identified by a Topic Name.

Partition: each topic is split into one or more partitions; a partition belongs to exactly one topic, and each of its replicas is stored on a broker.

Offset: the sequential position of a message within a partition, used by consumers to track how far they have consumed.
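The relationship between these terms can be sketched with a toy in-memory model. This is an illustration only, not how Kafka stores data; the class names `Topic` and `Partition`, their methods, and the `orders` topic are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Append-only log; a message's index in the list is its offset."""
    messages: list = field(default_factory=list)

    def append(self, value: str) -> int:
        self.messages.append(value)
        return len(self.messages) - 1  # offset assigned to the new message

    def read_from(self, offset: int) -> list:
        # A consumer resumes from the offset it last recorded.
        return self.messages[offset:]


@dataclass
class Topic:
    name: str
    partitions: list


# Hypothetical topic with two partitions.
orders = Topic("orders", [Partition(), Partition()])
first = orders.partitions[0].append("created")
second = orders.partitions[0].append("paid")

# After processing offset 0, the consumer's next position is offset 1.
remaining = orders.partitions[0].read_from(first + 1)
```

Offsets are local to a partition: the second partition above is still empty, and its first message would again receive offset 0.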

Before Kafka 0.8 there was no replication; a broker failure made every partition on that broker unavailable, and the data was lost if the disk could not be recovered. Since 0.8, each partition can be replicated, typically with a replication factor of three, providing fault tolerance.

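When a broker fails, a new leader for each partition it led is elected from the in-sync replica (ISR) list, the set of replicas that are fully caught up with the leader. If no in-sync replica is alive, Kafka elects one of the remaining out-of-sync replicas only when unclean.leader.election.enable is set to true (it has defaulted to false since 0.11), which risks data loss; otherwise the partition stays offline until an in-sync replica comes back.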
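The election rule can be sketched as a small function. This is a simplified model under stated assumptions, not the controller's actual code: the function name `elect_leader` and its parameters are hypothetical, and the `unclean_election` flag stands in for the `unclean.leader.election.enable` setting.

```python
from typing import Optional


def elect_leader(replicas, isr, live_brokers, unclean_election=False) -> Optional[int]:
    """Pick a new leader for one partition.

    replicas: broker ids in assignment order.
    isr: set of broker ids currently in sync.
    live_brokers: set of broker ids that are up.
    """
    # Preferred path: the first live replica that is also in the ISR.
    for broker in replicas:
        if broker in live_brokers and broker in isr:
            return broker
    # Fallback, only if unclean election is allowed: any live replica.
    # Messages the old leader had that this replica lacks are lost.
    if unclean_election:
        for broker in replicas:
            if broker in live_brokers:
                return broker
    return None  # no eligible replica: the partition stays offline


# Broker 1 (the old leader) has failed; brokers 2 and 3 are still up.
clean = elect_leader([1, 2, 3], isr={1, 2}, live_brokers={2, 3})
stuck = elect_leader([1, 2, 3], isr={1}, live_brokers={2, 3})
unclean = elect_leader([1, 2, 3], isr={1}, live_brokers={2, 3}, unclean_election=True)
```

In the second call the only in-sync replica was the failed leader, so the partition has no eligible leader; the third call trades durability for availability.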

Ack Parameter Determines Reliability

The producer configuration acks (named request.required.acks in the legacy Scala producer) controls how many acknowledgments are required before a send is considered successful:

0: fire-and-forget; the producer does not wait for any acknowledgment, so messages may be lost.

1: only the leader must acknowledge; if the leader crashes before followers replicate the message, data can be lost (this was the default before Kafka 3.0; newer clients default to all).

all (or -1): the leader acknowledges only after every replica in the ISR has the message, providing the strongest durability guarantee.

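Even with all, if the ISR contains only the leader, the guarantee degrades to the same level as 1. Setting min.insync.replicas to 2 or more on the broker or topic prevents this: writes with acks=all are then rejected whenever the ISR is smaller than that value.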
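The trade-off between the three settings can be illustrated with a small model. It is a sketch, not client code: the function names are hypothetical, the leader is assumed to be alive and counted in the ISR, and `min_insync_replicas` mirrors the broker/topic setting `min.insync.replicas`, which the original text does not discuss.

```python
def copies_at_ack(acks: str, isr_size: int) -> int:
    """Replicas guaranteed to hold a message when the producer sees success."""
    if acks == "0":
        return 0          # no acknowledgment is awaited at all
    if acks == "1":
        return 1          # only the leader has it for sure
    if acks in ("all", "-1"):
        return isr_size   # every in-sync replica, leader included
    raise ValueError(f"unknown acks value: {acks!r}")


def write_accepted(acks: str, isr_size: int, min_insync_replicas: int = 1) -> bool:
    """Whether the write is accepted, given the current ISR size."""
    if acks in ("0", "1"):
        return True
    if acks in ("all", "-1"):
        # With acks=all the write is rejected when the ISR is too small.
        return isr_size >= min_insync_replicas
    raise ValueError(f"unknown acks value: {acks!r}")


# ISR has shrunk to the leader alone: acks=all is no stronger than acks=1.
degraded = copies_at_ack("all", isr_size=1)

# With min.insync.replicas=2 the same write is refused instead of silently degraded.
refused = not write_accepted("all", isr_size=1, min_insync_replicas=2)
```

The design choice is between failing loudly (rejecting writes when too few replicas are in sync) and accepting writes that exist on a single broker.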

Resolving the Outage

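The root cause was the internal __consumer_offsets topic, which stores the offsets committed by consumer groups. It has 50 partitions by default, and in this cluster it was created with a replication factor of 1. (The broker default for offsets.topic.replication.factor is 3, but the sample server.properties shipped with Kafka sets it to 1.) With a single replica per partition there is no follower to take over, so when the broker hosting those partitions died, consumers could neither fetch nor commit offsets and stopped processing.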
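Which partition of __consumer_offsets a consumer group depends on is derived from its group id. The sketch below reproduces that mapping as it is commonly described (the non-negative Java `String.hashCode()` of the group id, modulo the partition count); treat it as an approximation to check against your Kafka version. The function names and the example group id `test` are hypothetical.

```python
def java_string_hash(s: str) -> int:
    """Equivalent of Java's String.hashCode(), as a signed 32-bit int."""
    h = 0
    data = s.encode("utf-16-be")  # Java hashes UTF-16 code units
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        h = (31 * h + unit) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def offsets_partition_for(group_id: str, num_partitions: int = 50) -> int:
    """Partition of __consumer_offsets that holds this group's offsets."""
    # Masking the sign bit keeps the result non-negative.
    return (java_string_hash(group_id) & 0x7FFFFFFF) % num_partitions


partition = offsets_partition_for("test")
```

If the only replica of that partition sits on the failed broker, the group has no reachable coordinator, which matches the symptom of consumers stalling while the data topics themselves are healthy.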

Two corrective actions are recommended:

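Delete the existing __consumer_offsets topic (it cannot be removed directly; the underlying log files must be cleared). Note that this discards all committed offsets, so consumers restart from the position given by their auto.offset.reset setting.

Set offsets.topic.replication.factor to 3 in the broker configuration so that the recreated __consumer_offsets topic is replicated across all three brokers, eliminating the single point of failure. The setting only takes effect when the topic is created, which is why the existing topic has to be removed first.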

After applying these settings, the cluster regains full high‑availability behavior even when a broker crashes.
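The effect of the replication factor on a single broker failure can be checked with a small simulation. It assumes a simple round-robin placement over three brokers with ids 0, 1 and 2; real placement depends on the cluster state when the topic was created, so the counts are illustrative only. Both function names are hypothetical.

```python
def assign_replicas(num_partitions: int, brokers: list, replication_factor: int) -> dict:
    """Round-robin placement: partition p gets brokers p, p+1, ... (mod broker count)."""
    n = len(brokers)
    return {
        p: [brokers[(p + i) % n] for i in range(replication_factor)]
        for p in range(num_partitions)
    }


def offline_partitions(assignment: dict, failed_broker: int) -> list:
    """Partitions whose every replica lived on the failed broker."""
    return [
        p for p, replicas in assignment.items()
        if all(b == failed_broker for b in replicas)
    ]


brokers = [0, 1, 2]
rf1 = assign_replicas(50, brokers, replication_factor=1)
rf3 = assign_replicas(50, brokers, replication_factor=3)

lost_with_rf1 = len(offline_partitions(rf1, failed_broker=0))
lost_with_rf3 = len(offline_partitions(rf3, failed_broker=0))
```

Under this placement a replication factor of 1 leaves 17 of the 50 offsets partitions without any replica when broker 0 fails, while a replication factor of 3 leaves none.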
