
Understanding Kafka High Availability: Replication, ISR, and Consumer Offset Issues

This article explains Kafka's high‑availability mechanisms—including multi‑replica design, ISR synchronization, and the impact of the request.required.acks setting—while diagnosing why a single‑replica __consumer_offsets topic can cause the entire consumer group to stop when a broker fails.

IT Architects Alliance

The discussion starts with a real‑world incident: one of three Kafka broker nodes crashed and all consumers stopped receiving messages. That outage prompted an investigation into Kafka's high‑availability design.

Kafka's Multi‑Replica Redundancy Design

Kafka achieves HA through replication of partitions across multiple brokers. Key concepts include:

Broker (node): a single Kafka server.

Topic: a logical category for messages; producers write to a topic, consumers read from it.

Partition: a topic is split into one or more partitions, each residing on a broker.

Offset: the position of a message within a partition, used by consumers to track progress.
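The partition/offset relationship can be sketched with a minimal in‑memory model (illustrative only; a real Kafka partition is an append‑only log persisted in segment files on disk):

```python
class Partition:
    """Toy model of a Kafka partition: an append-only message log."""

    def __init__(self):
        self.log = []

    def append(self, message):
        """Append a message and return its offset."""
        self.log.append(message)
        return len(self.log) - 1

    def read(self, offset):
        """Read the message stored at a given offset."""
        return self.log[offset]


# A producer appends messages; each one gets a monotonically increasing
# offset that a consumer later uses to track its read position.
p = Partition()
first = p.append("order-created")   # offset 0
second = p.append("order-paid")     # offset 1
```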

With a replication factor of three, each partition has a leader and two followers. If a broker (and thus a leader) fails, a follower from the ISR (In‑Sync Replica) list is elected as the new leader, allowing producers and consumers to continue operating.
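The failover described above can be illustrated with a toy election function (a simplified model; the real controller consults cluster metadata and prefers replicas in assignment order, which this sketch mimics):

```python
def elect_leader(assigned_replicas, isr, live_brokers):
    """Toy model of Kafka leader election: pick the first replica in
    assignment order that is both in the ISR and still alive."""
    for broker in assigned_replicas:
        if broker in isr and broker in live_brokers:
            return broker
    return None  # no eligible replica: the partition goes offline


# Partition assigned to brokers 1, 2, 3 with all three in the ISR;
# broker 1 (the current leader) crashes, so broker 2 takes over.
new_leader = elect_leader([1, 2, 3], isr={1, 2, 3}, live_brokers={2, 3})
```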

The article also addresses common questions: how many replicas are sufficient (typically three), how ISR works to ensure followers stay in sync, and the leader election process that relies on the controller to avoid split‑brain scenarios.

Ack Parameter Determines Reliability

The producer configuration request.required.acks (called simply acks in the modern Java producer) controls how many acknowledgments are required for a send to be considered successful. The three possible values are:

0 – fire‑and‑forget; no guarantee of delivery.

1 – the leader must acknowledge; followers may lag, risking data loss if the leader crashes.

all (or -1) – every replica currently in the ISR must acknowledge, providing the strongest durability guarantee; pair it with min.insync.replicas=2 (or higher) so that writes fail fast instead of silently degrading when the ISR shrinks to a single replica.

Setting acks=1 (the producer default before Kafka 3.0, which changed it to acks=all) balances throughput and durability but does not guarantee HA in all failure scenarios.
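The three settings can be summarized as a small decision function (a simplified model of broker behavior; the real broker additionally enforces min.insync.replicas for acks=all):

```python
def send_succeeds(acks, leader_wrote, isr_acks, isr_size):
    """Simplified model of when a produce request is acknowledged.

    acks         -- "0", "1", or "all"/"-1"
    leader_wrote -- whether the leader appended the batch to its log
    isr_acks     -- number of ISR followers that replicated the batch
    isr_size     -- current ISR size, excluding the leader
    """
    if acks == "0":
        return True                 # fire-and-forget: never waits
    if acks == "1":
        return leader_wrote         # leader only; followers may lag
    # "all" / "-1": every current ISR follower must have replicated
    return leader_wrote and isr_acks == isr_size
```

With acks=1, a send is "successful" even when no follower has caught up yet, which is exactly the window in which a leader crash loses data.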

Solution

In the author's test environment there were three brokers, a topic with replication factor 3 and six partitions, and acks=1. When one broker went down, the cluster elected new leaders from the ISR, but the internal __consumer_offsets topic (which stores consumer offsets) had a replication factor of 1 and 50 partitions, creating a single point of failure.

To resolve the issue:

Delete the existing __consumer_offsets topic (internal topics cannot be deleted through the normal tooling; the author removed its log files from disk instead).

Configure offsets.topic.replication.factor=3 so that the recreated __consumer_offsets topic is replicated across all three brokers, eliminating the outage when a broker fails.

These steps ensure that consumer offset information is also highly available, preventing the entire consumer group from stopping when a single broker crashes.
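A minimal sketch of the relevant broker configuration (these are real Kafka broker property names; note they only take effect when the offsets topic is first created, which is why the author had to remove the old topic first):

```properties
# server.properties on each broker
offsets.topic.replication.factor=3
# partition count of the offsets topic (Kafka's default is 50)
offsets.topic.num.partitions=50
```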

Author: JanusWoo Source: https://juejin.im/post/6874957625998606344

Tags: High Availability, Kafka, Replication, ISR, Ack, consumer offset
Written by

IT Architects Alliance

A community for discussing systems, internet-scale, large distributed, high-availability, and high-performance architectures, along with big data, machine learning, AI, and architecture evolution with internet technologies. Includes real-world large-scale architecture case studies. Open to architects who have ideas and enjoy sharing.
