Operations · 10 min read

Kafka Outage and High Availability Mechanisms

This article examines a Kafka outage at a fintech company, explains Kafka's multi-replica redundancy design, leader-follower architecture, and ISR mechanism, shows how a misconfigured __consumer_offsets topic can cause cluster-wide consumer failures, and provides fixes that deliver true high availability.


The article starts with a real‑world incident where one of three Kafka broker nodes crashed, causing all consumers to stop receiving messages despite the remaining two nodes being up.

It then describes Kafka's multi-replica redundancy design, covering the physical model (brokers) and the logical model (topics, partitions, offsets). Each broker is a Kafka server; a topic groups related messages; a partition is a sub-division of a topic; and an offset identifies a message's position within a partition.

From version 0.8 onward, Kafka supports replication: every partition has one leader replica and one or more follower replicas. Clients interact only with the leader, while followers stay in sync via the ISR (In-Sync Replica) mechanism, which maintains the set of followers that are sufficiently up to date with the leader.
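The ISR rule can be sketched in a few lines. This is a simplification, not broker code: the class and names (`IsrSketch`, `Follower`, `inSyncReplicas`) are illustrative, and the 30-second threshold mirrors the `replica.lag.time.max.ms` broker default in recent Kafka versions (older versions used a lower default).

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the ISR rule: a follower stays in the in-sync replica set only if
// it has caught up to the leader's log end within replica.lag.time.max.ms.
public class IsrSketch {
    static final long REPLICA_LAG_TIME_MAX_MS = 30_000; // assumed broker default

    record Follower(int brokerId, long lastCaughtUpMs) {}

    static List<Integer> inSyncReplicas(List<Follower> followers, long nowMs) {
        return followers.stream()
                .filter(f -> nowMs - f.lastCaughtUpMs() <= REPLICA_LAG_TIME_MAX_MS)
                .map(Follower::brokerId)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // broker 1 caught up 10 s ago (in sync); broker 2 lagging 40 s (dropped)
        System.out.println(inSyncReplicas(
                List.of(new Follower(1, 50_000), new Follower(2, 20_000)), 60_000));
        // prints [1]
    }
}
```

A follower that falls out of this set stops counting toward `acks=all`, which is exactly why the ISR list matters for the durability settings discussed next.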

The article also explains the producer acknowledgment setting request.required.acks (the legacy name; the modern Java client calls it simply acks), with values 0, 1, and all. Setting it to 0 provides no durability guarantee, 1 guarantees only that the leader received the message, and all (or -1) requires every ISR member to acknowledge, offering the strongest durability guarantee.
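These levels map to a single producer property. A minimal sketch, assuming the modern Java client (where the setting is named acks):

```properties
# Producer side: wait for every in-sync replica to acknowledge each write
acks=all
# Retries keep transient leader changes from surfacing as send failures
retries=3
```

Note that acks=all quietly degrades to the strength of acks=1 if the ISR shrinks to the leader alone; setting min.insync.replicas on the broker or topic (e.g. 2 with three replicas) makes such writes fail fast instead.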

A critical issue identified is the built-in __consumer_offsets topic, which in the affected cluster had been created with a replication factor of 1 and the default 50 partitions. With a single replica, each partition lives on exactly one broker; when the broker holding a group's offset partition crashed, the entire consumer group stopped, creating a single point of failure.
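Why does one broker's failure hit a whole group? Kafka maps each consumer group to exactly one partition of __consumer_offsets by hashing the group id. A minimal sketch of that mapping (class and method names are illustrative; the sign-bit mask mirrors how Kafka takes a non-negative hash):

```java
// Sketch of how Kafka maps a consumer group to a __consumer_offsets partition:
// a non-negative hash of the group id, modulo the offsets topic partition count.
public class OffsetsPartitionSketch {
    static final int OFFSETS_TOPIC_PARTITIONS = 50; // default offsets.topic.num.partitions

    static int partitionFor(String groupId) {
        // mask the sign bit instead of Math.abs (safe even for Integer.MIN_VALUE)
        return (groupId.hashCode() & 0x7fffffff) % OFFSETS_TOPIC_PARTITIONS;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("payments-consumer"));
    }
}
```

With replication factor 1, the group's single offset partition has no fallback leader, so losing its broker stalls every consumer committing to it.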

To resolve this, the article suggests two steps: (1) delete the problematic __consumer_offsets topic by removing its log files, and (2) set offsets.topic.replication.factor to 3, matching the number of brokers, so that the offsets topic is recreated with multi-replica redundancy.
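The second step is a broker-side setting. A sketch, assuming it is applied on every broker before the offsets topic is (re)created, since changing it later does not retroactively re-replicate existing partitions:

```properties
# server.properties on every broker
offsets.topic.replication.factor=3
# 50 is the default partition count for __consumer_offsets; shown for context
offsets.topic.num.partitions=50
```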

Finally, it emphasizes that with proper replication (e.g., three replicas across three brokers) and a correct acks setting, Kafka can achieve true high availability, while also noting that Kafka halts operations once more than half of the brokers are down, a failure mode that capacity planning must avoid.

Tags: high-availability, Kafka, Replication, ISR, Ack, consumer offset
Written by

Selected Java Interview Questions

A professional Java tech channel sharing core knowledge to help developers fill gaps. Follow us!
