Why a Single Kafka Broker Failure Stops All Consumers – Understanding HA
This article explains Kafka's high-availability mechanisms, covering the multi-replica design, ISR synchronization, leader election, the impact of the request.required.acks setting, and how the internal __consumer_offsets topic's default configuration can become a single point of failure, with concrete steps to fix it.
The story starts with a Kafka outage at a fintech company that uses Kafka instead of RabbitMQ, prompting an investigation into why a single broker failure caused all consumers to stop receiving messages.
Kafka's Multi‑Replica Redundancy Design
High availability in distributed systems like Zookeeper, Redis, Kafka, and HDFS is typically achieved through redundancy.
Key Kafka concepts:
Broker (node): a Kafka server instance.
Topic: a logical category for messages; producers send to a topic name, consumers read from it.
Partition: each topic is split into one or more partitions, each led by a single broker.
Offset: the position of a message within a partition, used by consumers to track progress.
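The concepts above can be sketched as plain data. The snippet below is a toy model (the topic name, message contents, and helper function are invented for illustration, not Kafka APIs): a topic splits into partitions, each partition lives on a leader broker with replicas elsewhere, and a consumer's progress is nothing more than an offset per partition.

```python
# Hypothetical metadata for a 3-broker cluster. Each partition has one
# leader broker, a replica list, and an append-only log of messages whose
# positions are the offsets.
cluster = {
    "topic": "payments",
    "partitions": {
        0: {"leader": 1, "replicas": [1, 2, 3], "log": ["m0", "m1", "m2"]},
        1: {"leader": 2, "replicas": [2, 3, 1], "log": ["m0", "m1"]},
    },
}

# A consumer's progress is just a committed offset per partition.
consumer_offsets = {0: 1, 1: 2}

def next_message(partition_id):
    """Return the next unread message for this consumer, or None if caught up."""
    log = cluster["partitions"][partition_id]["log"]
    offset = consumer_offsets[partition_id]
    return log[offset] if offset < len(log) else None

print(next_message(0))  # "m1": offset 1 in partition 0 is still unread
print(next_message(1))  # None: partition 1 is fully consumed
```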
When a broker goes down, its partitions have replicas on other brokers. If the leader of a partition fails, a follower from the ISR (In‑Sync Replica) list is elected as the new leader, allowing producers and consumers to continue operating.
A typical replication factor of 3 provides a good balance between fault tolerance and resource usage.
Followers and leaders are not fully synchronous; they maintain an ISR list of replicas that are sufficiently up‑to‑date. If a follower falls behind, it is removed from the ISR.
Leader election after a broker failure follows the ISR list; if the ISR is empty, a surviving out-of-sync replica can become leader only when unclean leader election is enabled, at the risk of losing messages the old leader had not yet replicated.
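The election logic described above can be sketched as a short simulation. This is a toy model of the controller's decision, not real Kafka code; the function name and arguments are invented for illustration:

```python
# Toy model of ISR-based leader election for one partition.
def elect_leader(replicas, isr, live_brokers, allow_unclean=False):
    """Pick a new leader after the current leader dies.

    replicas: all assigned replica broker ids, in preference order
    isr: broker ids currently in the In-Sync Replica set
    live_brokers: broker ids still alive
    """
    # Preferred path: the first live replica that is still in the ISR.
    for broker in replicas:
        if broker in isr and broker in live_brokers:
            return broker
    # ISR exhausted: allowed only with unclean leader election, which may
    # lose messages the dead leader had not replicated.
    if allow_unclean:
        for broker in replicas:
            if broker in live_brokers:
                return broker
    return None  # partition goes offline

# Broker 1 (the leader) crashes; broker 2 is in the ISR and takes over.
print(elect_leader(replicas=[1, 2, 3], isr=[1, 2], live_brokers=[2, 3]))  # 2
# The ISR contained only the dead leader: with clean election the
# partition stays offline; unclean election would pick broker 2 instead.
print(elect_leader(replicas=[1, 2, 3], isr=[1], live_brokers=[2, 3]))  # None
```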
Ack Parameter Determines Reliability
The producer configuration request.required.acks (named acks in newer clients) controls how many acknowledgments must arrive before a send is considered successful:
0: fire-and-forget; messages may be lost.
1 (the historical default): only the leader must acknowledge; if the leader crashes before followers replicate, data can be lost.
all (or -1): the leader and every ISR follower must acknowledge, providing the strongest durability guarantee; this only holds when min.insync.replicas is at least 2, so a lone leader cannot acknowledge by itself.
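Concretely, a durability-focused setup might look like the following sketch. The min.insync.replicas value is a broker- or topic-level setting rather than a producer setting; it is shown alongside for context:

```properties
# producer.properties (sketch): require acknowledgment from all ISR replicas
acks=all

# server.properties or topic config (sketch): refuse writes unless at least
# 2 replicas are in sync, so acks=all cannot degrade to a lone leader
min.insync.replicas=2
```

With this pair of settings, a produce request fails loudly when the ISR shrinks below 2 instead of being silently accepted by a single surviving replica.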
Solving the Problem
The root cause was the internal __consumer_offsets topic, which in this cluster had a replication factor of 1 and its default 50 partitions, all residing on a single broker. When that broker failed, consumers could no longer read or commit offsets, halting consumption.
Fixes:
Delete the existing __consumer_offsets topic (it cannot be removed through normal tooling, so delete its log files from disk).
Set offsets.topic.replication.factor to 3 in the broker configuration, ensuring the offsets topic is replicated across three brokers.
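The relevant broker settings, sketched as a server.properties fragment (these apply when the offsets topic is first created, which is why the existing single-replica topic had to be removed first):

```properties
# server.properties (sketch)
# Replicate the internal offsets topic across three brokers.
offsets.topic.replication.factor=3

# Partition count of the offsets topic; 50 is the default, shown explicitly.
offsets.topic.num.partitions=50

# Sensible default for auto-created application topics as well.
default.replication.factor=3
```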
After increasing the replication factor, the offset topic is no longer a single point of failure, and the cluster remains operational even when one broker is down.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely-read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career, growing together.