Why a Single Kafka Broker Crash Can Halt All Consumers – Kafka High Availability Explained
An in-depth look at Kafka's high-availability architecture: how multi-replica redundancy, the ISR mechanism, and the configuration of the internal `__consumer_offsets` topic interact, why a single broker failure can halt every consumer in the cluster, and how to configure replication and acks settings to prevent it.
1. High‑Availability Issues Triggered by a Kafka Outage
The problem began with a Kafka node failure. At a fintech company that had adopted Kafka over RabbitMQ, the cluster ran stably until one of its three broker nodes went down; a consumer stopped receiving messages, and the entire consumer group lost access.
2. Kafka’s Multi‑Replica Redundancy Design
High availability in distributed systems such as ZooKeeper, Redis, Kafka, and HDFS is typically achieved through redundancy. Key Kafka concepts are:
Broker (node): a Kafka server, i.e., a physical node.
Topic: a logical category for messages; producers send to a topic name and consumers read from it.
Partition: each topic is split into one or more partitions; a partition belongs to a single broker and holds an ordered queue of messages.
Offset: the position of a message within a partition; consumers use offsets to track which messages they have read.
Before Kafka 0.8 there was no replica mechanism, so a broker crash caused data loss for its partitions. Since 0.8, each partition has a leader replica and multiple follower replicas. Producers and consumers interact only with the leader; followers asynchronously sync data from the leader.
When a broker fails, new leaders for its partitions are elected from the ISR (In-Sync Replica) list. If the ISR is empty, a new leader may be chosen from surviving out-of-sync replicas (only if unclean leader election is enabled), which risks losing committed data.
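The failover logic above can be sketched in a few lines of Python. This is an illustrative model only, not Kafka's actual controller code; the function name and parameters are invented for the example:

```python
# Minimal sketch of ISR-based leader election (illustrative only,
# not Kafka's real implementation).

def elect_leader(isr, all_replicas, alive, unclean_ok=False):
    """Pick a new leader after the current one fails.

    isr: replicas known to be in sync with the old leader;
    all_replicas: the partition's full replica assignment;
    alive: brokers still running; unclean_ok: allow electing an
    out-of-sync replica (may lose committed messages).
    Returns (leader, data_loss_risk).
    """
    # Prefer a surviving in-sync replica: no data loss.
    for r in isr:
        if r in alive:
            return r, False
    # ISR exhausted: optionally fall back to any surviving replica.
    if unclean_ok:
        for r in all_replicas:
            if r in alive:
                return r, True  # may lose committed messages
    return None, False  # no candidate: the partition goes offline

# Broker 1 (the leader) dies; broker 2 is still in the ISR,
# so it takes over with no data-loss risk.
leader, risky = elect_leader(isr=[1, 2], all_replicas=[1, 2, 3], alive={2, 3})
```

Note how an empty surviving ISR leaves the partition offline unless unclean election is explicitly allowed, which trades availability back for a real risk of data loss.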
3. The acks Parameter Determines Reliability
Setting acks=0
The producer sends the message without waiting for any acknowledgment, so the message is lost if the broker crashes before persisting it.
Setting acks=1
The producer considers the send successful once the leader acknowledges receipt; followers may not have synced yet, so a leader crash can still lose the message. This has historically been the producer default (since Kafka 3.0, client defaults changed to acks=all).
Setting acks=all (or -1)
The send succeeds only when the leader and all followers currently in the ISR have replicated the message. However, if the ISR has shrunk to the leader alone, acks=all degenerates to acks=1; pairing it with min.insync.replicas >= 2 makes such sends fail fast instead of silently losing durability.
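The three acks levels, and the degenerate case where the ISR shrinks to the leader alone, can be captured in a small Python sketch. This models only the decision rule, not the real producer protocol; the function and its parameters are invented for the example:

```python
# Illustrative sketch of how the acks setting gates a "successful" send.
# Replica state is modeled as plain sets; this is not the wire protocol.

def send_acknowledged(acks, leader_ok, synced_followers, isr_followers):
    """Return True if the producer would treat the send as successful.

    acks: 0, 1, or "all"; leader_ok: the leader persisted the message;
    synced_followers: follower replicas that have the message;
    isr_followers: follower replicas currently in the ISR.
    """
    if acks == 0:
        return True   # fire-and-forget: no broker confirmation at all
    if acks == 1:
        return leader_ok  # only the leader must have the message
    # acks="all" (-1): every follower still in the ISR must have it
    return leader_ok and isr_followers <= synced_followers

# The degenerate case from the text: with the ISR shrunk to the leader
# alone (no ISR followers), acks="all" behaves exactly like acks=1.
assert send_acknowledged("all", True, set(), set()) == \
       send_acknowledged(1, True, set(), set())
```

This is why acks=all alone is not a durability guarantee: the check is against the *current* ISR, which can legally shrink to just the leader unless min.insync.replicas forbids it.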
4. Solving the Problem
The test environment had 3 brokers, a topic replication factor of 3, 6 partitions, and acks=1. When one broker went down, the cluster elected new partition leaders from the ISR, but the internal `__consumer_offsets` topic had a replication factor of 1, making it a single point of failure: consumers whose offsets lived on the dead broker could no longer commit or fetch them. The topic cannot be deleted through normal admin commands, so the author removed its log files to force re-creation, then set `offsets.topic.replication.factor=3` so it would be created with three replicas.
After replicating `__consumer_offsets`, a broker crash no longer stops consumers from consuming.
To summarize the fix:
1. Delete the existing single-replica `__consumer_offsets` topic (by removing its log files).
2. Set `offsets.topic.replication.factor=3` so the offsets topic is re-created with three replicas.
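On the broker side this is a one-line change in `server.properties` (a sketch; note that this setting only takes effect when the offsets topic is auto-created, which is why the existing single-replica topic had to be removed first):

```properties
# server.properties on each broker: replicate the internal offsets topic
# so a single broker failure cannot take down offset storage.
offsets.topic.replication.factor=3
```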
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and will accompany you throughout your operations career as we grow together.