Why Does a Single Kafka Broker Failure Bring Down Your Consumers?
The article explains Kafka's high‑availability architecture, covering multi‑replica redundancy, ISR mechanisms, producer acknowledgment settings, and a real‑world case where a broker crash halted consumption due to the __consumer_offsets topic's replication factor, then offers concrete remediation steps.
Kafka High‑Availability Overview
The discussion begins with a Kafka outage in a fintech company that uses Kafka (instead of RabbitMQ) for log processing, highlighting a scenario where one of three broker nodes failed and the entire consumer group stopped receiving messages.
Multi‑Replica Redundancy Design
Kafka achieves high availability through replication of each partition across multiple brokers. Key concepts include:
Broker – a Kafka server instance, typically one per host (physical or virtual).
Topic – a logical channel identified by a name; producers write to a topic, consumers read from it.
Partition – each topic is split into one or more partitions; a partition resides on a single broker but can have replicas on other brokers.
Offset – the position of a message within a partition, used by consumers to track progress.
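Before Kafka 0.8, there was no replication; a broker failure made every partition on that node unavailable, and the data was lost if the disk could not be recovered. Since 0.8, each partition has a Leader replica and one or more Follower replicas. Producers write only to the leader, and by default consumers read from it as well; followers continuously sync data from the leader.
When a broker crashes, new leaders for the partitions it led are elected from the in-sync replica (ISR) list. If no ISR member survives, the outcome depends on unclean.leader.election.enable: when it is true, Kafka promotes a surviving out-of-sync replica, which restores availability but can lose any messages that replica had not yet copied; when it is false (the default in recent versions), the partition stays offline until an ISR member returns.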
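The election rule can be sketched as a toy model. This is not Kafka's controller code: the Partition class, the handle_broker_failure function, and the choice of the first ISR member as the new leader are illustrative assumptions, and the real metadata handling is more involved.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Partition:
    """Toy replication state for one partition (hypothetical, for illustration)."""
    replicas: List[int]     # broker ids that hold a replica
    isr: List[int]          # replicas currently in sync with the leader
    leader: Optional[int]   # broker id of the leader, None if the partition is offline


def handle_broker_failure(p: Partition, failed: int, live: Set[int],
                          unclean_election: bool = False) -> Partition:
    """Return the partition state after broker `failed` goes down."""
    live = live - {failed}
    isr = [b for b in p.isr if b != failed]
    if p.leader != failed:
        # A follower died: the leader is unchanged and the ISR shrinks.
        return Partition(p.replicas, isr, p.leader)
    if isr:
        # Clean election: every ISR member holds all committed messages.
        return Partition(p.replicas, isr, isr[0])
    survivors = [b for b in p.replicas if b in live]
    if unclean_election and survivors:
        # Unclean election: an out-of-sync replica takes over and may lack recent messages.
        return Partition(p.replicas, [survivors[0]], survivors[0])
    # No eligible leader: the partition is unavailable.
    return Partition(p.replicas, [], None)
```

With replicas [1, 2, 3] all in sync and broker 1 as leader, losing broker 1 promotes broker 2. If the ISR had already shrunk to the leader alone, the same failure leaves the partition without a leader unless unclean election is allowed.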
Producer Acknowledgment (acks) Settings
The acks parameter (request.required.acks in the legacy Scala producer) controls how many replicas must confirm a write before the producer considers it successful. The three possible values are:
0 – fire‑and‑forget; the producer does not wait for any acknowledgment, risking message loss.
1 – only the leader's acknowledgment is required; if the leader fails before followers sync, the message may be lost (this was the producer default before Kafka 3.0, which changed the default to all).
all (or -1) – the write is considered successful only when all ISR replicas have replicated the message, providing the strongest durability guarantee.
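Even with acks=all, if the ISR contains only the leader, the guarantee degrades to the same risk as acks=1. Setting min.insync.replicas to 2 or more closes this gap: the broker then rejects acks=all writes whenever the ISR is smaller than that value, instead of accepting them with a single copy.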
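A simplified model makes the three settings easier to compare. The guaranteed_copies function below is a hypothetical helper, not part of any Kafka client; it only encodes the rules described above, plus the broker-side min.insync.replicas setting (default 1), which applies to acks=all writes.

```python
from typing import Optional


def guaranteed_copies(acks: str, isr_size: int,
                      min_insync_replicas: int = 1) -> Optional[int]:
    """Replicas guaranteed to hold a message when the producer sees success.

    Returns None when the broker would reject the write outright.
    """
    if acks == "0":
        return 0            # nothing is confirmed, not even the leader
    if acks == "1":
        return 1            # only the leader has confirmed
    if acks in ("all", "-1"):
        if isr_size < min_insync_replicas:
            return None     # too few in-sync replicas: the write is rejected
        return isr_size     # every current ISR member has the message
    raise ValueError(f"unknown acks value: {acks!r}")
```

In this model guaranteed_copies("all", 1) and guaranteed_copies("1", 3) both return 1, which is the degradation the text describes, while guaranteed_copies("all", 1, min_insync_replicas=2) returns None because the write is refused rather than stored on a single replica.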
Troubleshooting a Single Broker Failure
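In a test environment with three brokers, a topic replication factor of 3, six partitions, and acks=1, the failure of one broker caused the entire consumer group to stop. The root cause was the internal __consumer_offsets topic, which had the default 50 partitions but a replication factor of only 1, making it a single point of failure. The broker-level default for offsets.topic.replication.factor is 3, so a factor of 1 typically comes from an explicit setting, such as the single-node sample server.properties, or from the topic having been created while only one broker was running.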
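Why an unreplicated offsets topic stops a whole group can be shown with a small sketch. Kafka maps each group to one partition of __consumer_offsets by hashing the group id, and the broker leading that partition acts as the group's coordinator. The code below reproduces that mapping as I understand it (Java's String.hashCode, absolute value, modulo the partition count); the assignment dictionaries and the group_can_commit helper are hypothetical.

```python
OFFSETS_PARTITIONS = 50  # default value of offsets.topic.num.partitions


def java_string_hash(s: str) -> int:
    """Java's String.hashCode() as a signed 32-bit int (exact for BMP characters)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def offsets_partition_for(group_id: str, partitions: int = OFFSETS_PARTITIONS) -> int:
    """Partition of __consumer_offsets that stores this group's offsets."""
    h = java_string_hash(group_id)
    h = 0 if h == -0x80000000 else abs(h)  # Integer.MIN_VALUE has no positive counterpart
    return h % partitions


def group_can_commit(group_id: str, assignment: dict, live_brokers: set) -> bool:
    """True if some replica of the group's offsets partition is on a live broker."""
    replicas = assignment[offsets_partition_for(group_id)]
    return any(b in live_brokers for b in replicas)


# Hypothetical layouts: every partition on broker 0 only, versus three replicas each.
rf1 = {p: [0] for p in range(OFFSETS_PARTITIONS)}
rf3 = {p: [0, 1, 2] for p in range(OFFSETS_PARTITIONS)}
```

With the rf1 layout, losing broker 0 leaves no replica for any group's offsets partition, so group_can_commit returns False for every group even though the data topics still have two healthy replicas. With rf3 the same failure is survivable.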
Resolution Steps
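Delete the existing __consumer_offsets topic (the built-in topic cannot be removed via CLI, so the log directories were manually deleted). This discards all committed offsets, so consumer groups resume from the position given by their auto.offset.reset setting.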
Configure offsets.topic.replication.factor=3 so that the __consumer_offsets topic is created with three replicas, matching the broker count.
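Because offsets.topic.replication.factor only takes effect when the offsets topic is first created, it is worth checking the broker configuration before the cluster starts. The following is a minimal sketch of such a check; parse_properties and offsets_replication_warnings are hypothetical helpers, and the fallback of 3 reflects the broker default when the key is absent.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def offsets_replication_warnings(text: str, broker_count: int) -> list:
    """Return human-readable warnings about the offsets topic replication factor."""
    props = parse_properties(text)
    rf = int(props.get("offsets.topic.replication.factor", "3"))
    warnings = []
    if rf < 2:
        warnings.append(f"replication factor {rf}: one broker failure stops offset commits")
    if rf > broker_count:
        warnings.append(f"replication factor {rf} exceeds the {broker_count} available brokers")
    return warnings
```

For the three-broker cluster in this case, a server.properties containing offsets.topic.replication.factor=1 yields one warning, and the corrected value of 3 yields none.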
After applying these changes, the cluster maintains consumer progress even when a broker goes down. The article also invites readers to discuss why the __consumer_offsets partitions were initially placed on a single broker.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
