Root Cause Analysis of Kafka Consumer Group Coordinator Failure and __consumer_offsets Compaction Issues
The article investigates a Kafka cluster outage where several brokers became unavailable and consumers could not join groups, explains the role of __consumer_offsets, analyzes the coordinator selection logic, identifies a stuck loadGroupsForPartition operation and compact thread failure, and documents the recovery steps taken.
After a Chinese New Year weekend outage left eight Kafka brokers down and many consumers stuck on the error "Attempt to join group xxxx failed due to obsolete coordinator information, retrying", the team performed a deep dive into Kafka's consumer-group mechanics.
Kafka does not have separate queue and topic concepts; instead it uses consumer groups where each group’s members are coordinated by a GroupCoordinator broker. The coordinator is chosen based on the hash of the consumer‑group name modulo the number of partitions of the internal __consumer_offsets topic. The chosen partition’s leader becomes the GroupCoordinator.
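The selection logic above can be sketched in a few lines. This is a minimal sketch, not Kafka's source: the partition count of 50 is the default value of offsets.topic.num.partitions, and the sign-bit mask mirrors Kafka's Utils.abs, which avoids the overflow of Math.abs(Integer.MIN_VALUE).

```java
public class CoordinatorPartition {

    // Default number of partitions of __consumer_offsets
    // (broker config offsets.topic.num.partitions).
    static final int NUM_OFFSETS_PARTITIONS = 50;

    // Mirrors Kafka's partitionFor(groupId): abs(hashCode) mod partition count.
    // Masking the sign bit (rather than Math.abs) keeps Integer.MIN_VALUE safe.
    static int partitionFor(String groupId) {
        return (groupId.hashCode() & 0x7fffffff) % NUM_OFFSETS_PARTITIONS;
    }

    public static void main(String[] args) {
        // The leader of this __consumer_offsets partition is the group's coordinator.
        System.out.println(partitionFor("my-consumer-group"));
    }
}
```

Because the mapping depends only on the group name and the partition count, every broker and client computes the same coordinator without extra coordination.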
During the incident the broker that should have acted as the coordinator did not recognize itself as such, so join-group requests failed. Inspection of the source code revealed the method responsible for confirming coordinator status:
def isGroupLocal(groupId: String): Boolean = loadingPartitions synchronized ownedPartitions.contains(partitionFor(groupId))
The ownedPartitions set is populated only after a broker successfully processes a LeaderAndIsrRequest for a __consumer_offsets partition and runs loadGroupsForPartition. In this case the loading thread was stuck in a file-read loop (messages.readInto(buffer, 0)) consuming 100% CPU, preventing the partition from being marked as owned and thus blocking coordinator recognition.
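The failure mode is easy to reproduce in miniature. The following is a hypothetical Java sketch, not Kafka's actual code: a read loop that retries unconditionally when a read makes no progress (for example over a truncated or corrupt segment) spins at 100% CPU exactly as the loading thread did, whereas checking the return value lets it terminate.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SafeReadLoop {

    // Reads the whole file, bailing out as soon as the channel makes no progress.
    // A loop that instead retries unconditionally on read <= 0 is the
    // 100%-CPU spin observed in the incident.
    static long drain(Path file) throws Exception {
        ByteBuffer buf = ByteBuffer.allocate(1024);
        long pos = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (true) {
                buf.clear();
                int read = ch.read(buf, pos);
                if (read <= 0) break; // EOF (-1) or no progress (0): stop, don't spin
                pos += read;
            }
        }
        return pos;
    }

    public static void main(String[] args) throws Exception {
        Path f = Files.createTempFile("segment", ".log");
        Files.write(f, new byte[4096]); // stand-in for a log segment
        System.out.println(drain(f));
        Files.delete(f);
    }
}
```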
To recover, the team first tried deleting the problematic __consumer_offsets partition on one broker and restarting it, but the broker simply re-synced the full (still corrupted) partition from the leader replica and the issue persisted. They then removed only the corrupted segment(s) of the partition, restarted the broker, and allowed the remaining healthy segments to be replicated, after which consumers resumed normal operation.
Further investigation showed that the __consumer_offsets topic, which stores consumer offsets, had grown to hundreds of gigabytes because the log-cleaner (compact) thread was failing. The compact process keeps only the latest offset per key by loading an in-memory map (OffsetMap) sized by log.cleaner.dedupe.buffer.size. With the default 128 MB buffer and a single cleaner thread, the map could not hold all entries from a large segment, causing the cleaner thread to crash and the topic to stop compacting.
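The arithmetic shows why 128 MB is easy to exhaust. A sketch under stated assumptions: 24 bytes per entry (a 16-byte MD5 hash plus an 8-byte offset, as in Kafka's SkimpyOffsetMap) and the default load factor of 0.9 from log.cleaner.io.buffer.load.factor.

```java
public class CleanerBufferSizing {

    // Rough capacity of the cleaner's dedupe map: buffer bytes are split
    // across cleaner threads, each entry costs bytesPerEntry, and only
    // loadFactor of the map is usable before it is considered full.
    static long maxEntries(long bufferBytes, int cleanerThreads,
                           int bytesPerEntry, double loadFactor) {
        return (long) ((double) bufferBytes / cleanerThreads / bytesPerEntry * loadFactor);
    }

    public static void main(String[] args) {
        long dedupeBuffer = 128L * 1024 * 1024; // log.cleaner.dedupe.buffer.size default
        int threads = 1;                        // log.cleaner.threads default
        int bytesPerEntry = 24;                 // 16-byte hash + 8-byte offset per key
        double loadFactor = 0.9;                // log.cleaner.io.buffer.load.factor default
        // Roughly 5 million unique keys per cleaning pass; a large,
        // long-uncompacted __consumer_offsets segment can exceed this.
        System.out.println(maxEntries(dedupeBuffer, threads, bytesPerEntry, loadFactor));
    }
}
```

With the defaults this comes to about five million keys, which bounds how large a segment the 0.9.0.1 cleaner could handle in one pass.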
Kafka 0.9.0.1, the version in use, required every key in a segment to fit in the cleaner's dedupe map for compaction to proceed; later versions relaxed this by allowing partial-segment compaction. The lack of monitoring for the cleaner thread meant the problem went unnoticed until it impacted consumer start-up latency and cluster stability.
The article concludes that deep code inspection is essential for diagnosing complex Kafka issues and recommends monitoring the log-cleaner, tuning log.cleaner.dedupe.buffer.size and log.cleaner.threads, and regularly reviewing __consumer_offsets health.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.