Why Kafka Consumer Rebalance Stops Your Apps and How to Fix It
This article explains Kafka consumer group rebalance, covering core concepts, trigger conditions, detailed protocol steps, common pitfalls like long pause times, and modern improvements such as static membership and incremental cooperative rebalance, plus practical configuration tips to minimize disruptions.
Introduction
Message queues are essential for server-side systems, and Kafka is a leading choice. Understanding Kafka's design, especially consumer group rebalance, is crucial for diagnosing consumption delays.
Key Concepts
Consumer Group
A consumer group provides scalable, fault‑tolerant consumption. Multiple consumer instances share a group ID and collectively consume all partitions of a topic.
One or more consumer instances per group (process or thread).
group.id uniquely identifies the group.
Each partition of a subscribed topic is assigned to only one consumer within the group.
Group Coordinator
The Group Coordinator service runs on a broker, stores group metadata, and writes offsets to the internal __consumer_offsets topic. It is selected based on the hash of the group ID.
Partition calculation: partition‑Id(__consumer_offsets) = Math.abs(groupId.hashCode() % groupMetadataTopicPartitionCount) (default 50 partitions).
Rebalance Purpose
When consumers join or leave a group, partition assignments must be recomputed; otherwise some partitions may have no consumer or some consumers may have no partitions.
Rebalance Triggers
Change in the number of group members (join/leave).
Change in the number of partitions of a subscribed topic.
Subscription changes (add/remove topics).
Rebalance Process (Kafka 1.1.1)
Consumer sends FindCoordinatorRequest to any broker to discover the Group Coordinator.
Broker returns ConsumerMetadataResponse with the coordinator address.
Consumer sends JoinGroupRequest to the coordinator.
Coordinator stores the request and waits for all members.
Coordinator selects a group leader and decides the partition assignment strategy, then returns JoinGroupResponse to the leader.
All consumers receive JoinGroupResponse; only the leader gets full member info.
All consumers send SyncGroupRequest. The leader includes the assignment; others send empty requests.
Coordinator returns SyncGroupResponse with the final partition mapping.
Consumers continue sending heartbeats ( heartbeat.interval.ms). If a rebalance is in progress, the coordinator may return an IllegalGeneration error, prompting a rejoin.
Rebalance Problems
During rebalance, all partitions are revoked, causing a pause in consumption. If a consumer does not send a JoinGroupRequest before max.poll.interval.ms (default 5 minutes), the rebalance can last up to that timeout, severely impacting production.
Rebalance Improvements
Static Membership (Kafka 2.3+)
Consumers can set group.instance.id as a unique identifier. The coordinator records the mapping and, when the same static member rejoins, it reuses the previous partition assignment, avoiding a full rebalance. Rebalance occurs only for new members, leader rejoin, member timeout, or explicit leave.
Incremental Cooperative Rebalancing
This approach splits a full rebalance into multiple small, cooperative steps:
Consumers compare old and new assignments and only revoke partitions that changed.
Multiple rounds of partial rebalances eventually achieve the global assignment without stopping consumption.
The article illustrates three rounds of incremental rebalance with example consumers C1, C2, C3, showing how delays and sync responses allow continuous processing.
Practical Recommendations
Optimize business logic to reduce processing time of problematic messages.
Set max.poll.interval.ms larger than the worst‑case processing time.
Configure session.timeout.ms and heartbeat.interval.ms such that session.timeout.ms ≥ 3 × heartbeat.interval.ms (e.g., 6 s and 2 s) to detect dead consumers quickly.
Consider enabling static membership and incremental cooperative rebalance for smoother scaling.
By understanding and tuning these mechanisms, you can prevent consumer stalls and maintain high throughput in Kafka deployments.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java Interview Crash Guide
Dedicated to sharing Java interview Q&A; follow and reply "java" to receive a free premium Java interview guide.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
