Understanding Kafka Consumer Group Rebalance and Timeout Mechanisms
This article explains how Kafka consumer groups assign partitions, the four situations that trigger a rebalance, the impact of consumer poll timeouts, and practical ways to tune max.poll.interval.ms and max.poll.records to avoid rebalance‑related errors.
Kafka Consumer Group Basics
The article introduces Kafka consumer groups, where multiple consumers share a topic and each partition is uniquely assigned to a consumer. When a consumer crashes or a new consumer joins, partition assignment can become unfair, leading to load imbalance.
When Rebalance Is Triggered
Kafka initiates a rebalance in four cases: (1) membership changes in the consumer group, (2) a consumer fails to poll within the configured interval, (3) the subscribed topic set changes, and (4) the number of partitions for a subscribed topic changes.
1. Consumer Timeout Practice
The author creates eight consumer threads in a group named Test-Group, switches to manual offset commits, and inserts a 15‑second sleep after processing each batch to provoke a timeout.
public void consume() {
try {
while (true) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
for (ConsumerRecord<String, String> record : records) {
System.out.printf("id = %d , partition = %d , offset = %d, key = %s, value = %s%n", id, record.partition(), record.offset(), record.key(), record.value());
}
try {
TimeUnit.SECONDS.sleep(15);
} catch (InterruptedException e) {
e.printStackTrace();
}
// manual offset commit
consumer.commitSync();
}
} finally {
consumer.close();
}
}Multiple consumers are started with:
public static void main(String[] args) throws InterruptedException {
for (int i = 0; i < 8; i++) {
final int id = i;
new Thread() {
@Override
public void run() {
new ReblanceConsumer(id).consume();
}
}.start();
}
TimeUnit.SECONDS.sleep(Long.MAX_VALUE);
}During execution, the following poll‑timeout exception appears:
This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms…
When committing offsets after the timeout, another exception is thrown:
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member…
Handling Consumer Timeouts
Kafka removes a consumer that exceeds max.poll.interval.ms because it cannot send heartbeats, causing the coordinator to consider it dead and trigger a rebalance. To avoid this, increase the interval, e.g.: props.put("max.poll.interval.ms", "60000"); or reduce the batch size: props.put("max.poll.records", "50"); Newer Kafka versions move heartbeat sending to a dedicated thread, but the interval still matters for detecting “hung” consumers.
2. Coordinator
Each consumer group has a coordinator that manages group membership and offset storage but does not assign partitions. Consumers send heartbeats; the coordinator checks liveness based on session.timeout.ms (default 10 s).
Election Mechanism
The coordinator broker is chosen by the formula Math.abs(hash(groupID)) % 50, where 50 is the internal partition count for consumer offsets.
3. Rebalance Process
The rebalance consists of two phases: Joining the Group (consumers request to join, a leader is randomly selected) and Synchronizing Group State (the leader gathers member info, computes partition assignments using RangeAssignor, RoundRobinAssignor, or StickyAssignor, and sends the result back via the coordinator).
4. Coordinator Lifecycle
The coordinator transitions through five states: Down, Initialize, Stable, Joining, and AwaitingSync. These states correspond to the steps of the rebalance workflow.
5. Generation Mechanism
Each rebalance increments a generation number. When a consumer attempts to commit offsets, it must include the current generation; the coordinator rejects commits from a consumer whose generation is outdated, preventing duplicate processing after a rebalance.
6. Leader Consumer
The leader consumer, chosen during the Joining phase, performs partition assignment and monitors topic changes; if a topic’s metadata changes, it notifies the coordinator to trigger another rebalance.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
