Big Data 12 min read

Understanding Kafka Consumer Group Rebalance and Timeout Mechanisms

This article explains how Kafka consumer groups assign partitions, the four situations that trigger a rebalance, the impact of consumer poll timeouts, and practical ways to tune max.poll.interval.ms and max.poll.records to avoid rebalance‑related errors.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
Understanding Kafka Consumer Group Rebalance and Timeout Mechanisms

Kafka Consumer Group Basics

The article introduces Kafka consumer groups, where multiple consumers share a topic and each partition is uniquely assigned to a consumer. When a consumer crashes or a new consumer joins, partition assignment can become unfair, leading to load imbalance.

When Rebalance Is Triggered

Kafka initiates a rebalance in four cases: (1) membership changes in the consumer group, (2) a consumer fails to poll within the configured interval, (3) the subscribed topic set changes, and (4) the number of partitions for a subscribed topic changes.

1. Consumer Timeout Practice

The author creates eight consumer threads in a group named Test-Group, switches to manual offset commits, and inserts a 15‑second sleep after processing each batch to provoke a timeout.

public void consume() {
    try {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("id = %d , partition = %d , offset = %d, key = %s, value = %s%n", id, record.partition(), record.offset(), record.key(), record.value());
            }
            try {
                TimeUnit.SECONDS.sleep(15);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            // manual offset commit
            consumer.commitSync();
        }
    } finally {
        consumer.close();
    }
}

Multiple consumers are started with:

public static void main(String[] args) throws InterruptedException {
    for (int i = 0; i < 8; i++) {
        final int id = i;
        new Thread() {
            @Override
            public void run() {
                new ReblanceConsumer(id).consume();
            }
        }.start();
    }
    TimeUnit.SECONDS.sleep(Long.MAX_VALUE);
}

During execution, the following poll‑timeout exception appears:

This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms…

When committing offsets after the timeout, another exception is thrown:

Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member…

Handling Consumer Timeouts

Kafka removes a consumer that exceeds max.poll.interval.ms because it cannot send heartbeats, causing the coordinator to consider it dead and trigger a rebalance. To avoid this, increase the interval, e.g.: props.put("max.poll.interval.ms", "60000"); or reduce the batch size: props.put("max.poll.records", "50"); Newer Kafka versions move heartbeat sending to a dedicated thread, but the interval still matters for detecting “hung” consumers.

2. Coordinator

Each consumer group has a coordinator that manages group membership and offset storage but does not assign partitions. Consumers send heartbeats; the coordinator checks liveness based on session.timeout.ms (default 10 s).

Election Mechanism

The coordinator broker is chosen by the formula Math.abs(hash(groupID)) % 50, where 50 is the internal partition count for consumer offsets.

3. Rebalance Process

The rebalance consists of two phases: Joining the Group (consumers request to join, a leader is randomly selected) and Synchronizing Group State (the leader gathers member info, computes partition assignments using RangeAssignor, RoundRobinAssignor, or StickyAssignor, and sends the result back via the coordinator).

4. Coordinator Lifecycle

The coordinator transitions through five states: Down, Initialize, Stable, Joining, and AwaitingSync. These states correspond to the steps of the rebalance workflow.

5. Generation Mechanism

Each rebalance increments a generation number. When a consumer attempts to commit offsets, it must include the current generation; the coordinator rejects commits from a consumer whose generation is outdated, preventing duplicate processing after a rebalance.

6. Leader Consumer

The leader consumer, chosen during the Joining phase, performs partition assignment and monitors topic changes; if a topic’s metadata changes, it notifies the coordinator to trigger another rebalance.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataKafkaTimeoutconsumer-groupoffset-commitrebalancemax.poll.interval.ms
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.