Mastering Kafka Rebalance: Prevent Backlog, Duplicates, and Data Loss
When Kafka consumer groups rebalance, partitions are reassigned, often causing message backlog, duplicate processing, or loss; understanding the triggers, impact, and optimization techniques—like tuning timeouts, managing offset commits, and using sticky assignors—can keep your streaming pipelines reliable.
During a production incident the order‑topic in Kafka accumulated over 100,000 pending messages, causing downstream payment services to stall. Investigation revealed that although the consumer group had three nodes, only one was actively consuming because a rebalance had been triggered ten minutes earlier, leaving the other two nodes stuck in partition‑reassignment.
When Does Rebalance Occur?
Rebalance is the process of redistributing partitions among consumers in a group. It is triggered whenever the mapping between consumers and partitions has to change. Common scenarios include:
1. Consumer count changes (most frequent)
Scale‑out: Adding a consumer during a traffic peak (for example, a third consumer joining a three-partition topic) forces a rebalance so that each partition can be handled by a separate consumer.
Scale‑down: When a consumer crashes, is network‑disconnected, or its pod is killed, the remaining consumers must take over its partitions, causing a rebalance.
In our logs we saw frequent pod restarts on Kubernetes due to resource exhaustion, each restart triggering a rebalance and worsening the backlog.
2. Topic partition count increases
Kafka does not support decreasing partitions. When new partitions are added, existing consumer groups do not automatically see them; a rebalance is required to assign the new partitions.
For example, expanding order-topic from 5 to 8 partitions leaves the original consumers processing only the first five until a rebalance distributes the three new partitions.
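If the expansion is done from code rather than the command line, Kafka's Java AdminClient can request the new partition count. A minimal sketch, assuming a broker at localhost:9092 and reusing the topic name from the example above:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class ExpandPartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow order-topic from 5 to 8 partitions; Kafka only allows increases.
            admin.createPartitions(Map.of("order-topic", NewPartitions.increaseTo(8)))
                 .all().get();
        }
        // Existing consumer groups pick up the three new partitions only after a rebalance.
    }
}
```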
3. Subscribed topics change
When a consumer group calls subscribe() with a modified topic list—e.g., adding pay-topic alongside order-topic—Kafka triggers a rebalance to reassign partitions for all subscribed topics.
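For illustration, the subscription change might look like this with the Java consumer (topic names taken from the example above):

```java
import java.util.List;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SubscriptionChange {
    static void addPayTopic(KafkaConsumer<String, String> consumer) {
        // Re-subscribing with an extra topic; the coordinator detects the change and
        // reassigns partitions for every subscribed topic, not just the new one.
        consumer.subscribe(List.of("order-topic", "pay-topic"));
    }
}
```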
4. Heartbeat or poll timeout (hidden pitfalls)
Consumers send heartbeats to the coordinator to prove they are alive. Mis‑configured timeout parameters can cause false‑positive rebalances:
Heartbeat timeout: The heartbeat is sent every 3 seconds by default (heartbeat.interval.ms). If the coordinator receives none within session.timeout.ms (45 seconds by default on recent Kafka versions), it assumes the consumer is dead and starts a rebalance.
Poll timeout: If processing a batch takes longer than max.poll.interval.ms (default 5 minutes), the consumer is expelled from the group even if heartbeats are normal, triggering a rebalance.
We experienced this when processing large orders took six minutes, exceeding the default poll interval and causing frequent rebalances.
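A minimal consumer configuration sketch pulling these timeouts together; the broker address, group id, and the 10-minute poll interval are illustrative values rather than settings from the original incident:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TimeoutTunedConsumer {
    static KafkaConsumer<String, String> build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-consumer-group");    // illustrative group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // How often this consumer heartbeats to the coordinator (default 3 s).
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");
        // How long the coordinator waits without a heartbeat before declaring the consumer dead.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "45000");
        // Maximum gap between poll() calls; set above the worst-case batch processing time
        // (10 minutes here, to cover the 6-minute orders described above).
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000");

        return new KafkaConsumer<>(props);
    }
}
```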
Problems Caused by Rebalance
1. Consumption pause and message backlog
All consumers pause while partitions are reassigned. In large groups (e.g., 100 consumers, 1,000 partitions) a rebalance can last tens of seconds, during which messages continue to pile up and downstream services cannot read data.
2. Duplicate and lost messages
After a rebalance, the new owner of a partition resumes from the last committed offset. If that offset lags behind what was actually processed, the same messages are read again (duplicates); if it is ahead of what was processed, the skipped messages are never read (loss).
In extreme cases, coordinator failures can force leadership changes on the __consumer_offsets partitions, corrupting offset data and rolling consumption progress back by days.
3. Resource waste and load imbalance
Frequent rebalances consume CPU and network resources on the Kafka cluster. The default partition assignment strategies (Range or RoundRobin) can also distribute load unevenly, e.g., one consumer ends up with two partitions while another gets only one, doubling the first consumer's load; if that consumer then falls behind and misses its poll deadline, it can trigger yet another rebalance.
When Can Data Be Lost?
Rebalance itself does not delete data, but combined with offset handling it can cause loss:
1. Auto‑commit + unfinished processing
Kafka auto‑commits offsets after auto.commit.interval.ms (default 5 seconds). If a rebalance occurs after the offset is committed but before processing finishes, the unprocessed messages are never consumed.
Consumer A polls offsets 100‑200 and auto‑commits 200 after 5 seconds.
When processing has only reached offset 150, the node crashes, triggering a rebalance.
Consumer B starts from offset 200, so messages 150‑199 are lost.
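A condensed sketch of this loss window with auto-commit enabled; processOrder is a hypothetical handler standing in for the slow order processing described above:

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AutoCommitLossWindow {
    static void run(KafkaConsumer<String, String> consumer) {
        // With enable.auto.commit=true, offsets are committed in the background during poll()
        // once auto.commit.interval.ms (default 5 s) has elapsed -- independent of processing.
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            // By now the end of the previous batch (e.g., offset 200) may already be committed.
            for (ConsumerRecord<String, String> record : records) {
                processOrder(record); // a crash while handling offset 150 loses 150-199:
                                      // the next owner resumes from the committed offset 200
            }
        }
    }

    static void processOrder(ConsumerRecord<String, String> record) {
        // hypothetical business handler
    }
}
```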
2. Manual commit at the wrong time
If a manual commit is performed before processing the batch, a rebalance can cause the new consumer to skip those messages.
Incorrect flow: commit offset → process messages.
Rebalance occurs after the commit but before processing, leading to loss.
The correct approach is to process first, then commit.
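A sketch of the process-first, commit-after loop, assuming auto-commit is disabled (enable.auto.commit=false) and handle stands in for the business logic:

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ProcessThenCommit {
    static void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                handle(record); // process first ...
            }
            consumer.commitSync(); // ... then commit; a rebalance before this line can only
                                   // cause re-delivery (duplicates), never loss
        }
    }

    static void handle(ConsumerRecord<String, String> record) {
        // hypothetical business logic
    }
}
```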
When Can Duplicate Consumption Occur?
Duplicate consumption is more common than loss, and also stems from offset‑commit timing:
1. Manual commit interrupted by rebalance
If a rebalance happens between processing completion and offset commit, the new consumer will re‑read the same range.
Consumer A finishes processing offsets 100‑200, but a heartbeat timeout kicks it out before committing.
Consumer B starts from offset 100, re‑processing 100‑200.
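For orderly rebalances, this window can be narrowed by committing inside a ConsumerRebalanceListener, so the departing consumer hands over its progress before the partitions move. A sketch, assuming processed-but-uncommitted offsets are tracked in a map; note that it does not help a consumer that has already been expelled by a heartbeat timeout:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class CommitOnRevoke {
    // Offsets of records that are fully processed but not yet committed.
    private final Map<TopicPartition, OffsetAndMetadata> pending = new HashMap<>();

    void subscribe(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("order-topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Called before the partitions are handed to another consumer:
                // flush processed offsets so the new owner does not re-read them.
                consumer.commitSync(pending);
                pending.clear();
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // nothing extra needed for this sketch
            }
        });
    }
}
```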
2. Poll timeout kicks out a consumer still processing
When processing exceeds max.poll.interval.ms, the consumer is considered dead and removed, even though it is still handling the batch.
Consumer A processes a large message for six minutes (exceeding the default 5‑minute limit) and is expelled.
Consumer B begins from the last committed offset, re‑processing the same messages.
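The article's remedy for this, raising max.poll.interval.ms, appears in the configuration sketch earlier; a complementary knob not mentioned above is max.poll.records, which shrinks each batch so processing fits inside the interval. Added to the Properties from that earlier sketch:

```java
// Fewer records per poll() make it easier to finish within max.poll.interval.ms.
// 100 is an illustrative value; the Kafka default is 500.
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
```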
3. Offset reset to earliest
If auto.offset.reset is set to earliest and the committed offset is unavailable (e.g., corrupted), a rebalance will cause the consumer to start from the beginning of the topic, replaying old messages.
How to Optimize Rebalance
1. Avoid frequent rebalances
Adjust timeout parameters based on processing time: increase max.poll.interval.ms (e.g., to 10 minutes for large messages) and raise session.timeout.ms to 60-120 seconds so that briefly unresponsive consumers are not falsely declared dead (the trade-off is slower detection of genuinely failed consumers).
Stabilize consumer nodes: monitor CPU and memory, prevent frequent pod restarts in Kubernetes, and ensure host reliability.
2. Safe offset handling
Prefer manual commits: disable auto-commit (enable.auto.commit=false) and call commitSync() only after the messages have been processed successfully.
For critical pipelines, use Kafka transactions to make message processing and offset commit atomic.
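A sketch of the transactional consume-process-produce pattern, assuming the producer was created with a transactional.id and the output goes to a hypothetical processed-order-topic; the consumed offsets are committed inside the same transaction as the output records, so processing and offset commit succeed or fail together:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class TransactionalPipeline {
    static void run(KafkaConsumer<String, String> consumer, KafkaProducer<String, String> producer) {
        producer.initTransactions(); // requires transactional.id on the producer config
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            if (records.isEmpty()) continue;

            producer.beginTransaction();
            try {
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    // hypothetical transform: forward the processed order downstream
                    producer.send(new ProducerRecord<>("processed-order-topic", record.key(), record.value()));
                    offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                }
                // Commit the consumed offsets as part of the same transaction as the output.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction(); // neither output nor offsets become visible
            }
        }
    }
}
```

Downstream consumers should read with isolation.level=read_committed so aborted batches stay invisible; a production version would also rewind the consumer to the last committed offsets after an abort.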
3. Optimize partition assignment
Use the sticky assignor to preserve existing partition assignments as much as possible: set partition.assignment.strategy to the fully qualified class name org.apache.kafka.clients.consumer.StickyAssignor.
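Applied to the consumer Properties from the earlier configuration sketch, this looks like:

```java
// Sticky assignment keeps existing ownership wherever possible across rebalances.
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
          org.apache.kafka.clients.consumer.StickyAssignor.class.getName());
// Kafka 2.4+ also ships CooperativeStickyAssignor, which additionally rebalances
// incrementally instead of pausing every consumer in the group.
```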
4. Improve consumer logic
Implement idempotent processing, e.g., use order IDs as unique keys so that even if a message is consumed twice, the business outcome remains correct.
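A sketch of idempotent handling keyed on the order ID; the in-memory set is for illustration only, and a real deployment would back it with a shared durable store (a database unique constraint, Redis, etc.) visible to every consumer instance:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class IdempotentOrderHandler {
    // Tracks order IDs that have already been applied (illustration only).
    private final Set<String> processedOrderIds = ConcurrentHashMap.newKeySet();

    void handle(ConsumerRecord<String, String> record) {
        String orderId = record.key(); // assumes the order ID is the record key
        if (!processedOrderIds.add(orderId)) {
            return; // duplicate delivered after a rebalance: already applied, skip it
        }
        applyBusinessLogic(record);
    }

    void applyBusinessLogic(ConsumerRecord<String, String> record) {
        // hypothetical: update order state, trigger payment, etc.
    }
}
```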
Conclusion
Rebalance is a double‑edged sword for Kafka consumer groups: it can balance load when used correctly, but can also cause severe outages if mishandled. Remember:
Rebalance is triggered by consumer or partition changes, or timeout events.
Data loss and duplicate consumption stem from mismatched offset commits and rebalance timing.
Key mitigations include tuning timeouts, using manual commits, and ensuring idempotent processing.
Understanding and optimizing rebalance can turn a potential failure into a stable, high‑throughput streaming system.