Stop Guessing: Kafka Message Backlog, Duplicates, and Loss Are Usually Caused by Rebalance
Kafka consumer issues such as message backlog, duplicate processing, and data loss often stem from consumer group rebalances triggered by changes in consumer count, partition count, subscription topics, or heartbeat and poll timeouts, and can be mitigated by tuning timeout settings, managing offset commits, and using sticky partition assignment.
When Does Rebalance Trigger?
Rebalance is the redistribution of partitions among consumers in a consumer group. It occurs only when the mapping between consumers and partitions is broken. Common scenarios include:
1. Consumer count changes (most frequent)
Scaling up: during a traffic peak a new consumer node is added, e.g., three partitions originally handled by two consumers are redistributed so each consumer gets one partition.
Scaling down: a consumer node crashes, loses network, or its process is killed, e.g., three consumers become two, forcing the remaining consumers to take over the missing partitions, which triggers a rebalance.
Our logging service once suffered from frequent pod restarts on a K8s node with insufficient resources; each restart caused a rebalance and the message backlog grew rapidly.
2. Topic partition count increased
Kafka does not support decreasing partitions, but when new partitions are added the existing consumer group does not automatically see them. A rebalance is required to assign the new partitions.
Example: expanding order-topic from 5 to 8 partitions. The consumer group continues to consume the original 5 partitions until a rebalance occurs, after which it starts handling the additional 3 partitions.
3. Subscribed topics changed
When the consumer group changes its subscription list via subscribe(), e.g., switching from only order-topic to both order-topic and pay-topic, a rebalance is triggered to redistribute all partitions of the newly subscribed topics.
4. Heartbeat or poll timeout (hidden pitfalls)
Consumers send heartbeats to the coordinator to prove they are alive. Misconfigured timeout parameters easily cause false‑positive rebalances:
Heartbeat timeout : the default heartbeat interval is 3 seconds ( heartbeat.interval.ms). If no heartbeat is sent for more than 45 seconds (default session.timeout.ms), the consumer is considered dead.
Poll timeout : the default max poll interval is 5 minutes ( max.poll.interval.ms). If processing a batch exceeds this limit, the consumer is kicked out of the group even if heartbeats are normal, causing a rebalance.
In our system, processing a large order message took 6 minutes, exceeding the default poll timeout and causing frequent rebalances.
Problems Caused by Rebalance
1. Consumption pause and backlog
During a rebalance all consumers pause consumption while new partition assignments are elected and initialized. In large groups (e.g., 100 consumers, 1000 partitions) a rebalance can last dozens of seconds, during which messages accumulate and downstream services cannot read data.
2. Duplicate and lost messages
After a rebalance, if offsets are not committed promptly, consumers may start from the last committed offset, causing uncommitted messages to be either processed again or skipped. In extreme cases, such as coordinator failure, the offset storage partition may switch leaders, corrupting offset data and rolling back progress by days.
3. Resource waste and load imbalance
Rebalances rely on the coordinator and consume CPU and network resources. Kafka’s default partition assignment strategies (Range or RoundRobin) can easily lead to uneven load. For example, five partitions assigned to two consumers may result in a 3‑vs‑2 partition split, doubling the load on one consumer, slowing it down, and potentially triggering another rebalance.
When Does Data Loss Occur?
1. Automatic offset commit + unfinished processing
Kafka’s default auto‑commit interval is 5 seconds ( auto.commit.interval.ms). If a consumer commits an offset before finishing processing the corresponding messages and a rebalance happens, the new consumer starts from the committed offset, and the in‑flight messages are lost.
Consumer A polls messages with offsets 100‑200 and auto‑commits offset 200 after 5 seconds.
While processing up to offset 150, the node crashes, triggering a rebalance.
Consumer B starts from offset 200; messages 150‑199 are never processed.
2. Manual commit at the wrong moment
If the offset is committed before the message is processed, a rebalance can cause the new consumer to skip the unprocessed messages, resulting in data loss.
Incorrect logic: commit offset → then process message.
Risk: rebalance occurs after commit but before processing, so the new consumer jumps ahead, leaving the message unhandled.
When Does Duplicate Consumption Occur?
1. Manual commit interrupted by rebalance
With manual commits, if a rebalance occurs between processing and committing, the offset is not persisted. The new consumer starts from the previous committed offset, re‑processing the same messages.
Consumer A finishes processing offsets 100‑200, prepares to commit, but is kicked out due to heartbeat timeout.
Consumer B begins from offset 100, causing the 100‑200 range to be consumed again.
2. Poll timeout kicks out a consumer that is still processing
If processing exceeds max.poll.interval.ms, the consumer is considered dead and removed from the group, even though it is still handling the message. The new consumer resumes from the last committed offset, leading to duplicate processing.
Consumer A processes a large message for 6 minutes (exceeding the 5‑minute default), gets kicked out.
Consumer B takes over the partition and starts from the last committed offset.
When Consumer A finally finishes and tries to commit, the commit fails because it is no longer part of the group, so the message is processed twice.
3. Offset reset to earliest causes replay
If auto.offset.reset is set to earliest (default is latest) and after a rebalance the stored offset cannot be found (e.g., corruption), the consumer starts from the earliest available message, replaying historic data.
How to Optimize Rebalance
1. Avoid frequent rebalances
Adjust timeout parameters based on processing latency: increase max.poll.interval.ms (e.g., to 10 minutes for large messages) and set session.timeout.ms to 60‑120 seconds to reduce false death detection. Keep consumer pods stable by monitoring CPU and memory, preventing frequent K8s restarts or server crashes.
2. Safe offset handling
Prefer manual commits and disable automatic commits ( enable.auto.commit=false). After a message is fully processed, call commitSync(). For workloads that cannot tolerate duplicates, use Kafka transactions to make message processing and offset commit atomic.
3. Optimize partition assignment
Use a sticky assignment strategy by setting partition.assignment.strategy to StickyAssignor. This keeps partitions on the same consumers across rebalances, minimizing movement.
4. Optimize consumption logic
Implement idempotency, e.g., use the order ID as a unique key so that even if a message is consumed twice, the business logic (such as charging or order creation) does not produce incorrect results.
Conclusion
Rebalance is triggered when consumers or partitions change, or when timeouts occur.
Data loss and duplicate consumption are fundamentally caused by mismatched offset commits and rebalance timing.
Mitigation focuses on tuning timeout parameters, handling offsets manually, and ensuring idempotent processing.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Programmer XiaoFu
xiaofucode.com – a programmer learning guide driven by the pursuit of profit
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
