Why Kafka Rebalance Causes Backlog, Duplicates, and Data Loss—and How to Fix It

Kafka consumer group rebalances can trigger message backlogs, duplicate processing, and data loss; this article explains common rebalance triggers, their impact on consumption, and practical configuration and coding strategies—such as tuning timeout parameters, using manual offset commits, and sticky partition assignment—to minimize disruptions.

When Does Rebalance Trigger?

Rebalance is essentially the redistribution of partitions among consumers in a consumer group, and it only occurs when the consumer‑to‑partition mapping is broken. Common scenarios include:

1. Consumer count changes (most frequent)

Scaling up: Adding a consumer during a traffic peak forces a new partition-to-consumer mapping (e.g., three partitions originally handled by two consumers are reassigned when a third consumer joins).

Scaling down: A consumer crashes, loses network connectivity, or is killed, causing the remaining consumers to take over its partitions and triggering a rebalance.

In our logging service, frequent pod restarts on a K8s node caused repeated rebalances, leading to severe message backlog.

2. Topic partition count increases

Kafka does not support decreasing partitions. When partitions are added, existing consumer groups do not pick them up immediately; the new partitions are only assigned after the consumers refresh their metadata and go through a rebalance. For example, expanding order-topic from 5 to 8 partitions leaves the group consuming only the original 5 until that rebalance occurs.
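
For illustration, a minimal sketch of increasing the partition count with the Java AdminClient (classes from org.apache.kafka.clients.admin); the broker address and topic name are assumptions, and error handling is omitted:

// Assumption: broker reachable at localhost:9092; adjust for your cluster.
Properties adminProps = new Properties();
adminProps.put("bootstrap.servers", "localhost:9092");
try (AdminClient admin = AdminClient.create(adminProps)) {
    // Raise order-topic to 8 partitions (the argument is the new total, not a delta).
    admin.createPartitions(Map.of("order-topic", NewPartitions.increaseTo(8))).all().get();
}
// Existing consumers start reading the new partitions only after a metadata
// refresh and the rebalance it triggers.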

3. Subscribed topics change

If a consumer group modifies its subscription list via subscribe(), for instance adding pay-topic alongside order-topic, a rebalance redistributes all subscribed partitions.
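
In code this is simply a new subscribe() call with the extended topic list (topic names here are illustrative):

// Re-subscribing with an extra topic; the next poll() joins a rebalance
// that redistributes partitions of both topics across the group.
consumer.subscribe(Arrays.asList("order-topic", "pay-topic"));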

4. Heartbeat or poll timeout (hidden pitfalls)

Heartbeat timeout: Consumers send a heartbeat every heartbeat.interval.ms (default 3 s). If the group coordinator receives no heartbeat within session.timeout.ms (default 45 s on recent Kafka versions), the consumer is considered dead and a rebalance starts.

Poll timeout: If processing a batch exceeds max.poll.interval.ms (default 5 min), the consumer is expelled from the group even if heartbeats are normal.

In a large‑order processing scenario where each message took 6 minutes, the poll timeout caused frequent rebalances.

Problems Caused by Rebalance

1. Consumption pause and message backlog

During a rebalance (under the default eager protocol), every consumer in the group pauses consumption while the group coordinator and the group leader compute and distribute the new partition assignments. In large groups (e.g., 100 consumers, 1000 partitions), a rebalance can last tens of seconds, so messages accumulate and downstream services see delayed data.

2. Message duplication and loss

After a rebalance, the new owner of a partition resumes from the last committed offset. Messages that were processed but whose offsets were not yet committed are processed again (duplicates), while messages whose offsets were committed before processing finished are skipped (loss). In extreme cases, coordinator failures can corrupt offset storage, rolling consumption back by days.

3. Resource waste and load imbalance

Frequent rebalances consume CPU and network resources on both the Kafka cluster and the clients. The default partition assignment strategies (Range or RoundRobin) can also leave load unevenly distributed; for example, two consumers sharing five partitions end up with a 3-vs-2 split, and if the heavier consumer falls behind and exceeds its poll interval, it can trigger yet another rebalance.

When Does Data Loss Occur?

1. Auto‑commit offset + processing not finished

Kafka’s default auto-commit interval (auto.commit.interval.ms) is 5 seconds. If a rebalance happens after an offset is auto-committed but before the corresponding messages are processed, those messages are lost.

Consumer A polls offsets 100‑200 and auto‑commits offset 200 after 5 seconds.

While processing up to offset 150, the node crashes, triggering a rebalance.

Consumer B starts from offset 200, so messages 150‑199 are never processed.
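
One way this failure shows up in code is when records are handed off for asynchronous processing while the poll loop keeps running and auto-committing. A hedged sketch, where process() and the executor thread pool are assumed helpers and imports are omitted:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "order-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("enable.auto.commit", "true");        // default: offsets committed in the background
props.put("auto.commit.interval.ms", "5000");   // default: every 5 s, on a later poll()

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("order-topic"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        executor.submit(() -> process(record)); // hand-off: processing continues in the background
    }
    // The next poll() may auto-commit offsets for records that are still queued or
    // still being processed; a crash or rebalance at that point loses them.
}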

2. Manual commit at the wrong time

Committing offsets before processing messages also leads to loss. If a rebalance occurs after the premature commit, the new consumer skips the unprocessed messages.

Incorrect logic: commit offset → then process message.

Risk: rebalance between commit and processing causes the new consumer to start after the committed offset, dropping the pending messages.

The correct approach is to process the message first, then commit the offset.
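
A minimal sketch of the safe ordering with manual commits (enable.auto.commit=false); process(record) stands in for the business logic:

props.put("enable.auto.commit", "false");
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record);          // 1. finish the business logic first
    }
    consumer.commitSync();        // 2. only then commit the batch's offsets
    // Committing before process(record) would let a rebalance hand the partition to a
    // consumer that starts after the committed offset, skipping the unprocessed batch.
}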

When Does Duplicate Consumption Occur?

Duplicate consumption is more common than data loss and also stems from offset‑commit timing.

1. Manual commit interrupted by rebalance

If a rebalance occurs between processing a batch and committing its offset, the new consumer will re‑read from the last committed position, duplicating the work.

Consumer A processes offsets 100‑200, then attempts to commit.

Heartbeat timeout kicks the consumer out before the commit succeeds.

Consumer B starts from offset 100, re‑processing messages 100‑200.
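
One common mitigation is a ConsumerRebalanceListener that commits whatever has already been processed before partitions are handed over. A sketch assuming synchronous processing; note it helps with orderly rebalances but cannot save a consumer that is hard-kicked before the callback runs:

// currentOffsets tracks the next offset to commit per partition (illustrative helper).
Map<TopicPartition, OffsetAndMetadata> currentOffsets = new HashMap<>();

consumer.subscribe(Collections.singletonList("order-topic"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called while the rebalance is in progress: commit what we have processed
        // so the next owner does not re-read the whole batch.
        consumer.commitSync(currentOffsets);
    }
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) { }
});

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record);
        currentOffsets.put(new TopicPartition(record.topic(), record.partition()),
                           new OffsetAndMetadata(record.offset() + 1));
    }
    consumer.commitSync(currentOffsets);
}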

2. Poll timeout kicks out a consumer still processing

When processing exceeds max.poll.interval.ms, the consumer is considered dead and removed from the group, even though it continues processing locally. The new consumer starts from the last committed offset, causing the in‑flight messages to be processed again.

Consumer A takes 6 minutes on a large message (exceeds default 5 min).

Consumer A is expelled; it later finishes processing, but its offset commit is rejected because it no longer belongs to the group.

Consumer B re‑consumes the same message, leading to duplication.
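
A common remedy is to shrink the batch and/or lengthen the allowed processing window so that one poll() cycle reliably fits inside max.poll.interval.ms. A config fragment with illustrative values:

// Fewer records per poll() means less work per cycle (default is 500).
props.put("max.poll.records", "50");
// More time allowed between poll() calls for genuinely slow messages (default 300000 = 5 min).
props.put("max.poll.interval.ms", "600000");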

3. Offset not found, fallback to earliest

If auto.offset.reset is set to earliest and, after a rebalance, no valid committed offset is found (for example because the offsets expired or the offset data was corrupted), the consumer starts from the earliest available message, replaying historical data.
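
The reset policy is a one-line config; which value is less harmful depends on whether replaying old data or silently skipping the gap is worse for the business:

// "earliest" replays history when no valid committed offset exists;
// "latest" (the default) jumps to the end and skips the gap instead.
props.put("auto.offset.reset", "earliest");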

How to Optimize Rebalance

1. Avoid frequent rebalances

Tune timeout parameters based on processing time: increase max.poll.interval.ms (e.g., to 10 minutes for large messages) and set session.timeout.ms to 60‑120 seconds to reduce false death detection.
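
As a config fragment, those timeout knobs might look like this (values are illustrative; keep heartbeat.interval.ms at roughly one third of session.timeout.ms or less):

props.put("session.timeout.ms", "90000");     // 60-120 s: tolerate brief pauses or GC without being declared dead
props.put("heartbeat.interval.ms", "30000");  // roughly session.timeout.ms / 3
props.put("max.poll.interval.ms", "600000");  // 10 min for batches containing slow, large messages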

Ensure consumer stability: monitor CPU and memory, prevent frequent pod restarts in K8s, and avoid node failures.

2. Safely handle offset commits

Disable auto-commit (enable.auto.commit=false) and use manual commits after successful processing, e.g., commitSync().
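
Beyond the simple process-then-commitSync() loop shown earlier, a common idiom is to commit asynchronously on the hot path and synchronously on shutdown. A hedged sketch, where running is an assumed volatile shutdown flag and process() the business logic:

try {
    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            process(record);
        }
        // Non-blocking commit keeps throughput up; a failed async commit is
        // superseded by later commits.
        consumer.commitAsync();
    }
} finally {
    try {
        // Blocking commit on the way out so the last processed offsets are not lost.
        consumer.commitSync();
    } finally {
        consumer.close();
    }
}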

Consider Kafka transactions for exactly‑once semantics when duplicate consumption is unacceptable.
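
When the pipeline is consume-process-produce, Kafka transactions can commit the output records and the input offsets atomically. A heavily simplified sketch; the output topic processed-order-topic, the transform() helper, and the error handling are assumptions, and fatal producer exceptions would really require closing and restarting:

// Producer side: a stable transactional.id enables atomic offset + output commits.
producerProps.put("transactional.id", "order-processor-1");
producerProps.put("enable.idempotence", "true");
KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
producer.initTransactions();

// Consumer side: read only committed records and turn off auto-commit.
consumerProps.put("isolation.level", "read_committed");
consumerProps.put("enable.auto.commit", "false");

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    if (records.isEmpty()) continue;
    producer.beginTransaction();
    try {
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (ConsumerRecord<String, String> record : records) {
            producer.send(new ProducerRecord<>("processed-order-topic", record.key(), transform(record)));
            offsets.put(new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
        }
        // Offsets are committed inside the same transaction as the output records.
        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
        producer.commitTransaction();
    } catch (Exception e) {
        producer.abortTransaction();  // output and offset commit roll back together
    }
}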

3. Optimize partition assignment

Use a sticky assignment strategy by setting partition.assignment.strategy to the fully qualified class name org.apache.kafka.clients.consumer.StickyAssignor (or CooperativeStickyAssignor on newer clients), which tries to keep existing partition allocations during a rebalance, reducing movement.
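
As a config fragment:

// Sticky assignment: keeps prior assignments where possible during a rebalance.
props.put("partition.assignment.strategy", "org.apache.kafka.clients.consumer.StickyAssignor");
// On newer clients, CooperativeStickyAssignor additionally avoids the full stop-the-world
// pause by revoking only the partitions that actually move.
// props.put("partition.assignment.strategy", "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");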

4. Improve consumption logic

Implement idempotency (e.g., using order ID as a unique key) so that even if a message is processed multiple times, business logic remains correct.
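
A minimal sketch of idempotent handling keyed on an order ID; orderDao, its methods, and handle() are assumptions standing in for real storage and business logic:

void process(ConsumerRecord<String, String> record) {
    String orderId = record.key();                  // unique business key
    if (orderDao.alreadyProcessed(orderId)) {
        return;                                     // duplicate delivery: safely skip
    }
    // Ideally an insert guarded by a unique constraint on orderId, so a race
    // between duplicate deliveries still results in exactly one effect.
    orderDao.saveResult(orderId, handle(record));
}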

Conclusion

Rebalances are triggered by changes in consumer count, partition count, subscribed topics, or timeout events.

Data loss and duplicate consumption stem from mismatched offset commit timing and rebalance occurrences.

Key mitigations include tuning timeout settings, using manual offset commits (or transactions), and ensuring idempotent processing.
