Why Do Message Queues Get Backlogged and How to Fix It Fast?

This article examines why message queues become backlogged—covering producer overload, broker persistence failures, and consumer bottlenecks—and outlines a step‑by‑step scaling and remediation strategy to restore smooth processing, including temporary queue expansion, load‑balanced forwarding, and post‑recovery cleanup.

Architecture & Thinking

Jun 9, 2023

Background

The previous two installments in this series (parts 11 and 12) covered how messaging middleware ensures reliable delivery and ordered consumption, namely:

How to guarantee stability and reliability during message production.

How to ensure reliability from production through network transmission to broker receipt, typically using ACK mechanisms.

How to ensure reliable message consumption, including broker persistence and consumer processing.

A problem at any of these stages can cause messages to pile up in the queue, especially during traffic spikes such as promotional events, flash sales, or auctions.

Root Cause Analysis

2.1 Producer Overload

Message production volume can exceed the expected level several times over. Common causes:

Traffic spikes from events such as the 618 and Double 11 shopping festivals, auctions, and flash sales (addressed by capacity planning).

Program defects, such as infinite loops, unbounded batch requests, or memory leaks, that cause a surge in published messages (addressed by defensive coding and testing).
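The capacity-planning point above can be made concrete with a little arithmetic. The following is a minimal sketch, not a sizing rule from the original article: the function name, the 30% headroom default, and the example rates are all assumptions chosen for illustration.

```python
import math


def required_consumers(baseline_msgs_per_sec: float,
                       spike_multiplier: float,
                       per_consumer_msgs_per_sec: float,
                       headroom: float = 0.3) -> int:
    """Estimate how many consumer instances are needed to keep up with a spike.

    All parameters are hypothetical inputs you would measure yourself:
    - baseline_msgs_per_sec: normal production rate
    - spike_multiplier: how many times larger the peak is expected to be
    - per_consumer_msgs_per_sec: measured throughput of one consumer
    - headroom: extra safety margin on top of the expected peak
    """
    if per_consumer_msgs_per_sec <= 0:
        raise ValueError("per_consumer_msgs_per_sec must be positive")
    peak = baseline_msgs_per_sec * spike_multiplier
    target = peak * (1 + headroom)
    return math.ceil(target / per_consumer_msgs_per_sec)


# Example: 2,000 msg/s baseline, 5x spike, 1,500 msg/s per consumer.
print(required_consumers(2_000, 5, 1_500))
```

If the result is larger than the number of partitions, extra consumers in the same group would sit idle on partition-based brokers, so the partition count needs to be planned alongside the consumer count.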

2.2 Broker Reception and Persistence Failures

Broker servers may encounter service faults, network latency, or persistence failures, though these are relatively rare.

2.3 Consumer Capacity Degradation

Possible reasons include:

Massive retries after consumption failures causing backlog.

Consumer program faults such as deadlocks or I/O blocking.

Resource bottlenecks: while modern queues can handle tens of thousands of messages per second per node, insufficient capacity planning can lead to issues; scaling out broker instances usually resolves them.
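One common way to keep retries from snowballing into a backlog is to cap the number of attempts and set failed messages aside. The sketch below illustrates that idea in plain Python; it is not taken from the original article, and `handler`, `dead_letters`, `max_attempts`, and `base_delay_s` are hypothetical names standing in for whatever your client library provides.

```python
import time
from typing import Any, Callable, List


def consume_with_bounded_retry(message: Any,
                               handler: Callable[[Any], None],
                               dead_letters: List[Any],
                               max_attempts: int = 3,
                               base_delay_s: float = 0.1) -> bool:
    """Process one message with a bounded number of attempts.

    Returns True if the handler succeeded. After max_attempts failures the
    message is appended to dead_letters (a stand-in for a dead-letter queue)
    so that it no longer blocks the messages behind it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except Exception:
            if attempt < max_attempts:
                # Exponential backoff: base, 2*base, 4*base, ...
                time.sleep(base_delay_s * 2 ** (attempt - 1))
    dead_letters.append(message)
    return False
```

Messages parked this way can be inspected and replayed after the underlying fault is fixed, instead of being retried indefinitely while new messages accumulate.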

Solutions for Message Backlog

When a backlog occurs, first identify which of the three causes applies.

Producer overload

Broker reception/persistence failure

Consumer capacity degradation
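The triage among these three causes can be sketched as a simple comparison of current rates against normal ones. This is only an illustration under assumed inputs: the `QueueMetrics` fields, the `tolerance` threshold of 1.5, and the order of checks are hypothetical and would need to be adapted to the metrics your monitoring actually exposes.

```python
from dataclasses import dataclass


@dataclass
class QueueMetrics:
    produce_rate: float         # msgs/s currently being published
    consume_rate: float         # msgs/s currently being acknowledged
    normal_produce_rate: float  # typical publish rate outside incidents
    normal_consume_rate: float  # typical consume rate outside incidents
    broker_healthy: bool        # result of your own broker health check


def diagnose(m: QueueMetrics, tolerance: float = 1.5) -> str:
    """Return the most likely backlog cause based on coarse rate comparisons."""
    if not m.broker_healthy:
        return "broker reception/persistence failure"
    if m.produce_rate > m.normal_produce_rate * tolerance:
        return "producer overload"
    if m.consume_rate < m.normal_consume_rate / tolerance:
        return "consumer capacity degradation"
    return "no clear cause; inspect further"


# Example: production is at 5x normal while consumers run at their usual rate.
print(diagnose(QueueMetrics(10_000, 2_000, 2_000, 2_000, True)))
```

More than one cause can apply at once, so a real check would likely report all matching conditions rather than the first one.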

After pinpointing the cause, resolve the issue and temporarily expand capacity to process the accumulated messages. The concrete steps are:

Analyze the root cause. If the broker or a consumer is down, restore it first; if the consumption logic is at fault, fix the program.

Pause current consumer processing.

Create a temporary queue (a new topic) with 10× (or N×) the original number of partitions.

Develop a simple forwarding program to evenly distribute the backlogged messages into the expanded queues.

Scale consumers 10× accordingly, and also scale dependent services such as cache, database, and file storage.

Once the backlog has been drained, remove the temporary queues and extra consumers and revert to the original architecture to avoid wasting resources.
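The forwarding step and the effect of scaling can be sketched with an in-memory model. This is a minimal illustration, not a real broker client: `forward_backlog` and `estimated_drain_seconds` are hypothetical helpers, a `deque` stands in for the backlogged queue, and plain lists stand in for the partitions of the temporary topic.

```python
from collections import deque
from itertools import cycle
from typing import Any, Deque, List


def forward_backlog(backlog: Deque[Any], partition_count: int) -> List[List[Any]]:
    """Drain the backlog and spread its messages round-robin over partitions."""
    if partition_count <= 0:
        raise ValueError("partition_count must be positive")
    partitions: List[List[Any]] = [[] for _ in range(partition_count)]
    targets = cycle(range(partition_count))
    while backlog:
        partitions[next(targets)].append(backlog.popleft())
    return partitions


def estimated_drain_seconds(backlog_size: int,
                            consume_rate: float,
                            produce_rate: float) -> float:
    """Time to clear the backlog, given rates in msgs/s.

    The backlog only shrinks if consumption outpaces ongoing production.
    """
    net_rate = consume_rate - produce_rate
    if net_rate <= 0:
        return float("inf")
    return backlog_size / net_rate


# Ten backlogged messages spread over four temporary partitions.
print(forward_backlog(deque(range(10)), 4))

# One million backlogged messages, 10,000 msg/s consumed, 2,000 msg/s produced.
print(estimated_drain_seconds(1_000_000, 10_000, 2_000))
```

Round-robin distribution gives an even spread but does not keep messages with the same key in the same partition, so it suits messages that do not depend on ordering. Where ordering matters, the forwarder would need to route by key instead.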

Conclusion

This article described the likely causes of message backlog and the basic steps for remediation. Most middleware is robust; backlogs are more often triggered by improper business usage and inadequate scaling, so continuous monitoring of service changes is essential.
