Why Do Message Queues Get Backlogged and How to Fix It Fast?
This article examines why message queues become backlogged—covering producer overload, broker persistence failures, and consumer bottlenecks—and outlines a step‑by‑step scaling and remediation strategy to restore smooth processing, including temporary queue expansion, load‑balanced forwarding, and post‑recovery cleanup.
Background
In the previous two chapters (parts 11 and 12 of this series) we covered how messaging components ensure reliable delivery and ordered consumption, in three areas:
How to keep message production stable and reliable.
How to ensure reliability from production, across the network, to receipt by the broker, typically with ACK mechanisms.
How to ensure reliable consumption, including broker persistence and consumer processing.
A problem at any of these stages can block the queue, especially during traffic spikes such as promotional events, flash sales, or auctions.
Root Cause Analysis
2.1 Producer Overload
Producers can emit several times the expected message volume. Typical causes:
Traffic spikes from shopping festivals such as 618 and Double 11, auctions, or flash sales. These call for capacity planning ahead of the event.
Program defects such as infinite loops, batch requests, or memory leaks that lead to a surge in messages. These call for more robust code.
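The capacity-planning point can be made concrete with some simple arithmetic. The sketch below is illustrative only: the function names, the 25% headroom, and all the rates are assumptions, not figures from any particular system.

```python
import math

def consumers_needed(peak_rate: float, per_consumer_rate: float,
                     headroom: float = 0.25) -> int:
    """Consumers required to keep up with peak_rate (msgs/s) plus spare headroom."""
    if per_consumer_rate <= 0:
        raise ValueError("per_consumer_rate must be positive")
    return math.ceil(peak_rate * (1 + headroom) / per_consumer_rate)

def drain_seconds(backlog: int, consume_rate: float, produce_rate: float) -> float:
    """Time to clear a backlog; infinite if consumers are not faster than producers."""
    net = consume_rate - produce_rate
    return math.inf if net <= 0 else backlog / net

# Hypothetical numbers: 2,000 msgs/s on a normal day, 10,000 msgs/s during an event,
# and each consumer instance sustaining 500 msgs/s.
normal_day = consumers_needed(2_000, 500)    # 5 consumers
event_peak = consumers_needed(10_000, 500)   # 25 consumers

# With 25 consumers (12,500 msgs/s) against 10,000 msgs/s of incoming traffic,
# a backlog of one million messages takes 400 seconds to clear.
recovery = drain_seconds(1_000_000, 12_500, 10_000)
```

The second function also shows why a backlog never shrinks on its own while production stays at or above consumption: the net drain rate is zero or negative.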
2.2 Broker Reception and Persistence Failures
Broker servers may encounter service faults, network latency, or persistence failures, though these are relatively rare.
2.3 Consumer Capacity Degradation
Possible reasons include:
Massive retries after consumption failures causing backlog.
Consumer program faults such as deadlocks or I/O blocking.
Resource bottlenecks. Modern queues can handle tens of thousands of messages per second per node, but insufficient capacity planning can still exhaust resources; scaling out broker instances usually resolves this.
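One common way to stop retries from snowballing into a backlog is to cap the number of attempts and park messages that keep failing. The following is a minimal in-memory sketch of that idea, not the API of any specific broker; `MAX_ATTEMPTS`, `consume`, and the dead-letter list are hypothetical names chosen for illustration.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed limit; tune per workload

def consume(queue: deque, handler, dead_letters: list) -> int:
    """Drain the queue, trying each message at most MAX_ATTEMPTS times."""
    processed = 0
    while queue:
        body, attempts = queue.popleft()
        try:
            handler(body)
            processed += 1
        except Exception:
            if attempts + 1 >= MAX_ATTEMPTS:
                dead_letters.append(body)           # park it for offline inspection
            else:
                queue.append((body, attempts + 1))  # requeue at the tail
    return processed

def handler(body: str) -> None:
    # Stand-in for business logic: one message can never be processed.
    if body == "bad":
        raise ValueError("cannot process")

q = deque([("ok-1", 0), ("bad", 0), ("ok-2", 0)])
dlq = []
done = consume(q, handler, dlq)  # the two good messages succeed, "bad" lands in dlq
```

Without the cap, the failing message would circulate forever and occupy consumer capacity that healthy messages need.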
Solutions for Message Backlog
When a backlog occurs, first identify which of the three causes applies:
Producer overload
Broker reception/persistence failure
Consumer capacity degradation
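As a rough illustration of this triage, the three causes can be told apart by comparing current rates against a baseline. The metric names, the `QueueMetrics` type, and the 1.5× tolerance below are all assumptions made for the sketch; real monitoring systems expose different signals.

```python
from dataclasses import dataclass

@dataclass
class QueueMetrics:
    produce_rate: float      # msgs/s currently published
    baseline_produce: float  # msgs/s expected from capacity planning
    consume_rate: float      # msgs/s currently acknowledged by consumers
    baseline_consume: float  # msgs/s consumers normally sustain
    broker_healthy: bool     # result of a broker health/persistence check

def classify_backlog(m: QueueMetrics, tolerance: float = 1.5) -> str:
    """Return the most likely of the three backlog causes."""
    if not m.broker_healthy:
        return "broker reception/persistence failure"
    if m.produce_rate > m.baseline_produce * tolerance:
        return "producer overload"
    if m.consume_rate < m.baseline_consume / tolerance:
        return "consumer capacity degradation"
    return "undetermined"

# Hypothetical reading: production is normal but consumers have slowed to a crawl.
cause = classify_backlog(QueueMetrics(
    produce_rate=2_000, baseline_produce=2_000,
    consume_rate=300, baseline_consume=2_000,
    broker_healthy=True,
))
```

The broker check comes first because a faulty broker distorts both rates, which would make the other two comparisons unreliable.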
After pinpointing the cause, resolve the issue and temporarily expand capacity to process the accumulated messages. The concrete steps are:
Analyze the root cause. If the broker or a consumer is faulty, restore it first; if the consumption logic is at fault, fix the program.
Pause current consumer processing.
Create temporary queues, that is, new topics, with 10× (or N×) the original number of partitions.
Develop a simple forwarding program to evenly distribute the backlogged messages into the expanded queues.
Scale consumers by the same 10× factor, and scale the services they depend on, such as cache, database, and file storage, to match.
Once the backlog has been consumed, revert to the original architecture to avoid wasting resources.
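The forwarding step in the list above can be sketched with plain Python lists standing in for partitions. This is only an illustration of even distribution by round-robin; the partition counts are assumed, and a real forwarder would read from and publish to the broker through its client library.

```python
from itertools import cycle

def forward_backlog(backlog, temp_partitions):
    """Move every backlogged message into temp_partitions in round-robin order."""
    targets = cycle(temp_partitions)
    for message in backlog:
        next(targets).append(message)

original_partitions = 2   # assumed size of the original topic
factor = 10               # the N in "N×"
temp = [[] for _ in range(original_partitions * factor)]

backlog = [f"msg-{i}" for i in range(1000)]
forward_backlog(backlog, temp)

sizes = {len(p) for p in temp}  # every temporary partition holds the same share
```

Round-robin spreads the load evenly but does not keep messages with the same key together, so if ordered consumption matters, the forwarder would need to pick the partition from a hash of the message key instead.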
Conclusion
This article presented the likely causes of message backlog and the basic steps to remediate it. Most messaging middleware is robust; improper use by business code and inadequate scaling are the more common triggers, so continuous monitoring of service changes is essential.
Architecture & Thinking