How I Resolved an 8‑Million‑Message MQ Backlog at 2 AM: A Proven Generic Solution
At 2 AM an alert fired when a RocketMQ queue surged from 500 K to 10 M messages, causing severe latency; the article walks through root-cause analysis, a five-step emergency fix, long-term architectural upgrades, monitoring, and automation scripts to reliably eliminate such MQ backlogs.
Incident Overview
At 02:00 the monitoring system raised an MQ backlog alarm; within 15 minutes the accumulated messages grew from 500 K to 10 M, causing severe delays in order payment and logistics push notifications.
Root Cause
The immediate trigger was a night‑time bulk promotion that pushed 8 M marketing messages without traffic assessment. The architecture defect was a single‑threaded consumer model and an overly high alarm threshold (500 K messages).
Five‑Step Resolution Framework
Locate the cause
Classify the backlog source as production‑side (≈10 %), broker‑side (≈10 %), or consumer‑side (≈80 %).
Use RocketMQ Dashboard to check consumer heartbeat, queue distribution, and thread‑pool utilization.
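When the Dashboard is unavailable, the same numbers can be pulled programmatically. Below is a minimal sketch using the rocketmq-tools admin API (RocketMQ 4.x package paths; the consumer group and nameserver address are placeholders):
// Backlog check via the admin API: total diff = sum over queues of (brokerOffset - consumerOffset).
import org.apache.rocketmq.common.admin.ConsumeStats;
import org.apache.rocketmq.tools.admin.DefaultMQAdminExt;

public class BacklogCheck {
    public static void main(String[] args) throws Exception {
        DefaultMQAdminExt admin = new DefaultMQAdminExt();
        admin.setNamesrvAddr("127.0.0.1:9876"); // placeholder nameserver
        admin.start();
        try {
            ConsumeStats stats = admin.examineConsumeStats("order_consumer"); // placeholder group
            System.out.printf("backlog=%d msgs, consumeTps=%.1f%n",
                    stats.computeTotalDiff(), stats.getConsumeTps());
        } finally {
            admin.shutdown();
        }
    }
}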
Emergency stop‑gap (temporary consumer scaling)
Temporarily add consumer instances. Within a consumer group, each queue is consumed by exactly one instance, so the effective ceiling is the queue count (e.g., 100 queues → at most 100 consumer instances).
Enable batch consumption by setting consumeMessageBatchMaxSize to 16–32 (the default is 1); this can raise throughput from ~500 msg/s to ~8 000 msg/s (see the consumer sketch after this list).
Ensure idempotent processing to avoid duplicate handling when a consumer restarts.
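A minimal sketch of the temporary consumer with these settings, using the RocketMQ 4.x Java client; the group name, topic, and nameserver address are placeholders, and process(msg) stands in for the real business handler:
// Temporary scaled-up consumer: 50 threads, batch size 16 (names are placeholders).
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.common.message.MessageExt;

public class TmpConsumer {
    public static void main(String[] args) throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("order_consumer_tmp");
        consumer.setNamesrvAddr("127.0.0.1:9876");  // placeholder nameserver
        consumer.subscribe("ORDER_TOPIC", "*");     // placeholder topic
        consumer.setConsumeThreadMin(50);
        consumer.setConsumeThreadMax(50);
        consumer.setConsumeMessageBatchMaxSize(16); // up to 16 messages per callback
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, ctx) -> {
            for (MessageExt msg : msgs) {
                // process(msg) must be idempotent: a crash redelivers the whole batch
            }
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
    }
}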
Strategic de‑escalation (consumer downgrade)
Pause non‑critical consumers.
Reduce thread counts of non‑critical consumer groups to free resources for high‑priority flows.
Parallel blast (heavy‑weight measures)
If the queue count caps scaling, create a temporary topic with 10× the original queue count, dispatch the backlog into it, and scale consumers accordingly (see the dispatcher sketch below).
After the backlog is cleared, switch back to the original topic.
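A sketch of the dispatcher that drains the backlogged topic into the wider temporary one; topic names, group names, and the address are illustrative, and the temporary topic is assumed to already exist with 10× the queues:
// Drain ORDER_TOPIC into ORDER_TOPIC_TMP (created beforehand with 10x the queues).
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageExt;

public class BacklogDispatcher {
    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("dispatch_producer");
        producer.setNamesrvAddr("127.0.0.1:9876"); // placeholder
        producer.start();

        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("dispatch_consumer");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        consumer.subscribe("ORDER_TOPIC", "*");    // original backlogged topic
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, ctx) -> {
            try {
                for (MessageExt msg : msgs) {
                    // Forward the payload unchanged; tags and keys are preserved.
                    producer.send(new Message("ORDER_TOPIC_TMP",
                            msg.getTags(), msg.getKeys(), msg.getBody()));
                }
                return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
            } catch (Exception e) {
                return ConsumeConcurrentlyStatus.RECONSUME_LATER; // retry the batch
            }
        });
        consumer.start();
    }
}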
Root‑cause architecture upgrade (long‑term fix)
Production side: implement dynamic throttling based on broker backlog (e.g., token-bucket limiting TPS when backlog > 500 K).
Consumer side:
Optimize thread pools (CPU-bound: cores × 2; I/O-bound: cores × 5); a sizing sketch follows this list.
Split topics by business type and isolate thread pools with Hystrix or Resilience4j.
Keep business logic lightweight (SQL under 50 ms, cache external calls).
Broker side:
Upgrade hardware to SSD (≥1 TB) and 10 GbE network.
Increase cluster size (nodes = production TPS ÷ 10 K) and set queue count = nodes × 8.
Enable async flush and configure fileReservedTime for appropriate message retention.
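As a sketch of the thread-pool rule above, a small sizing helper; whether a group is I/O-bound is an assumption you set per consumer group:
// Rule of thumb from the text: I/O-bound ~ cores x 5, CPU-bound ~ cores x 2.
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;

public class PoolSizing {
    static void size(DefaultMQPushConsumer consumer, boolean ioBound) {
        int cores = Runtime.getRuntime().availableProcessors();
        int threads = ioBound ? cores * 5 : cores * 2; // over-provisioning causes context-switch overhead
        consumer.setConsumeThreadMin(threads);
        consumer.setConsumeThreadMax(threads);
    }
}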
Monitoring & Alerting
Build multi‑dimensional metrics covering production TPS, consumer latency, thread‑pool utilization, and broker disk usage. Example thresholds:
Producer – message send TPS > 120 % of consumer max TPS → alert via RocketMQ Dashboard.
Consumer – consume latency > 10 s (normal) / > 30 s (spike) → alert via Prometheus + Grafana.
Consumer – thread‑pool utilization > 80 % (normal) / > 90 % (spike) → alert via JVM monitor (Arthas).
Broker – backlog size > 100 K (normal) / > 1 M (spike) → alert via RocketMQ Dashboard.
Broker – disk usage > 80 % (normal) / > 90 % (spike) → alert via Zabbix.
Dashboards should display backlog trends, production vs. consumption rates, and per‑node load distribution, with automated alerts when backlog grows while consumption lags.
Emergency Response Workflow
Standardize the process into four stages:
Detection (≤ 5 min)
Control (≤ 30 min)
Digestion (≤ 2 h)
Recovery (≤ 4 h)
Assign responsibilities (on‑call ops, consumer developers, business owners) and verify each stage with concrete metrics (e.g., backlog decreasing > 50 K/min).
Automation Scripts
# Quick consumer scaling script: launch 50 temporary consumer instances in the background
for i in {1..50}; do
  java -jar consumer.jar \
    --consumerGroup order_consumer_tmp \
    --namesrvAddr xxx:9876 \
    --consumeThreadMin 50 \
    --consumeThreadMax 50 \
    --consumeMessageBatchMaxSize 16 \
    --clientIP $(curl -s ifconfig.me) &
done
# Cap retries at 0 so failed messages go straight to the dead-letter queue
curl -X POST http://rocketmq-dashboard:8080/consumer/update \
-H "Content-Type: application/json" \
-d '{"groupName":"order_consumer","maxReconsumeTimes":0}'Regular drills (monthly and pre‑peak rehearsals) simulate consumer thread‑pool saturation, broker I/O bottlenecks, and production spikes to validate the run‑book.
Technical Analysis Details
MQ backlog occurs when the production rate exceeds the consumption rate plus the broker's forwarding capacity. Because brokers are usually stable, the investigation should prioritize consumer-side issues.
Consumer‑Side Diagnosis
Check for “fake death”: if a consumer instance has not reported a heartbeat for > 2 min, it is likely stuck (e.g., Full GC pause or infinite loop).
Verify queue load balance: each consumer should be assigned a similar number of queues; uneven distribution indicates that some instances are overloaded.
Inspect thread‑pool size: default consumeThreadMin / consumeThreadMax is 20. For I/O‑intensive workloads, set threads to CPU cores × 5; for CPU‑intensive workloads, set to CPU cores × 2. Over‑provisioning (e.g., > 32 threads for CPU‑bound tasks) can degrade performance due to context switching.
Enable batch consumption: set consumeMessageBatchMaxSize to 16‑32. Values > 32 give diminishing returns and increase memory pressure.
Ensure idempotent processing: batch consumption may cause duplicate handling after a restart; use unique message IDs or deduplication tables.
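A minimal dedup sketch keyed on the message ID; in production the in-memory set would be Redis (SETNX plus a TTL) or a database table with a unique key:
// Idempotency guard: skip deliveries that were already processed.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.rocketmq.common.message.MessageExt;

public class Dedup {
    // In production: Redis SETNX + TTL, or a unique-keyed dedup table.
    private static final Set<String> SEEN = ConcurrentHashMap.newKeySet();

    static boolean shouldProcess(MessageExt msg) {
        // Prefer a business key (msg.getKeys()) over msgId if producers also retry sends.
        return SEEN.add(msg.getMsgId()); // false => duplicate, skip processing
    }
}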
Broker‑Side Considerations
Hardware: use SSDs (random I/O 10× faster than HDD) and allocate ≥1 TB per broker.
Memory: allocate sufficient PageCache (e.g., 16 GB on a 32 GB machine) to reduce disk flush latency.
Network: deploy 10 GbE NICs and keep master‑slave nodes in the same data center (latency < 1 ms).
Cluster sizing: nodes = production TPS ÷ 10 K; queue count = nodes × 8 to ensure even distribution.
Flush strategy: use ASYNC_FLUSH for high‑throughput scenarios, accepting a small consistency window.
Retention: set fileReservedTime (e.g., 24 h for non‑core messages, 7 d for core messages) to avoid disk exhaustion.
Production‑Side Controls
Dynamic throttling: monitor broker backlog via the /topic/stats API; when the backlog exceeds 500 K, reduce producer TPS with a token bucket (e.g., from 2000 TPS to 500 TPS; see the sketch below).
Message slimming: keep only core fields in the payload; compress large payloads with Protobuf or GZIP; avoid logging‑type messages in MQ.
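A throttling sketch using Guava's RateLimiter as the token bucket; the 2000 → 500 TPS figures mirror the example above, and the backlog sample is assumed to come from a periodic check such as the admin-API one earlier:
// Token-bucket producer throttle driven by broker backlog (illustrative thresholds).
import com.google.common.util.concurrent.RateLimiter;

public class ThrottledSender {
    private final RateLimiter limiter = RateLimiter.create(2000.0); // normal rate: 2000 TPS

    // Call periodically with the latest backlog sample from the broker.
    void onBacklogSample(long backlog) {
        limiter.setRate(backlog > 500_000 ? 500.0 : 2000.0); // degrade to 500 TPS
    }

    void send(Runnable doSend) {
        limiter.acquire(); // blocks until a token is available
        doSend.run();
    }
}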