How to Monitor and Alert on RocketMQ Message Backlog and Failures

This guide explains how to use RocketMQ's observability metrics, tracing, and logging to configure effective monitoring and alerting for common production issues such as message backlog and send/receive failures, helping teams quickly detect, locate, and resolve problems.

Alibaba Cloud Native
RocketMQ, a widely used distributed middleware, powers core business pipelines where each message reflects critical data changes, making observability essential for reliable operations.

RocketMQ Observability Overview

The observability stack consists of three pillars: Metrics, Tracing, and Logging. Version 5.x expands the metric set with backlog, latency, error distribution, and storage I/O indicators, providing finer‑grained insight than 4.x.

Key Monitoring Scenarios

Instance resource‑level water‑mark alerts to avoid throttling.

Business‑logic error alerts to catch abnormal send/receive errors early.

Performance metrics (RT, latency) alerts to enforce service‑level expectations.

Configuring Alerts for Message Backlog

Backlog is measured as the number of ready messages plus the number of inflight messages. RocketMQ 5.x adds dedicated delay‑time metrics that directly reflect consumption health:

MessageProcessingDelay – time from message receipt to business processing completion.

ReadyMessageQueueTime – time a ready message waits before being pulled.

Users should add these metrics to the instance’s monitoring page and set thresholds based on business tolerance.
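
As a minimal sketch of how such thresholds might be evaluated, the snippet below computes backlog as ready plus inflight and compares the two delay metrics against business-defined limits. The ConsumerGroupSample and BacklogThresholds classes, their field names, and the default limits are all hypothetical; real values would come from whatever monitoring export is in use.

```python
from dataclasses import dataclass


@dataclass
class ConsumerGroupSample:
    """One hypothetical metric sample for a consumer group."""
    ready_messages: int          # messages ready on the server, not yet pulled
    inflight_messages: int       # messages pulled, processing not yet finished
    processing_delay_s: float    # sampled MessageProcessingDelay, in seconds
    ready_queue_time_s: float    # sampled ReadyMessageQueueTime, in seconds


@dataclass
class BacklogThresholds:
    """Illustrative limits only; tune them to business tolerance."""
    max_backlog: int = 10_000
    max_processing_delay_s: float = 60.0
    max_ready_queue_time_s: float = 30.0


def backlog(sample):
    """Backlog = ready + inflight."""
    return sample.ready_messages + sample.inflight_messages


def evaluate(sample, limits):
    """Return the names of all thresholds the sample exceeds."""
    alerts = []
    if backlog(sample) > limits.max_backlog:
        alerts.append("backlog")
    if sample.processing_delay_s > limits.max_processing_delay_s:
        alerts.append("processing_delay")
    if sample.ready_queue_time_s > limits.max_ready_queue_time_s:
        alerts.append("ready_queue_time")
    return alerts


if __name__ == "__main__":
    sample = ConsumerGroupSample(8_000, 3_000, 12.0, 45.0)
    print(backlog(sample), evaluate(sample, BacklogThresholds()))
```

Alerting on the delay metrics as well as the raw count is deliberate: a large backlog that drains quickly may be harmless, while a small one that waits a long time may not be.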

Diagnosing Backlog Issues

Determine whether backlog resides on the server or client.

Check ons.log for "the cached message count exceeds the threshold" – indicates client‑side buffer saturation.

If absent, the backlog is server‑side; consider opening a support ticket.

Assess consumption latency.

Long latency suggests heavy business logic or I/O bottlenecks; examine stack traces of ConsumeMessageThread.

Normal latency but persistent backlog may require increasing consumer concurrency or scaling nodes.

Inspect client stack traces (e.g., captured with jstack) for blocking patterns such as lock contention, database calls, or external HTTP requests.

For non‑critical backlog, optionally reset the consumer offset to skip stale messages and resume from the latest position.
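
The first two diagnostic steps above can be sketched as a small triage helper. This is an illustration under assumptions, not an official tool: the function name, the latency baseline parameter, and the sample log lines are made up, and only the quoted log message comes from the steps above.

```python
# Message quoted in the diagnostic steps; its presence in ons.log
# points to client-side buffer saturation.
CLIENT_CACHE_MARKER = "the cached message count exceeds the threshold"


def triage(log_lines, avg_consume_ms, baseline_ms):
    """Return a list of findings for a backlog incident.

    log_lines      -- iterable of lines read from the client's ons.log
    avg_consume_ms -- observed average consumption latency (hypothetical input)
    baseline_ms    -- latency considered normal for this consumer (hypothetical)
    """
    findings = []
    if any(CLIENT_CACHE_MARKER in line for line in log_lines):
        findings.append("client-side backlog: local message cache is saturated")
    else:
        findings.append("server-side backlog: consider opening a support ticket")

    if avg_consume_ms > baseline_ms:
        findings.append("slow consumption: inspect ConsumeMessageThread stack traces")
    else:
        findings.append("normal latency: increase consumer concurrency or scale out")
    return findings


if __name__ == "__main__":
    # Made-up log lines; the real ons.log layout may differ.
    sample_log = [
        "INFO consumer started",
        "WARN the cached message count exceeds the threshold",
    ]
    for finding in triage(sample_log, avg_consume_ms=850, baseline_ms=200):
        print(finding)
```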

Preventing Backlog

During design, establish performance baselines by load‑testing consumption latency and concurrency. Optimize by reducing computational complexity, minimizing unnecessary I/O, and off‑loading heavy operations to asynchronous paths where safe.
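
One way to turn a load-test baseline into a capacity estimate is Little's law (concurrency ≈ arrival rate × processing time). The article does not prescribe this formula; it is used here as an assumption, and the function names, the 70% utilization target, and the figures in the usage lines are illustrative.

```python
import math


def required_threads(peak_tps, avg_consume_ms, target_utilization=0.7):
    """Estimate consumer threads needed to keep up with peak traffic.

    Little's law gives the busy threads as peak_tps * latency; dividing by
    a target utilization below 1.0 leaves headroom for bursts.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    busy_threads = peak_tps * avg_consume_ms / 1000
    return math.ceil(busy_threads / target_utilization)


def required_nodes(threads, threads_per_node):
    """Number of consumer nodes needed to host the given thread count."""
    return math.ceil(threads / threads_per_node)


if __name__ == "__main__":
    threads = required_threads(peak_tps=2000, avg_consume_ms=20)
    print(threads, required_nodes(threads, threads_per_node=20))
```

If the measured latency rises under load, the estimate should be recomputed with the loaded figure, since the formula assumes latency stays at the baseline.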

Handling Message Send/Receive Failures

Failures often stem from API rate limits, network issues, broker restarts, or permission problems. Common error patterns include:

"messages flow control, flow limit threshold is ..." – indicates API TPS exceeding the instance quota.

"RemotingConnectException" or "RemotingTimeoutException" – network connectivity or latency problems.

"system busy, start flow control" – broker overload or resource contention.
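
For log-based alerting, these patterns can be matched with a simple substring classifier. The sketch below is illustrative: the category labels and function names are invented here, and only the matched strings come from the list above.

```python
from collections import Counter

# (substring to look for, hypothetical category label)
ERROR_PATTERNS = [
    ("messages flow control, flow limit threshold is", "api_rate_limit"),
    ("RemotingConnectException", "network"),
    ("RemotingTimeoutException", "network"),
    ("system busy, start flow control", "broker_overload"),
]


def classify_error(message):
    """Map one error message to a category, or 'unknown' if nothing matches."""
    for needle, category in ERROR_PATTERNS:
        if needle in message:
            return category
    return "unknown"


def summarize(messages):
    """Count how many messages fall into each category."""
    return Counter(classify_error(m) for m in messages)


if __name__ == "__main__":
    # Made-up sample messages for demonstration.
    sample = [
        "RemotingTimeoutException: wait response timeout",
        "system busy, start flow control for a while",
        "some unrelated error",
    ]
    print(summarize(sample))
```

Counting by category rather than by raw message keeps alert rules stable even when the variable parts of a message (thresholds, addresses) change.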

Recommended mitigations:

Set API call rate alerts at ~70% of the instance’s maximum TPS.

Configure throttling‑count alerts to monitor how often the instance is rate‑limited.

Verify network paths, bandwidth, and JVM GC activity (e.g., frequent Full GC) that may introduce latency.

Implement robust retry and fallback logic in client code; if retries exceed the configured limit, trigger an alert and consider failover.
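
The threshold and retry recommendations could look roughly like the following. This is a generic sketch that does not use the RocketMQ client API: `send` stands for any callable that sends one message, `on_exhausted` is a hypothetical alert hook, and the attempt count and delays are arbitrary defaults.

```python
import time


def rate_alert_threshold(max_tps, percent=70):
    """TPS level at which to alert, as an integer percentage of the quota."""
    return max_tps * percent // 100


def send_with_retry(send, message, max_attempts=3, base_delay_s=0.2,
                    on_exhausted=None, sleep=time.sleep):
    """Call send(message), retrying with exponential backoff on failure.

    After max_attempts failures, on_exhausted(message, error) is invoked
    (for example to raise an alert or start failover) and the last error
    is re-raised so the caller can apply its own fallback.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return send(message)
        except Exception as exc:  # real code should catch only the client's exceptions
            last_error = exc
            if attempt < max_attempts:
                # Delay doubles each time: base, 2*base, 4*base, ...
                sleep(base_delay_s * 2 ** (attempt - 1))
    if on_exhausted is not None:
        on_exhausted(message, last_error)
    raise last_error


if __name__ == "__main__":
    def always_fail(msg):
        raise TimeoutError("simulated send timeout")

    try:
        send_with_retry(always_fail, "order-created",
                        on_exhausted=lambda m, e: print("ALERT:", m, e),
                        sleep=lambda s: None)
    except TimeoutError:
        print("fallback path taken")
```

Passing `sleep` as a parameter keeps the helper easy to test; production code would leave the default in place.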

Conclusion

By leveraging RocketMQ’s enriched metric set, OpenTelemetry‑compatible tracing, and detailed error logging, operators can build precise monitoring rules that detect backlog and send/receive anomalies early, pinpoint root causes, and apply corrective actions to maintain high‑availability messaging services.
