How to Observe RocketMQ Message Lifecycle with OpenTelemetry Metrics
This article explains how RocketMQ's message lifecycle can be fully observed using OpenTelemetry‑based metrics, covering producer, broker, and consumer stages, and shows practical monitoring, alerting, and troubleshooting practices for cloud‑native deployments.
Observability from the Message Lifecycle
RocketMQ stores messages in partitioned queues. Producers, consumers, and queues map to one another many‑to‑many, so each of them can scale horizontally. Observability is achieved by defining clear metrics for each phase of a message’s life, from sending to consumption acknowledgment.
Message Lifecycle Stages and Relevant Metrics
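Send Phase : Time from when the producer starts sending until the broker has persisted the message to disk. For scheduled messages, the phase also covers the wait until the scheduled delivery time, when the message becomes visible to consumers.
Broker Processing : The broker handles the message according to its type. Messages can accumulate on the broker (backlog), and delivery proceeds at the pace the consumers can sustain.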
Consumer Pull : Network and server processing time from pull request to client receipt.
Queueing : Wait time for processing resources after the message reaches the client.
Consumption : Time from processing start to commit/ACK.
Each stage can be instrumented, providing a comprehensive set of metrics for monitoring, alerting, capacity analysis, and fault diagnosis.
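The stage boundaries above can be turned into per-stage durations once a timestamp is available at each boundary. The following is a minimal sketch under that assumption; the MessageTimestamps class and its field names are hypothetical and do not correspond to any RocketMQ API.

```python
from dataclasses import dataclass


@dataclass
class MessageTimestamps:
    """Hypothetical per-message timestamps in milliseconds (illustration only)."""
    sent_at: int            # producer starts sending
    stored_at: int          # broker has persisted the message
    pull_requested_at: int  # consumer issues the pull request
    received_at: int        # client receives the message
    process_start_at: int   # a processing thread picks the message up
    acked_at: int           # consumer commits / ACKs the message


def stage_durations(ts: MessageTimestamps) -> dict:
    """Split the end-to-end latency into the five lifecycle stages."""
    return {
        "send": ts.stored_at - ts.sent_at,
        # Time spent on the broker, including any backlog wait.
        "broker_processing": ts.pull_requested_at - ts.stored_at,
        "consumer_pull": ts.received_at - ts.pull_requested_at,
        "queueing": ts.process_start_at - ts.received_at,
        "consumption": ts.acked_at - ts.process_start_at,
    }


if __name__ == "__main__":
    sample = MessageTimestamps(
        sent_at=1000, stored_at=1004, pull_requested_at=1050,
        received_at=1053, process_start_at=1060, acked_at=1100,
    )
    for stage, millis in stage_durations(sample).items():
        print(f"{stage}: {millis} ms")
```

Because the stages are contiguous, their durations add up to the end-to-end latency, which makes it easy to see which stage dominates when a message is slow.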
From Exporter to OpenTelemetry‑Based Metrics
The original RocketMQ 4.x exporter, contributed by the RocketMQ team, exposed broker, producer, and consumer metrics but suffered from several drawbacks:
Inability to monitor new modules such as Proxy introduced in RocketMQ 5.x.
Metric definitions not aligned with open‑source observability standards.
Excessive RPC calls adding load to brokers.
Poor extensibility; adding or changing metrics required modifying the broker admin interface.
To address these issues, the RocketMQ community adopted OpenTelemetry in version 5.x, redesigning metrics with Prometheus‑compatible types (Counter, Gauge, Histogram) and naming conventions. The new metrics cover broker, proxy, producer, and consumer modules, offering full‑lifecycle visibility.
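To make the three metric types concrete, here is a small standard-library sketch of their semantics. It is not the OpenTelemetry SDK or the RocketMQ implementation; the class names and the bucket bounds are assumptions chosen for illustration.

```python
import bisect


class Counter:
    """Monotonically increasing value, e.g. total messages received."""

    def __init__(self):
        self.value = 0

    def add(self, amount=1):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Point-in-time value that can go up or down, e.g. ready messages."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


class Histogram:
    """Distribution of observations, e.g. RPC latency in milliseconds."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One slot per bound plus a final overflow (+Inf) slot.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def record(self, value):
        # bisect_left places value == bound into that bound's bucket (le semantics).
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self):
        """Cumulative bucket counts, as Prometheus-style histograms expose them."""
        running, result = 0, []
        for count in self.counts:
            running += count
            result.append(running)
        return result
```

A counter suits rates (messages per second), a gauge suits current levels (backlog size), and a histogram suits latency percentiles, since quantiles can be estimated later from the bucket counts.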
Metric Reporting Modes
Three ways to expose metrics are supported:
Pull Mode : Prometheus scrapes metrics directly from broker/proxy endpoints using Kubernetes service discovery (PodMonitor/ServiceMonitor). No extra components are required.
Push Mode : Metrics are sent to an OpenTelemetry Collector, which can forward them to cloud observability services (e.g., AWS CloudWatch, Alibaba Cloud SLS).
Exporter Compatibility Mode : The existing exporter can consume the new metrics and act as a proxy for Prometheus, so current exporter deployments keep working without architectural changes.
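In pull mode, what Prometheus scrapes is a plain-text payload with one sample per line. The sketch below renders a sample in the Prometheus text exposition format, to show roughly what a scrape returns. The metric name comes from this article; the label names and values (topic, consumer_group) are assumptions for illustration.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: dict, value) -> str:
    """Render one sample line: name{label="value",...} value"""
    if not labels:
        return f"{name} {value}"
    body = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{body}}} {value}"


if __name__ == "__main__":
    # Label names and values here are hypothetical.
    print(render_sample(
        "rocketmq_consumer_ready_messages",
        {"topic": "orders", "consumer_group": "billing"},
        42,
    ))
```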
Building a Monitoring System – Best Practices
After collecting metrics in Prometheus, configure dashboards and alerts. Typical use cases include:
Interface Monitoring : RPC latency (avg, p90, p99), success rate, error reasons, request/response distribution.
Client Monitoring : Connection count, client version/language distribution, message size/type distribution.
Broker Monitoring : Dispatch latency, message retention time, thread‑pool queue depth, message backlog.
These examples illustrate only a fraction of available metrics; users should combine them according to business needs.
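Percentile latencies such as p90 and p99 are normally estimated from histogram buckets rather than from raw samples. The sketch below shows one common estimation approach, linear interpolation inside the bucket that contains the target rank, which is similar in spirit to Prometheus's histogram_quantile function. The bucket bounds and counts in the example are made up.

```python
import math


def quantile_from_buckets(q: float, buckets: list) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    with the last upper_bound equal to math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # No upper edge to interpolate towards: fall back to the lower edge.
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


def success_rate(succeeded: int, total: int) -> float:
    """Fraction of successful calls; defined as 1.0 when there was no traffic."""
    return 1.0 if total == 0 else succeeded / total


if __name__ == "__main__":
    # Hypothetical RPC latency buckets in milliseconds.
    latency = [(10, 50), (100, 90), (1000, 99), (math.inf, 100)]
    for q in (0.5, 0.9, 0.99):
        print(f"p{int(q * 100)} ~ {quantile_from_buckets(q, latency):.1f} ms")
```

The estimate is only as fine-grained as the bucket boundaries, so wide buckets around the latency range of interest produce coarse percentiles.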
Alert Configuration Example
Configure an alert on rocketmq_broker_dispatch_latency to trigger when dispatch delay exceeds a business‑defined threshold. Upon alert, investigate related metrics such as producer failure rate and subscription group creation errors.
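Alert rules usually require the condition to hold for a sustained period so that a single spike does not page anyone. The sketch below illustrates that evaluation logic in plain Python; it is not Prometheus or Alertmanager code, and the threshold and duration values are placeholders for business-defined settings.

```python
def should_fire(samples: list, threshold_ms: float, for_seconds: float) -> bool:
    """Return True if latency has stayed above the threshold for long enough.

    samples: (timestamp_seconds, latency_ms) pairs, oldest first.
    """
    breach_start = None
    for timestamp, latency in samples:
        if latency > threshold_ms:
            if breach_start is None:
                breach_start = timestamp  # start of the current breach
        else:
            breach_start = None  # any recovery resets the breach window
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds


if __name__ == "__main__":
    # Hypothetical dispatch latency samples taken every 30 seconds.
    observed = [(0, 50), (30, 120), (60, 130), (90, 140)]
    print(should_fire(observed, threshold_ms=100, for_seconds=60))
```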
Diagnosing Message Backlog Using Metrics
Backlog analysis focuses on two lifecycle stages:
Ready Messages : Messages stored on the broker, awaiting consumption.
In‑flight Messages : Messages already pulled by consumers but not yet acknowledged.
Key metrics:
rocketmq_consumer_ready_messages
rocketmq_consumer_inflight_messages
rocketmq_consumer_lag_latency
rocketmq_process_time
Typical scenarios:
Ready messages rise while in‑flight messages hit the client limit – consumer capacity is insufficient; scale out consumers or increase processing threads.
Ready messages near zero but in‑flight messages grow – often caused by a stuck consumer thread in 4.x clients; examine rocketmq_process_time to locate slow processing.
Ready messages rise while in‑flight messages stay low – consumers are not pulling; check ACL credentials, client hangs, or broker response latency (e.g., disk IOPS saturation).
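The three scenarios can be expressed as a simple decision rule over two consecutive readings of the ready and in-flight gauges. The sketch below is one possible encoding; the near_zero and low_ratio thresholds, the BacklogSample type, and the returned labels are all assumptions that would need tuning for a real workload.

```python
from dataclasses import dataclass


@dataclass
class BacklogSample:
    ready: int     # value of rocketmq_consumer_ready_messages
    inflight: int  # value of rocketmq_consumer_inflight_messages


def diagnose(prev: BacklogSample, curr: BacklogSample, inflight_limit: int,
             near_zero: int = 10, low_ratio: float = 0.2) -> str:
    """Classify a backlog pattern from two consecutive samples.

    near_zero and low_ratio are hypothetical thresholds, not RocketMQ defaults.
    """
    ready_rising = curr.ready > prev.ready
    inflight_rising = curr.inflight > prev.inflight

    if ready_rising and curr.inflight >= inflight_limit:
        # Consumers are saturated: scale out or add processing threads.
        return "insufficient_consumer_capacity"
    if curr.ready <= near_zero and inflight_rising:
        # Messages are pulled but never acknowledged: look at rocketmq_process_time.
        return "slow_or_stuck_processing"
    if ready_rising and curr.inflight <= low_ratio * inflight_limit:
        # Consumers are not pulling: check credentials, client hangs, broker latency.
        return "consumers_not_pulling"
    return "unclassified"


if __name__ == "__main__":
    before = BacklogSample(ready=1_000, inflight=500)
    after = BacklogSample(ready=5_000, inflight=1_000)
    print(diagnose(before, after, inflight_limit=1_000))
```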
Alibaba Cloud Native
