How RocketMQ Harnesses Prometheus for Full‑Stack Observability
This article explains how RocketMQ integrates with Prometheus and Grafana to provide comprehensive metrics, tracing, and logging, detailing the exporter architecture, deployment choices, span topology, dashboard examples, and ARMS‑based alerting for cloud‑native message‑queue observability.
RocketMQ, Alibaba's high‑performance messaging platform, is presented as a flagship cloud‑native product with a mature observability solution built on Prometheus. The article outlines the three pillars of observability—Metrics, Tracing, and Logging—and shows how RocketMQ implements each.
Metrics
RocketMQ ships a ready‑to‑use Prometheus exporter and Grafana dashboards that expose message volume, backlog, latency at each processing stage, and other key indicators. The exporter periodically pulls data from the MQ cluster via MQAdminExt, normalises it, and exposes it on an HTTP endpoint for Prometheus to scrape.
Tracing
OpenTelemetry tracing is integrated on both the client and server sides. Clients embed an OpenTelemetry exporter that sends spans in batches to a proxy (C‑Broker). The proxy acts as a collector, merging the client‑side spans with its own. Users can configure a custom collector, use the commercial hosted store, or run an open‑source backend. A redesigned span topology models the message lifecycle (Prod, Recv, Await, Proc, Ack/Nack) and conforms to the OpenTelemetry specification.
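To make the lifecycle stages concrete, here is a minimal sketch of one message's spans. The stage names come from the text; the strictly linear parent chain, the `Span` type, and `build_lifecycle` are assumptions made for illustration, since the article does not spell out the exact parent and link relationships. The ID sizes follow OpenTelemetry (16-byte trace IDs, 8-byte span IDs).

```python
import secrets
from dataclasses import dataclass
from typing import List, Optional

# Stages every message passes through before it is acknowledged or rejected.
STAGES = ("Prod", "Recv", "Await", "Proc")


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by all spans of one message
    span_id: str              # unique per span
    parent_id: Optional[str]  # None for the root (Prod) span
    stage: str


def build_lifecycle(acked: bool) -> List[Span]:
    """Build the span chain for one message (assumed linear for simplicity)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes, hex encoded
    final_stage = "Ack" if acked else "Nack"
    spans: List[Span] = []
    parent_id: Optional[str] = None
    for stage in STAGES + (final_stage,):
        span = Span(trace_id, secrets.token_hex(8), parent_id, stage)
        spans.append(span)
        parent_id = span.span_id
    return spans
```

For example, `build_lifecycle(acked=False)` yields five spans that share one trace ID and end in a `Nack` stage.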
Logging
Standardised client‑side logging simplifies issue localisation by providing consistent log formats across producers, brokers, and consumers.
Exporter Deployment Choices
Two deployment modes are discussed: embedding the Prometheus client directly in the application (low overhead, no extra components) or running a separate exporter process (decoupled from the application, easier for third‑party code). The recommendation is to embed the client when you control the code and to run a separate exporter otherwise.
High‑Cardinality Mitigation
Because RocketMQ metrics can carry many labels (tenant, instance, topic, consumer group, etc.), the number of time series grows with the product of the distinct values of each label. The article advises keeping label cardinality in check to avoid an excessive series count, higher storage cost, and slower queries. Specific optimisations were applied to the native Prometheus client to control memory usage.
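A rough way to see the problem, and one generic way to bound it, is sketched below. The article does not describe the actual optimisations made to the Prometheus client, so `LabelLimiter` is a hypothetical, commonly used technique (folding overflow label values into a single `other` value), not the RocketMQ team's implementation.

```python
from math import prod


def estimated_series(label_cardinalities):
    """Worst-case series count: the product of distinct values per label."""
    return prod(label_cardinalities.values())


class LabelLimiter:
    """Cap the number of distinct values tracked per label (hypothetical helper)."""

    def __init__(self, max_values):
        self.max_values = max_values
        self._seen = {}  # label name -> set of admitted values

    def clamp(self, label, value):
        seen = self._seen.setdefault(label, set())
        if value in seen:
            return value
        if len(seen) < self.max_values:
            seen.add(value)
            return value
        # Over the cap: fold the new value into one shared series.
        return "other"
```

With 50 tenants, 200 topics, and 20 consumer groups, `estimated_series` returns 200000 for a single metric name, which is why a cap or a coarser label set matters.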
Multi‑Tenant Monitoring
In production, each tenant’s RocketMQ resources are isolated. Deploying a dedicated exporter per tenant would be impractical, so RocketMQ adopts a shared exporter approach that tags metrics with tenant identifiers, enabling per‑tenant monitoring without proliferating exporter instances.
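The shared-exporter idea amounts to stamping every sample with a tenant label before it is exposed. The sketch below assumes samples are `(name, labels, value)` tuples; the `tenant` label name, the metric name in the usage note, and both helper functions are hypothetical.

```python
def tag_with_tenant(per_tenant_samples):
    """Merge samples from all tenants, adding a tenant label to each one."""
    tagged = []
    for tenant_id, samples in per_tenant_samples.items():
        for name, labels, value in samples:
            tagged.append((name, {**labels, "tenant": tenant_id}, value))
    return tagged


def for_tenant(tagged, tenant_id):
    """Select one tenant's samples, as a per-tenant dashboard query would."""
    return [sample for sample in tagged if sample[1]["tenant"] == tenant_id]
```

One exporter process can then serve every tenant, and a per-tenant view is just a filter on the `tenant` label, for example `for_tenant(tagged, "t-b")`.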
Full‑Link Tracing
The tracing flow consists of:
Client‑side OpenTelemetry exporter sending spans to the proxy.
Proxy acting as a collector for both client and its own spans.
Optional storage back‑ends (custom collector, commercial hosted store, or self‑hosted).
Span topology aligned with the message lifecycle.
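The first two steps of this flow can be sketched as a client-side buffer that flushes batches to a proxy, which files them by trace ID next to its own spans. Both classes, the batch size, and the dictionary-shaped spans are assumptions for illustration; a real client would use the OpenTelemetry SDK's exporter machinery and a network transport.

```python
from collections import defaultdict


class ProxyCollector:
    """Proxy side: gathers client spans and its own spans under one trace ID."""

    def __init__(self):
        self.by_trace = defaultdict(list)

    def receive(self, batch):
        for span in batch:
            self.by_trace[span["trace_id"]].append(span)

    def record_own(self, span):
        # Spans produced by the proxy itself join the same trace.
        self.receive([span])


class BatchingExporter:
    """Client side: buffers spans and ships them once the batch is full."""

    def __init__(self, collector, max_batch=3):
        self.collector = collector
        self.max_batch = max_batch
        self._buffer = []

    def export(self, span):
        self._buffer.append(span)
        if len(self._buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self._buffer:
            self.collector.receive(self._buffer)
            self._buffer = []
```

Batching trades a little delivery delay for fewer round trips; calling `flush()` on shutdown avoids losing a partially filled buffer.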
Accurate Metrics
Server‑side aggregation of tracing data produces OpenMetrics‑compatible metrics that integrate seamlessly with Prometheus and Grafana.
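One plausible shape for this aggregation is a per-stage latency histogram with cumulative buckets, which is how Prometheus and OpenMetrics represent histograms. The bucket bounds, the `(stage, duration_ms)` input shape, and the `aggregate` function are assumptions; the article does not specify which metrics are derived from the spans.

```python
from bisect import bisect_left

# Upper bounds in milliseconds; one extra implicit bucket covers +Inf.
BUCKETS_MS = (10, 50, 100, 500, 1000)


def aggregate(spans):
    """Aggregate (stage, duration_ms) pairs into cumulative histograms."""
    histograms = {}
    for stage, duration_ms in spans:
        hist = histograms.setdefault(
            stage, {"buckets": [0] * (len(BUCKETS_MS) + 1), "count": 0, "sum": 0.0}
        )
        # Index of the first bucket whose upper bound is >= the duration.
        hist["buckets"][bisect_left(BUCKETS_MS, duration_ms)] += 1
        hist["count"] += 1
        hist["sum"] += duration_ms
    # Convert per-bucket counts into cumulative counts ("le" semantics).
    for hist in histograms.values():
        running = 0
        for index, bucket_count in enumerate(hist["buckets"]):
            running += bucket_count
            hist["buckets"][index] = running
    return histograms
```

Because the counts are computed on the server from every span rather than sampled on the client, the last bucket always equals the total number of observations for that stage.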
Grafana Dashboards
The provided dashboards cover an overview, topic‑level send rates, consumer group performance, and more. They offer richer and more precise data than the open‑source equivalents and are continuously refined by the RocketMQ team.
ARMS Integration
RocketMQ’s tracing data is stored in Alibaba Cloud Log Service, then transformed into Prometheus‑compatible metrics via an ETL pipeline. ARMS creates a dedicated Prometheus instance per cloud user, delivering isolated storage, multi‑tenant dashboards, and alarm capabilities. The ARMS console integrates Grafana views and alarm rules, allowing one‑click activation of monitoring for any RocketMQ instance.
Message Backlog Diagnosis
The article explains how to interpret backlog metrics such as Ready messages and Queue time, identify root causes (consumer failures or upstream overload), and set appropriate alerts on send health, consumption latency, and related logs or traces.
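A first-pass triage rule in this spirit might look like the sketch below. The thresholds, the parameter names, and the decision order are illustrative assumptions rather than the article's rules; real alert conditions should be tuned to each workload.

```python
def diagnose_backlog(
    ready_messages,
    queue_time_s,
    send_tps,
    consume_tps,
    baseline_send_tps,
    *,
    ready_limit=10_000,
    queue_time_limit_s=60.0,
    surge_factor=2.0,
):
    """Classify a backlog using Ready messages, Queue time, and throughput."""
    if ready_messages < ready_limit and queue_time_s < queue_time_limit_s:
        return "healthy"
    if send_tps > surge_factor * baseline_send_tps:
        # Producers are sending far more than usual.
        return "upstream_overload"
    if consume_tps < send_tps:
        # Normal inflow, but consumers are not keeping up.
        return "consumer_slow_or_failing"
    # Backlog exists but consumers are outpacing producers.
    return "draining"
```

A result of `consumer_slow_or_failing` would point the investigation at consumer logs and traces, while `upstream_overload` would point at the producers.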
Alerting and Incident Response
ARMS provides end‑to‑end alert configuration, scheduling, and handling workflows, plus intelligent noise reduction and multi‑channel notifications (e.g., DingTalk). Alerts can be linked to trace IDs and logs for rapid root‑cause analysis.
Overall, the integration demonstrates how RocketMQ leverages Prometheus, OpenTelemetry, and Alibaba Cloud ARMS to deliver a comprehensive, cloud‑native observability stack for messaging workloads.
Alibaba Cloud Native
