How NetEase Cloud Music Built a Custom High‑Availability Message Queue on RocketMQ

This article details NetEase Cloud Music's journey from evaluating RabbitMQ, Kafka, and RocketMQ to designing a fully controllable, high‑availability message queue with failover, tracing, monitoring, and numerous custom extensions that now serves hundreds of applications and hundreds of millions of messages daily.

dbaplus Community

Background

NetEase Cloud Music launched in April 2013 and quickly grew to a large user base. The backend migrated from a traditional Tomcat cluster to a distributed micro‑service architecture, which introduced RPC, an API gateway, and tracing, and moved the message queue from RabbitMQ to Kafka.

Why a Custom Queue Was Needed

Existing queues (RabbitMQ, Kafka, ActiveMQ) could not satisfy the business requirements:

RabbitMQ throughput with persistence enabled was about 26k msgs/s, far below the required QPS.

Kafka lacked dead‑letter handling, automatic retry, transactional messages, scheduled messages, filtering, broadcast, and synchronous flush.

ActiveMQ performance was also insufficient.

RocketMQ was selected as the base, but the open‑source version missed several critical features.

Limitations of Open‑Source RocketMQ

Broker only supports Master‑to‑Slave replication; no automatic failover.

Transactional message support is not open‑source.

No built‑in message tracing.

Missing alarm/monitoring framework.

Console UI is incomplete.

Business Requirements

The queue had to provide:

High availability with fast master‑slave failover.

Different QoS per scenario (e.g., loss‑less delivery for e‑commerce, transaction support).

End‑to‑end send/consume tracing and failure diagnostics.

Latency, error, and resource‑level monitoring with alerting.

HTTP access for Node.js/Python services.

Architecture Design

The solution retains the core of open‑source RocketMQ and adds several components:

Failover component: Periodically health‑checks brokers; on failure it promotes a slave to master within seconds.

Downgrade component: When a send fails, the client redirects the message to a standby disaster‑recovery cluster.

Monitor component: Client latency and exception data are wrapped into system messages, consumed by Monitor, stored in Elasticsearch, and used for alarm generation.

Inspection component: Regularly checks key broker and NameServer statuses and pushes alerts.

Metrics pipeline: Metrics are written to InfluxDB and visualized via Grafana dashboards.

Console: Provides resource management, message query, trace view, monitoring reports, alarm configuration, and dead‑letter re‑push.

NameServer discovery: Nginx hot‑doc serves as a dynamic discovery layer for NameServer addresses.
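
The failover behaviour described above (periodic health checks, then promotion of a slave) might look roughly like the following sketch. It is not the production component: `BrokerGroup`, `check_and_failover`, the `is_healthy` probe, the three-strike threshold, and the choice to demote the old master to a slave are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class BrokerGroup:
    """One master with its slaves; all names here are hypothetical."""
    master: str
    slaves: List[str]


def check_and_failover(
    group: BrokerGroup,
    is_healthy: Callable[[str], bool],
    failures: Dict[str, int],
    max_failures: int = 3,
) -> Optional[str]:
    """Run one health-check round; return the newly promoted master, if any."""
    if is_healthy(group.master):
        failures[group.master] = 0
        return None
    failures[group.master] = failures.get(group.master, 0) + 1
    if failures[group.master] < max_failures:
        return None  # tolerate transient blips before switching roles
    for candidate in group.slaves:
        if is_healthy(candidate):
            group.slaves.remove(candidate)
            group.slaves.append(group.master)  # assumption: old master rejoins as a slave
            group.master = candidate
            failures[candidate] = 0
            return candidate
    return None  # no healthy slave: leave roles unchanged
```

Requiring several consecutive failed probes before switching is a design choice in this sketch; it trades a slightly slower switch for fewer spurious promotions.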

Overall architecture diagram:

Architecture diagram

Interaction Flow

NameServer supplies topic routing and configuration.

Broker stores, indexes, and serves messages; exposes HA APIs (replication, role switch, health check) and registers with NameServer.

Nginx hot‑doc provides dynamic NameServer discovery without redeploy.

Downgrade component intercepts send failures and forwards messages to a disaster‑recovery cluster.

Failover component monitors broker health and triggers master‑slave promotion when needed.

Console enables resource control, message query, trace inspection, monitoring dashboards, alarm rules, and dead‑letter handling.

Inspection component runs periodic health checks and notifies operators on anomalies.

Monitor component aggregates metrics for Grafana dashboards.

Alarm component evaluates thresholds and sends notifications to configured recipients.

Dashboard visualizes cluster health, replication status, QPS, latency, and error rates.
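
To make the Downgrade step in this flow concrete, here is a minimal client-side sketch. It assumes each cluster is represented by a plain callable that raises on failure; `send_with_downgrade`, `SendFailed`, and the retry count are hypothetical and do not come from the RocketMQ client.

```python
from typing import Callable

Sender = Callable[[str, bytes], None]  # hypothetical: raises an exception on failure


class SendFailed(Exception):
    """Raised when neither the primary nor the standby cluster accepts a message."""


def send_with_downgrade(
    topic: str,
    body: bytes,
    primary: Sender,
    standby: Sender,
    retries: int = 2,
) -> str:
    """Try the primary cluster, then fall back to the disaster-recovery cluster.

    Returns the name of the cluster that accepted the message.
    """
    for _ in range(retries):
        try:
            primary(topic, body)
            return "primary"
        except Exception:
            continue  # assumption: every send error is treated as retryable
    try:
        standby(topic, body)
        return "standby"
    except Exception as exc:
        raise SendFailed(f"both clusters rejected message for {topic}") from exc
```

A real client would probably distinguish retryable errors from permanent ones; the sketch treats them alike to keep the control flow visible.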

Advanced Features Implemented

Message Tracing

Extended the message trace mechanism introduced in open‑source RocketMQ 4.4 to record send and consume timestamps, transaction rollback information, and failure exception details, enabling end‑to‑end debugging.

Message trace
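
As an illustration of what such trace data allows, the sketch below models trace events with the fields named above and folds them into a per-message timeline. The `TraceEvent` layout, the stage names, and `summarize` are assumptions; the article does not show the actual trace schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TraceEvent:
    msg_id: str
    stage: str                   # "send", "consume" or "rollback" (hypothetical names)
    timestamp_ms: int
    success: bool = True
    error: Optional[str] = None  # exception detail recorded on failure


def summarize(events: List[TraceEvent], msg_id: str) -> dict:
    """Collect one message's events into a timeline with end-to-end latency."""
    timeline = sorted(
        (e for e in events if e.msg_id == msg_id),
        key=lambda e: e.timestamp_ms,
    )
    sent = next((e for e in timeline if e.stage == "send"), None)
    consumed = next(
        (e for e in timeline if e.stage == "consume" and e.success), None
    )
    latency = consumed.timestamp_ms - sent.timestamp_ms if sent and consumed else None
    return {
        "stages": [e.stage for e in timeline],
        "latency_ms": latency,   # None until a successful consume is recorded
        "errors": [e.error for e in timeline if e.error],
    }
```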

Transactional Messages

Implemented a fully open‑source transactional message protocol with complete trace support, matching the functionality of the commercial version.
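
The article does not describe the protocol itself. The sketch below shows the general half-message pattern that RocketMQ-style transactional messaging is built around: a hidden message is stored first, the local transaction runs, and the broker later checks back on anything still unresolved. `ToyBroker`, `TxState`, and `send_in_transaction` are an in-memory illustration with hypothetical names, not the RocketMQ API or NetEase's implementation.

```python
from enum import Enum
from typing import Callable, Dict, List


class TxState(Enum):
    COMMIT = "commit"
    ROLLBACK = "rollback"
    UNKNOWN = "unknown"


class ToyBroker:
    """In-memory stand-in for a broker; not RocketMQ's real API."""

    def __init__(self) -> None:
        self.half: Dict[str, str] = {}   # msg_id -> body, hidden from consumers
        self.visible: List[str] = []     # bodies that consumers can see

    def send_half(self, msg_id: str, body: str) -> None:
        self.half[msg_id] = body

    def end_transaction(self, msg_id: str, state: TxState) -> None:
        if state is TxState.UNKNOWN:
            return  # keep the half message until a later check-back resolves it
        body = self.half.pop(msg_id)
        if state is TxState.COMMIT:
            self.visible.append(body)

    def check_back(self, checker: Callable[[str], TxState]) -> None:
        """Ask the producer about every unresolved half message."""
        for msg_id in list(self.half):
            self.end_transaction(msg_id, checker(msg_id))


def send_in_transaction(
    broker: ToyBroker, msg_id: str, body: str, local_tx: Callable[[], TxState]
) -> TxState:
    broker.send_half(msg_id, body)
    try:
        state = local_tx()
    except Exception:
        state = TxState.UNKNOWN  # sketch choice: let check-back decide later
    broker.end_transaction(msg_id, state)
    return state
```

The check-back step is what keeps a crashed producer from leaving messages in limbo: the broker, not the producer, drives resolution of unknown states.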

Multi‑Environment Isolation

Tagged RPC traffic and messages with environment identifiers, allowing independent development, testing, and production pipelines without cross‑environment interference.

Environment isolation
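
One way to picture the tagging is a message property that producers set and consumers compare against their own environment. The property key `env`, the fallback environment `stable`, and both helper functions are hypothetical; the article does not say how the identifiers are stored or matched.

```python
from typing import Dict, Optional

ENV_KEY = "env"          # hypothetical message property name
DEFAULT_ENV = "stable"   # hypothetical name for the shared baseline environment


def tag_message(properties: Dict[str, str], current_env: Optional[str]) -> Dict[str, str]:
    """Copy the caller's environment tag onto an outgoing message."""
    tagged = dict(properties)
    tagged[ENV_KEY] = current_env or DEFAULT_ENV
    return tagged


def should_consume(properties: Dict[str, str], consumer_env: str) -> bool:
    """A consumer only handles messages tagged with its own environment."""
    return properties.get(ENV_KEY, DEFAULT_ENV) == consumer_env
```

Treating untagged messages as belonging to the baseline environment is an assumption of this sketch; it lets older producers keep working unchanged.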

Custom Consumer Thread Pools

Each topic can now specify its own listener and dedicated ExecutorService, preventing high‑priority topics from being throttled by low‑priority ones.

Custom consumer thread pool
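
The production client is Java and uses `ExecutorService`; the sketch below expresses the same idea with Python's `ThreadPoolExecutor` as a stand-in. `TopicDispatcher` and its methods are hypothetical names chosen for the example.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Tuple

Listener = Callable[[bytes], str]


class TopicDispatcher:
    """Routes each topic to its own listener and, optionally, its own pool."""

    def __init__(self, default_workers: int = 4) -> None:
        self._default_pool = ThreadPoolExecutor(max_workers=default_workers)
        self._routes: Dict[str, Tuple[Listener, ThreadPoolExecutor]] = {}

    def subscribe(self, topic: str, listener: Listener, workers: int = 0) -> None:
        """Register a listener; workers > 0 gives the topic a dedicated pool."""
        if workers > 0:
            pool = ThreadPoolExecutor(max_workers=workers, thread_name_prefix=topic)
        else:
            pool = self._default_pool
        self._routes[topic] = (listener, pool)

    def dispatch(self, topic: str, body: bytes) -> Future:
        listener, pool = self._routes[topic]
        return pool.submit(listener, body)

    def shutdown(self) -> None:
        for _, pool in self._routes.values():
            pool.shutdown(wait=True)
        self._default_pool.shutdown(wait=True)
```

With a dedicated pool, a backlog on a low-priority topic only fills that topic's queue, so listeners for other topics keep their threads.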

Consumer Rate Limiting & Downgrade

Implemented broker‑side QPS control with dynamic adjustment via the monitoring UI. When the limit is reached, the broker throttles consumption instead of failing the pull request, preserving message order and avoiding unnecessary retries.

Consumer rate limiting
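
The article does not name the limiting algorithm. A token bucket is one common way to get "delay instead of reject" behaviour, and the sketch below uses it under that assumption. `TokenBucket`, its one-second burst size, and the explicit `now` clock parameter are illustrative choices.

```python
class TokenBucket:
    """Token bucket with an injectable clock; parameters are illustrative."""

    def __init__(self, qps: float, now: float = 0.0) -> None:
        self.qps = qps
        self.tokens = qps      # allow a burst of up to one second of traffic
        self.updated = now

    def set_qps(self, qps: float) -> None:
        """Dynamic adjustment, e.g. triggered from a management UI."""
        self.qps = qps
        self.tokens = min(self.tokens, qps)

    def acquire(self, count: int, now: float) -> float:
        """Reserve `count` messages; return seconds the pull should be delayed."""
        elapsed = now - self.updated
        self.tokens = min(self.qps, self.tokens + elapsed * self.qps)
        self.updated = now
        self.tokens -= count   # may go negative: the debt is paid by waiting
        return 0.0 if self.tokens >= 0 else -self.tokens / self.qps
```

Returning a delay rather than an error means the caller can hold the pull request for that long and then serve it, so the consumer sees slower delivery but no failure and no retry.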

Network Concurrency Bug Fix

Identified and fixed a race condition in the client remoting layer that caused frequent connection timeouts in cross‑region deployments. The fix ensures that once a connection is established it is not inadvertently closed by concurrent threads.

Network bug fix
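
The actual patch is not shown in the article. One common way to prevent this class of race is to guard the connection table with a lock and let a thread close an entry only if it still refers to the connection that thread was using. The sketch below illustrates that pattern; `ChannelTable` and the `connect` factory are hypothetical.

```python
import threading
from typing import Callable, Dict


class ChannelTable:
    """Address -> connection map guarded by a lock."""

    def __init__(self, connect: Callable[[str], object]) -> None:
        self._connect = connect          # hypothetical connection factory
        self._lock = threading.Lock()
        self._channels: Dict[str, object] = {}

    def get_or_create(self, addr: str) -> object:
        """Return the live connection for addr, creating it at most once."""
        with self._lock:
            channel = self._channels.get(addr)
            if channel is None:
                channel = self._connect(addr)
                self._channels[addr] = channel
            return channel

    def close(self, addr: str, channel: object) -> bool:
        """Remove the entry only if it still points at the caller's channel."""
        with self._lock:
            if self._channels.get(addr) is not channel:
                return False   # a newer connection replaced it; leave it alone
            del self._channels[addr]
            return True
```

The identity check in `close` is the important part: a thread holding a stale reference cannot tear down a connection that another thread has just established.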

Production Deployment

Six RocketMQ clusters are in production, serving both ordered and normal messages with high reliability. More than 200 applications depend on the queue, with peak QPS above 300k and more than 800 topics (some test clusters exceed 4,000 topics).

Since November 2018, Kafka has been prohibited for online services, and more than 70% of business traffic has migrated to the custom RocketMQ solution, markedly improving system stability.

Daily message volume exceeds 800 million, feeding more than 60 Flink jobs for downstream analytics.

Future Plans

Extend the queue to support log transport, integrating monitoring and alerting into big‑data pipelines.

Add routing capabilities to reduce cross‑region traffic.

Promote the solution to other NetEase products and collaborate with the Hangzhou research team for broader adoption.
