How We Replaced RabbitMQ with RocketMQ for a High‑Performance, Highly‑Available Messaging Platform

This article details the challenges of scaling RabbitMQ, the evaluation of RocketMQ versus Pulsar, the design of a new messaging middleware platform with high availability, high performance, and rich features, and the step‑by‑step migration strategy that enabled a seamless, low‑cost transition for massive business traffic.

dbaplus Community

Background

Vivo’s Internet Middleware team built a high‑availability RabbitMQ platform in 2016. Rapid business growth exposed three critical limitations:

High‑availability : the cluster could enter a split‑brain state, which required manual intervention to recover and risked data loss.

Performance : each queue was bound to a single broker node, which became a bottleneck under high concurrency. A three‑node RabbitMQ cluster sustained only tens of thousands of TPS for 1 KB messages, and a large message backlog further degraded throughput and lengthened recovery time.

Feature set : no support for transactional or ordered messages, limited dead‑letter handling, and no built‑in message tracing. In addition, failed messages were redelivered immediately, which could block downstream consumption.

Project Goals

Business requirements : ultra‑high throughput (≥100 k TPS for 1 KB payloads), 99.99% service availability, 99.99999999% data reliability, and rich messaging capabilities (clustering, broadcast, transaction, ordering, delayed delivery, dead‑letter, tracing).

Operational requirements : fine‑grained permission control, traffic isolation, comprehensive observability, extensibility for future cloud‑native deployment, and seamless migration from the existing RabbitMQ platform.
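To make the business targets above more tangible, the following sketch converts them into a downtime budget and a payload bandwidth figure. The class and method names are illustrative only, and decimal units (1 MB = 1000 KB) are an assumption.

```java
// Back-of-the-envelope numbers implied by the stated targets.
// SlaMath and its methods are hypothetical helpers, not part of the platform.
final class SlaMath {

    /** Minutes of downtime per 365-day year permitted by an availability target. */
    static double downtimeMinutesPerYear(double availability) {
        return (1.0 - availability) * 365 * 24 * 60;
    }

    /** Payload bandwidth in MB/s (assuming 1 MB = 1000 KB) for a TPS and message size in KB. */
    static double payloadMBPerSecond(long tps, double messageKB) {
        return tps * messageKB / 1000.0;
    }

    public static void main(String[] args) {
        System.out.printf("99.99%% availability allows %.2f min of downtime per year%n",
                downtimeMinutesPerYear(0.9999));
        System.out.printf("100k TPS at 1 KB is %.0f MB/s of payload%n",
                payloadMBPerSecond(100_000, 1.0));
    }
}
```

Under these assumptions, 99.99% availability leaves roughly 52.56 minutes of downtime per year, and 100 k TPS of 1 KB messages amounts to about 100 MB/s of payload before replication and protocol overhead.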

Open‑Source Component Selection

The team evaluated RocketMQ and Apache Pulsar on four dimensions and ultimately chose RocketMQ. Three of those dimensions are summarized here.

High‑availability architecture : Pulsar separates compute (Broker) and storage (BookKeeper), enabling fast failover via ZooKeeper‑managed leader election. RocketMQ relies on master‑slave replication; automatic failover requires custom development.

Scaling & fault recovery : Pulsar allows independent scaling of Brokers and BookKeeper storage nodes with automatic load‑balancing; recovery is measured in seconds. RocketMQ needs manual topic rebalancing after scaling, and failover depends on the master‑slave switch (30‑60 s) unless client‑side strategies are added.

Performance : Pulsar can handle millions of topics (limited by ZooKeeper metadata) and achieves roughly 100 k TPS for 1 KB messages. RocketMQ can also support millions of topics in theory, but in practice a cluster should hold no more than about 5 × 10⁴ topics; it exceeded 100 k TPS for 1 KB messages in internal tests.

Migration Design

To achieve a transparent migration from RabbitMQ to RocketMQ, five technical tasks were defined:

Deploy an AMQP‑proxy gateway that translates RabbitMQ’s AMQP protocol to RocketMQ’s native protocol.

Define and maintain a metadata mapping layer that reconciles RabbitMQ’s exchange/queue semantics with RocketMQ’s topic/consumer‑group model.

Implement a high‑performance, non‑interfering message push mechanism using a semaphore‑driven thread pool. The pool limits the number of concurrent push threads while allowing each thread to service thousands of queues, preventing a “hot queue” from starving others.

Add consumption start/stop controls to pause or resume delivery globally or per node.

Introduce global rate‑limiting to bound aggregate consumption speed and protect downstream services.
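The last two tasks, consumption start/stop and rate limiting, can be sketched as a single gate that every delivery must pass. This is a minimal illustration, not the platform's implementation: the class name ConsumptionGate, the token-bucket algorithm, and the explicit nowNanos parameter (used so the behavior is deterministic) are all assumptions. A truly global limit across several gateway nodes would also need a way to share or split the budget, which is not shown.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical gate combining a pause switch with a token-bucket rate limit.
final class ConsumptionGate {
    private final AtomicBoolean paused = new AtomicBoolean(false);
    private final double ratePerSecond;  // tokens added per second
    private final double capacity;       // maximum burst size
    private double tokens;
    private long lastRefillNanos;

    ConsumptionGate(double ratePerSecond, double capacity, long nowNanos) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.tokens = capacity;          // start with a full bucket
        this.lastRefillNanos = nowNanos;
    }

    void pause()  { paused.set(true); }
    void resume() { paused.set(false); }

    /** Returns true if one message may be delivered at the given time. */
    synchronized boolean tryDeliver(long nowNanos) {
        if (paused.get()) {
            return false;                // delivery is switched off
        }
        // Refill in proportion to elapsed time, never beyond the burst capacity.
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1e9;
        tokens = Math.min(capacity, tokens + elapsedSeconds * ratePerSecond);
        lastRefillNanos = nowNanos;
        if (tokens < 1.0) {
            return false;                // over the rate limit
        }
        tokens -= 1.0;
        return true;
    }
}
```

In a real gateway the caller would pass System.nanoTime(); keeping one gate per node and one shared instance for the whole process would give the per-node and global controls the list describes.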

The chosen push strategy works as follows:

Each queue is represented by a ConsumeMessageService instance that holds a blocking queue of pending messages.

A semaphore caps the total number of messages that may be pushed but not yet acknowledged; each time a client acknowledges a message, a permit is released, allowing the next batch to be dispatched.

Push threads poll all ConsumeMessageService instances and submit tasks to the thread pool only if both the local buffer and the client have capacity, which keeps delivery fair across queues.
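The push strategy described above can be sketched as a semaphore-gated round-robin dispatcher. This is a simplified illustration under several assumptions: QueueBuffer stands in for the per-queue ConsumeMessageService, one message is taken per queue per pass, a single dispatcher thread is assumed, and the thread pool is left out so that the dispatch order is easy to follow (pollOnce returns what would be submitted). All names other than ConsumeMessageService are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: a global semaphore bounds in-flight messages while a
// rotating cursor keeps a busy queue from starving the others.
final class FairPusher {

    /** Stand-in for the per-queue ConsumeMessageService: a buffer of pending messages. */
    static final class QueueBuffer {
        final String name;
        final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
        QueueBuffer(String name) { this.name = name; }
    }

    private final List<QueueBuffer> queues;
    private final Semaphore inFlight;  // permits = messages allowed to be unacknowledged
    private int cursor = 0;            // queue at which the next pass starts

    FairPusher(List<QueueBuffer> queues, int maxInFlight) {
        this.queues = queues;
        this.inFlight = new Semaphore(maxInFlight);
    }

    /** One polling pass: at most one message per queue, only while permits remain. */
    List<String> pollOnce() {
        List<String> pushed = new ArrayList<>();
        int n = queues.size();
        for (int i = 0; i < n; i++) {
            int index = (cursor + i) % n;
            QueueBuffer q = queues.get(index);
            if (q.pending.isEmpty()) {
                continue;
            }
            if (!inFlight.tryAcquire()) {
                cursor = index;        // resume with the queue that had to wait
                return pushed;
            }
            String message = q.pending.poll();
            if (message == null) {     // drained by someone else in the meantime
                inFlight.release();
                continue;
            }
            pushed.add(q.name + ":" + message);
        }
        return pushed;
    }

    /** Called when a client acknowledges a message; frees one permit. */
    void ack() {
        inFlight.release();
    }
}
```

With a limit of two in-flight messages, a queue holding many messages and a queue holding one each get a single slot in the first pass; further messages from the busy queue go out only after acknowledgements release permits.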

Platform Architecture

The final architecture consists of:

AMQP‑proxy gateway : entry point for existing RabbitMQ clients.

mq‑meta service : stores and serves the RabbitMQ‑to‑RocketMQ metadata mapping.

mq‑controller : manages master‑slave switching and exposes HA controls.

Monitoring & load‑balancing modules : provide health metrics, automatic broker rebalancing, and traffic isolation.

All components are containerizable, paving the way for future cloud‑native deployment.
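As a rough picture of what the mq‑meta lookup might involve, the sketch below keeps two in-memory tables. The article does not spell out the actual mapping rules, so mapping an exchange to a topic and a queue to a consumer group, keying entries by virtual host, and every name in the code are assumptions made for illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory version of the RabbitMQ-to-RocketMQ metadata mapping.
final class MqMetaSketch {
    private final Map<String, String> topicByExchange = new ConcurrentHashMap<>();
    private final Map<String, String> groupByQueue = new ConcurrentHashMap<>();

    // Simplified composite key; a real service would need a collision-safe encoding.
    private static String key(String vhost, String name) {
        return vhost + "::" + name;
    }

    void mapExchange(String vhost, String exchange, String topic) {
        topicByExchange.put(key(vhost, exchange), topic);
    }

    void mapQueue(String vhost, String queue, String consumerGroup) {
        groupByQueue.put(key(vhost, queue), consumerGroup);
    }

    /** Topic the gateway should publish to for an AMQP exchange, if one is registered. */
    Optional<String> topicFor(String vhost, String exchange) {
        return Optional.ofNullable(topicByExchange.get(key(vhost, exchange)));
    }

    /** Consumer group the gateway should consume with for an AMQP queue, if registered. */
    Optional<String> groupFor(String vhost, String queue) {
        return Optional.ofNullable(groupByQueue.get(key(vhost, queue)));
    }
}
```

Returning Optional makes the unmapped case explicit, so the gateway can reject or log traffic for exchanges and queues that have not been migrated yet.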

Migration Outcomes

Message throughput rose from tens of thousands of TPS to more than 100 k TPS for 1 KB payloads.

Resource consumption dropped by more than 50%, reducing CPU, memory, and storage footprints.

Retention policy changed to a default 3‑7 day window (configurable per environment).

Message size limit set to 256 KB to prevent oversized payloads.

Global and per‑node consumption pause, as well as configurable rate‑limiting, were introduced.

New features such as unified expiration, gradient retry, broadcast consumption, and full‑stack message tracing became available.
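Gradient retry, as opposed to RabbitMQ's immediate redelivery, spaces out successive attempts with growing delays. The article does not give the schedule, so the ladder below is a made-up example and the class and method names are hypothetical.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical gradient-retry schedule: each failed attempt waits longer than the last.
final class GradientRetry {
    // Illustrative delay ladder; the platform's real values are not stated in the article.
    private static final List<Duration> LADDER = List.of(
            Duration.ofSeconds(10),
            Duration.ofSeconds(30),
            Duration.ofMinutes(1),
            Duration.ofMinutes(5),
            Duration.ofMinutes(30),
            Duration.ofHours(2));

    /** Delay before the next delivery after the given number of failed attempts (1-based). */
    static Duration delayFor(int failedAttempts) {
        if (failedAttempts < 1) {
            throw new IllegalArgumentException("failedAttempts must be >= 1");
        }
        // Attempts beyond the end of the ladder reuse the longest delay.
        int index = Math.min(failedAttempts, LADDER.size()) - 1;
        return LADDER.get(index);
    }
}
```

Because a failing message waits progressively longer before it is retried, it no longer blocks the messages queued behind it.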

Future Outlook

Leverage the AMQP‑proxy gateway to implement advanced governance (e.g., traffic shaping, quota enforcement).

Transition to a gRPC‑based queue engine service that abstracts the underlying middleware, allowing applications to switch implementations without code changes.

Investigate RocketMQ 5.0’s compute‑storage separation architecture for the next generation of the platform.

Figure: Pulsar deployment architecture
Figure: RocketMQ deployment architecture
Figure: Message push flow diagram