How We Replaced RabbitMQ with RocketMQ for a High‑Performance, Highly‑Available Messaging Platform
This article covers the challenges of scaling RabbitMQ, the evaluation of RocketMQ versus Pulsar, the design of a new messaging middleware platform built for high availability, high performance, and rich features, and the step‑by‑step migration strategy that enabled a seamless, low‑cost transition for massive business traffic.
Background
Vivo’s Internet Middleware team built a high‑availability RabbitMQ platform in 2016. Rapid business growth exposed three critical limitations:
High‑availability : the cluster was prone to split‑brain, which required manual intervention to recover and risked data loss.
Performance : each queue was bound to a single broker node, causing bottlenecks under high concurrency. A three‑node RabbitMQ cluster could sustain only tens of thousands of TPS for 1 KB messages, and large message backlogs degraded throughput and lengthened recovery time.
Feature set : no support for transactional or ordered messages, limited dead‑letter handling, no built‑in message tracing, and immediate redelivery of failed messages could block downstream consumption.
Project Goals
Business requirements : ultra‑high throughput (≥100 k TPS for 1 KB payloads), 99.99% service availability, 99.99999999% data reliability, and rich messaging capabilities (cluster and broadcast consumption, transactional and ordered messages, delayed delivery, dead‑letter handling, and tracing).
Operational requirements : fine‑grained permission control, traffic isolation, comprehensive observability, extensibility for future cloud‑native deployment, and seamless migration from the existing RabbitMQ platform.
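To make the availability and reliability targets concrete, the following sketch converts them into budgets. It assumes the availability figure is measured over a non‑leap year and that the reliability figure can be read as a per‑message survival probability. Both readings, and the sustained‑traffic figure, are illustrative assumptions rather than definitions from the original project.

```python
# Back-of-the-envelope arithmetic for the stated targets.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def expected_lost_messages(reliability: float, messages: int) -> float:
    """Expected losses if reliability is a per-message survival probability
    (an assumed interpretation, not stated in the article)."""
    return (1.0 - reliability) * messages


downtime = allowed_downtime_minutes(0.9999)

# Hypothetical load: 100k TPS sustained for a full day.
daily_messages = 100_000 * 86_400
lost = expected_lost_messages(0.9999999999, daily_messages)

print(f"downtime budget: {downtime:.2f} minutes per year")
print(f"expected loss:   {lost:.3f} messages per day")
```

Under these assumptions, 99.99% availability leaves a budget of roughly 52.6 minutes of downtime per year, and ten nines of reliability corresponds to less than one lost message per day at the target throughput.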
Open‑Source Component Selection
The team evaluated RocketMQ and Apache Pulsar along several dimensions; the key findings are summarized below.
High‑availability architecture : Pulsar separates compute (Broker) and storage (BookKeeper), enabling fast failover via ZooKeeper‑managed leader election. RocketMQ relies on master‑slave replication; automatic failover requires custom development.
Scaling & fault recovery : Pulsar allows independent scaling of Brokers and BookKeeper with automatic load‑balancing; recovery is measured in seconds. RocketMQ needs manual topic rebalancing after scaling, and failover depends on the master‑slave switch (30‑60 s) unless client‑side strategies are added.
Performance : Pulsar can handle millions of topics (limited by ZooKeeper metadata) and achieves ~100 k TPS for 1 KB messages. RocketMQ can also support millions of topics in theory, but in practice a single cluster is better kept to ≤5 × 10⁴ topics; it exceeded 100 k TPS for 1 KB messages in internal tests.
Migration Construction
To achieve a transparent migration from RabbitMQ to RocketMQ, five technical tasks were defined:
Deploy an AMQP‑proxy gateway that translates RabbitMQ’s AMQP protocol to RocketMQ’s native protocol.
Define and maintain a metadata mapping layer that reconciles RabbitMQ’s exchange/queue semantics with RocketMQ’s topic/consumer‑group model.
Implement a high‑performance, non‑interfering message push mechanism using a semaphore‑driven thread pool. The pool limits the number of concurrent push threads while allowing each thread to service thousands of queues, preventing a “hot queue” from starving others.
Add consumption start/stop controls to pause or resume delivery globally or per node.
Introduce global rate‑limiting to bound aggregate consumption speed and protect downstream services.
The chosen push strategy works as follows:
Each queue is represented by a ConsumeMessageService instance that holds a blocking queue of pending messages.
A semaphore caps the total number of messages that may be in flight at once; when a client acknowledges a message, a permit is released, allowing the next batch to be dispatched.
Push threads poll all ConsumeMessageService instances and submit a task to the thread pool only if both the local buffer and the client have capacity, which preserves fairness across queues.
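The push strategy described above can be sketched roughly as follows. This is a simplified model, not the gateway's actual code: the `Pusher` class, the per‑client `prefetch` window, the one‑message‑per‑queue‑per‑pass rule, and the rotating start index are assumptions chosen to make the fairness property visible. Only the `ConsumeMessageService` name comes from the article.

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class ConsumeMessageService:
    """One instance per queue: pending messages plus client capacity."""

    def __init__(self, name, prefetch):
        self.name = name
        self.pending = deque()  # stand-in for the blocking queue in the text
        # Assumed per-client window of unacknowledged messages.
        self.client_slots = threading.Semaphore(prefetch)


class Pusher:
    def __init__(self, services, max_in_flight, workers, deliver):
        self.services = list(services)
        # Global cap on messages pushed but not yet acknowledged.
        self.global_slots = threading.Semaphore(max_in_flight)
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.deliver = deliver  # callable(service, message) supplied by caller
        self._start = 0

    def poll_once(self):
        """One pass over all queues, taking at most one message from each."""
        n = len(self.services)
        submitted = 0
        for i in range(n):
            svc = self.services[(self._start + i) % n]
            if not svc.pending:
                continue  # nothing buffered for this queue
            if not svc.client_slots.acquire(blocking=False):
                continue  # this client is full; skip it, do not block others
            if not self.global_slots.acquire(blocking=False):
                svc.client_slots.release()
                break  # global budget exhausted for this pass
            self.pool.submit(self.deliver, svc, svc.pending.popleft())
            submitted += 1
        # Rotate the starting queue so no queue is always served first.
        self._start = (self._start + 1) % n if n else 0
        return submitted

    def ack(self, svc):
        """Called when the client acknowledges a message from this queue."""
        svc.client_slots.release()
        self.global_slots.release()
```

Because a full client is skipped with a non‑blocking acquire instead of being waited on, a slow or busy queue costs the polling thread almost nothing, and a small fixed pool can serve many queues. A production version would loop `poll_once` on dedicated threads and handle delivery failures, which this sketch omits.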
Platform Architecture
The final architecture consists of:
AMQP‑proxy gateway : entry point for existing RabbitMQ clients.
mq‑meta service : stores and serves the RabbitMQ‑to‑RocketMQ metadata mapping.
mq‑controller : manages master‑slave switching and exposes HA controls.
Monitoring & load‑balancing modules : provide health metrics, automatic broker rebalancing, and traffic isolation.
All components are containerizable, paving the way for future cloud‑native deployment.
Migration Outcomes
Message throughput rose from tens of thousands of TPS to >100 k TPS for 1 KB payloads.
Resource consumption dropped by more than 50% across CPU, memory, and storage.
Retention policy changed to a default 3‑7 day window (configurable per environment).
Message size limit set to 256 KB to prevent oversized payloads.
Global and per‑node consumption pause, as well as configurable rate‑limiting, were introduced.
New features such as unified expiration, gradient retry, broadcast consumption, and full‑stack message tracing became available.
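The consumption pause and rate‑limiting controls listed above could be combined in a single gate that the push path consults before each delivery. The sketch below uses a token bucket, which is one common way to implement rate limiting; the article does not say which algorithm the platform uses. The `ConsumeGate` class and its parameters are hypothetical, and the state is local to one process, so a truly global limit would additionally need shared or coordinated state across gateway nodes.

```python
import threading
import time


class ConsumeGate:
    """Token-bucket rate limiter with a pause switch (single-process sketch)."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = float(rate_per_sec)   # tokens refilled per second
        self.burst = float(burst)         # maximum tokens that can accumulate
        self.clock = clock                # injectable for deterministic tests
        self.tokens = float(burst)
        self.last = clock()
        self.paused = False
        self._lock = threading.Lock()

    def pause(self):
        with self._lock:
            self.paused = True

    def resume(self):
        with self._lock:
            self.paused = False

    def try_acquire(self, n=1):
        """Return True if n messages may be delivered right now."""
        with self._lock:
            now = self.clock()
            # Refill in proportion to elapsed time, capped at the burst size.
            elapsed = now - self.last
            self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
            self.last = now
            if self.paused or self.tokens < n:
                return False
            self.tokens -= n
            return True
```

A caller that receives `False` would leave the message in its buffer and retry on a later polling pass, so pausing or throttling consumption never drops messages in this model.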
Future Outlook
Leverage the AMQP‑proxy gateway to implement advanced governance (e.g., traffic shaping, quota enforcement).
Transition to a gRPC‑based queue engine service that abstracts the underlying middleware, allowing applications to switch implementations without code changes.
Investigate RocketMQ 5.0’s compute‑storage separation architecture for the next generation of the platform.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
