Pulsar vs RocketMQ: Architecture, Cost Benefits, and Migration Strategy for Xiaohongshu Online Messaging

Xiaohongshu is migrating its RocketMQ-based online messaging platform to Apache Pulsar. The team estimates up to 48 % lower total cost, up to 43 % higher CPU utilization, and roughly 30 % resource savings from elastic scaling, and P99 latency has dropped from 20.2 ms to 5.7 ms so far. The move relies on a cloud-native architecture and a phased migration strategy.

Xiaohongshu Tech REDtech

This article presents a technical case study of message‑queue selection at Xiaohongshu, comparing the features of Apache Pulsar and RocketMQ (5.x) and describing how Pulsar was adopted for the company’s online messaging platform.

1. Background

Message queues (MQ) are a core component of distributed systems, providing asynchronous communication, decoupling applications, and improving availability and scalability.

2. Industry Trends

While Kafka dominates offline big-data pipelines, online messaging workloads that need transactional, delayed, or dead-letter messages tend to favor RocketMQ or Pulsar.

3. Current Situation at Xiaohongshu

Red Events MQ is built on DDMQ with a Discovery + Proxy + RocketMQ engine architecture. The system consists of a control platform for topic management, service discovery, a Proxy layer that abstracts the underlying MQ, and heterogeneous client SDKs.

The existing deployment uses RocketMQ 2‑replica clusters (32 CPU + 128 GB RAM + 16 TB disk) and suffers from under‑utilized slave nodes.

4. Problems Identified

High storage cost due to the 2-replica strategy.

CPU utilization imbalance: master nodes at ~50 % while slaves at ~7 %.

Operational complexity in scaling and maintenance.

5. Evolution Roadmap

The team evaluated Pulsar versus RocketMQ and chose Pulsar for the following advantages:

Cost Reduction: At current traffic, storage cost is estimated to fall by 27 % and overall cost by 48 %; the team projects further savings if traffic grows 10×.

CPU Utilization: Eliminating the master/slave imbalance can improve CPU utilization by up to 43 %, which translates to a 21.5 % cost saving.

Elastic Scaling & Pay-as-You-Go: Elastic resource allocation can cut resource consumption by roughly 30 %.

Operations Friendly: Cloud-native deployment on Kubernetes (K8s) enables zero-downtime scale-out and smooth scale-in without manual partition migration.
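
The 43 % and 21.5 % figures are consistent with one reading of the utilization numbers in section 4, sketched below. This is an interpretation rather than a calculation given in the source: it assumes one slave per master (the 2-replica layout) and that a balanced cluster would run every node at the current master level of about 50 %.

```python
# Illustrative arithmetic only. The one-slave-per-master pairing and the
# 50 % balanced target are assumptions, not figures stated by the team.
MASTER_UTIL = 50.0  # % CPU on master nodes (section 4)
SLAVE_UTIL = 7.0    # % CPU on slave nodes (section 4)


def fleet_average(master: float, slave: float) -> float:
    """Average utilization for a fleet with one slave per master."""
    return (master + slave) / 2


current_avg = fleet_average(MASTER_UTIL, SLAVE_UTIL)
balanced_avg = fleet_average(MASTER_UTIL, MASTER_UTIL)

slave_gain = MASTER_UTIL - SLAVE_UTIL   # gain on the former slave nodes
fleet_gain = balanced_avg - current_avg  # gain averaged over all nodes

print(f"current fleet average: {current_avg:.1f} %")
print(f"gain on slave nodes:   {slave_gain:.1f} points")
print(f"gain across the fleet: {fleet_gain:.1f} points")
```

Under these assumptions the slave nodes gain 43 points and the fleet as a whole gains 21.5 points, which matches the two numbers quoted above.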

6. Design Goals

Standardized client (Events Client) covering all languages and scenarios.

Proxy layer to hide underlying MQ engine.

Cloud‑native, compute‑storage separation architecture (Pulsar clusters + BookKeeper).

Full‑stack observability and automated fault‑tolerance.
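
The Proxy goal above can be pictured as a thin routing layer in front of interchangeable engines. The following is a minimal sketch, not the Events Client or Proxy API: `MessageEngine`, `InMemoryEngine`, and `EventsProxy` are hypothetical names, and the engines are in-memory stand-ins where a real adapter would wrap a RocketMQ or Pulsar client.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class MessageEngine(ABC):
    """Hypothetical engine interface the proxy codes against."""

    @abstractmethod
    def publish(self, topic: str, payload: bytes) -> None: ...

    @abstractmethod
    def fetch(self, topic: str) -> list[bytes]: ...


class InMemoryEngine(MessageEngine):
    """In-memory stand-in; a real adapter would wrap an MQ client SDK."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._topics: dict[str, list[bytes]] = defaultdict(list)

    def publish(self, topic: str, payload: bytes) -> None:
        self._topics[topic].append(payload)

    def fetch(self, topic: str) -> list[bytes]:
        return list(self._topics[topic])


class EventsProxy:
    """Clients talk to the proxy; the engine behind a topic can change."""

    def __init__(self, default: MessageEngine) -> None:
        self._default = default
        self._overrides: dict[str, MessageEngine] = {}

    def route(self, topic: str, engine: MessageEngine) -> None:
        """Point one topic at a different engine without touching clients."""
        self._overrides[topic] = engine

    def _engine_for(self, topic: str) -> MessageEngine:
        return self._overrides.get(topic, self._default)

    def send(self, topic: str, payload: bytes) -> None:
        self._engine_for(topic).publish(topic, payload)

    def receive(self, topic: str) -> list[bytes]:
        return self._engine_for(topic).fetch(topic)


rocketmq = InMemoryEngine("rocketmq")
pulsar = InMemoryEngine("pulsar")
proxy = EventsProxy(default=rocketmq)

proxy.send("orders", b"before-switch")
proxy.route("orders", pulsar)          # switch the engine behind the topic
proxy.send("orders", b"after-switch")
```

The client code calls `send` and `receive` the same way before and after the switch. The sketch deliberately ignores what a real cutover must handle, such as draining messages still held by the old engine.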

7. Migration Path

Gradual migration in priority order, starting with low-priority services and ending with high-priority ones.

Target 100 % of traffic on Pulsar, using client standardization to drive adoption; Pulsar initially carries 11 % of traffic.

Improve resource efficiency during the migration, raising CPU utilization from 30 % to 50 %.
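
A phased rollout of this kind is often implemented as a priority ordering plus a per-topic traffic percentage. The sketch below shows one way to do that under stated assumptions: the topic names, the priority numbers, and the hash-bucket routing rule are all hypothetical and are not described in the source.

```python
import hashlib


def bucket(key: str) -> int:
    """Map a message key to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_engine(key: str, pulsar_percent: int) -> str:
    """Route the key to Pulsar if its bucket is below the rollout percentage."""
    return "pulsar" if bucket(key) < pulsar_percent else "rocketmq"


def migration_order(topics: dict[str, int]) -> list[str]:
    """Order topics from lowest to highest business priority.

    Assumed convention: a larger number means a higher priority.
    """
    return sorted(topics, key=lambda name: (topics[name], name))


# Hypothetical topics with hypothetical priorities.
topics = {"checkout": 3, "notifications": 2, "analytics-log": 1}
order = migration_order(topics)
print(order)

# Raising the percentage only ever moves more keys to Pulsar; a key that
# was already routed to Pulsar stays there.
keys = [f"user-{i}" for i in range(1000)]
at_11 = {k for k in keys if choose_engine(k, 11) == "pulsar"}
at_50 = {k for k in keys if choose_engine(k, 50) == "pulsar"}
print(at_11 <= at_50)
```

Using a cryptographic hash instead of Python's built-in `hash` keeps the routing identical across processes and restarts, which matters when several proxy instances must agree on where a key goes.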

8. Results So Far

Overall cost reduced by 42 % (mainly storage).

CPU utilization increased from 34 % to 60 %.

End‑to‑end latency (P99) dropped from 20.2 ms to 5.7 ms.

Operational workload decreased thanks to cloud‑native automation.
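
For comparison, the reported before/after figures can be turned into relative changes. The relative percentages below are derived here from the numbers above and are not stated in the source.

```python
def relative_change(before: float, after: float) -> float:
    """Change from before to after, as a percentage of before."""
    return (after - before) / before * 100


latency_change = relative_change(20.2, 5.7)  # P99 latency in ms
cpu_change = relative_change(34.0, 60.0)     # CPU utilization in %
cpu_points = 60.0 - 34.0                     # absolute gain in points

print(f"P99 latency:     {latency_change:+.1f} %")  # -71.8 %
print(f"CPU utilization: +{cpu_points:.0f} points ({cpu_change:+.1f} % relative)")  # +26 points (+76.5 % relative)
```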

The article also includes architectural diagrams of Pulsar, RocketMQ 5.0, and the proposed cloud‑native design, as well as references to official documentation and upcoming meetups.

Tags: Cloud Native, scalability, cost-optimization, Message Queue, rocketmq, Apache Pulsar