
How ByteDance Scaled RocketMQ with Proxies for High‑Performance Messaging and Disaster Recovery

This article details the migration of ByteDance (Toutiao) to RocketMQ, covering the business drivers, the reasons for choosing RocketMQ, the proxy‑based architecture for producers and consumers, the performance gains, the operational challenges, and the disaster‑recovery designs that keep the system highly available in a micro‑service environment.

21CTO

1. Business Background

ByteDance relies heavily on micro‑services, which results in a massive number of containers and topics, with services written in many programming languages (Python, Go, C++, Java, JS). Maintaining SDKs for basic components in every language was costly, and the previous queues (NSQ and Kafka) showed limitations in persistence, CPU usage, and latency under high concurrency.

2. Why Choose RocketMQ

RocketMQ, an open‑source message queue validated by Alibaba during Double‑11, offers high reliability, data persistence, and multi‑replica storage. Its append‑only commitlog is shared by all topics, so write latency stays stable even with a very large number of topics. Benchmarks showed a single‑node throughput of 140k QPS and latency under 2 ms, outperforming NSQ and Kafka in multi‑topic scenarios.
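To make the commitlog idea concrete, here is a minimal in-memory sketch. It is a toy model, not RocketMQ's storage code: the `CommitLog` class and its methods are hypothetical names chosen for illustration. Every topic appends to one shared log, and a small per-topic index records where each message lives, so adding topics does not add more files to write to.

```python
from collections import defaultdict


class CommitLog:
    """Toy model of a shared append-only log (hypothetical, not RocketMQ's real code)."""

    def __init__(self):
        self._log = bytearray()            # one sequential log shared by all topics
        self._index = defaultdict(list)    # topic -> list of (offset, length)

    def append(self, topic: str, body: bytes) -> int:
        """Append a message and return its offset in the shared log."""
        offset = len(self._log)
        self._log += body                  # writes are always sequential appends
        self._index[topic].append((offset, len(body)))
        return offset

    def read(self, topic: str, position: int) -> bytes:
        """Read the message at the given position within a topic."""
        offset, length = self._index[topic][position]
        return bytes(self._log[offset:offset + length])


log = CommitLog()
log.append("orders", b"o1")
log.append("clicks", b"c1")
log.append("orders", b"o2")
```

Reading `log.read("orders", 1)` goes through the per-topic index to find the second "orders" message, even though a "clicks" message sits between the two in the shared log.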

Additional features such as retry, concurrent consumption, dead‑letter queues, delayed messages, timestamp‑based backtracking, message headers for tracing, and transactional messages further convinced the team to adopt RocketMQ.

3. RocketMQ Implementation at ByteDance

The deployment architecture includes a lightweight Proxy layer between producers/consumers and brokers, implemented via gRPC (or Thrift). Producers send messages to a Proxy, which forwards them to the appropriate broker cluster, keeping SDKs simple and reducing upgrade overhead.
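The producer path can be sketched roughly as follows. This is an illustration under stated assumptions, not ByteDance's implementation: the `ProducerProxy` class, its route table, and the cluster names are all hypothetical, and plain lists stand in for broker clusters instead of real gRPC or Thrift calls.

```python
class ProducerProxy:
    """Hypothetical proxy: resolves a topic to a broker cluster and forwards the message."""

    def __init__(self, routes, clusters):
        self._routes = routes        # topic -> cluster name (assumed route table)
        self._clusters = clusters    # cluster name -> list standing in for a broker cluster

    def send(self, topic, body):
        """Forward a message to whichever cluster currently owns the topic."""
        cluster = self._routes.get(topic)
        if cluster is None:
            raise KeyError(f"no route for topic {topic!r}")
        self._clusters[cluster].append((topic, body))
        return cluster

    def move_topic(self, topic, cluster):
        """Re-route a topic on the proxy side; producers keep calling send() unchanged."""
        self._routes[topic] = cluster


clusters = {"cluster-a": [], "cluster-b": []}
proxy = ProducerProxy({"orders": "cluster-a"}, clusters)
proxy.send("orders", b"o1")
proxy.move_topic("orders", "cluster-b")
proxy.send("orders", b"o2")
```

The point of the design is visible in `move_topic`: the routing decision lives in the proxy, so a topic can move between clusters without shipping a new SDK to every service.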

Consumers use a Proxy that pulls messages, caches them, and performs a second rebalance, reducing broker page‑cache pollution. This design mirrors Didi's MQ architecture and simplifies scaling, traffic control, and connection management.
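The article does not spell out how the second rebalance works. One plausible reading, sketched below as an assumption, is that the Proxy redistributes the queues it already owns among the consumer clients connected to it. The `rebalance` function and the queue and consumer names are hypothetical.

```python
def rebalance(queues, consumers):
    """Round-robin the proxy's queues across connected consumers (assumed strategy)."""
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment                      # nothing to assign to
    ordered = sorted(consumers)                # deterministic order across runs
    for i, queue in enumerate(sorted(queues)):
        assignment[ordered[i % len(ordered)]].append(queue)
    return assignment


before = rebalance(["q0", "q1", "q2", "q3"], ["consumer-a", "consumer-b"])
after = rebalance(["q0", "q1", "q2", "q3"], ["consumer-a", "consumer-b", "consumer-c"])
```

When a third consumer joins, only this proxy-side assignment changes. The Proxy keeps pulling from the same broker queues, which is why consumer churn does not have to trigger a broker-level rebalance.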

4. Why Use Proxy in Container/Micro‑service Scenarios

Key reasons for the Proxy layer:

SDK remains lightweight.

Traffic can be controlled centrally, enabling flow throttling or redirection across data centers without SDK changes.

Connection explosion is mitigated, especially for Python services that spawn many processes in containers.

Higher consumer concurrency is achievable because the Proxy handles message distribution.

Future storage engine changes are transparent to clients.

Rebalance frequency is reduced, as it occurs only between Proxy and broker.

However, the Proxy adds CPU overhead for RPC serialization/deserialization and a slight latency increase (≈1 ms).
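Central flow throttling, one of the reasons listed above, is commonly done with a token bucket. The sketch below shows that general technique; the article does not say which algorithm ByteDance's Proxy uses, so the `TokenBucket` class and its parameters are assumptions for illustration.

```python
import time


class TokenBucket:
    """Token-bucket limiter a proxy could apply per topic or per client (hypothetical)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self._clock = clock           # injectable clock, handy for tests
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and consume tokens if the request fits the budget."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_now[0] = 0.5                     # half a second later, one token has refilled
later = [bucket.allow(), bucket.allow()]
```

Because the limiter runs inside the Proxy, changing `rate` or `capacity` takes effect for every producer at once, with no SDK release.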

5. Disaster Recovery System Construction

Four disaster‑recovery schemes were evaluated:

Cluster expansion with master‑slave across data centers, leveraging Proxy for traffic steering.

Single‑master mode with MySQL‑style replication and a Mirror‑maker‑like component for message copying.

Bidirectional replication with duplicate writes, requiring header flags to avoid loops.

Dual‑write without mirroring, isolating clusters per data center.
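The header flag needed by the bidirectional scheme can be illustrated with a short sketch. The header name `x-origin-dc` and both helper functions are hypothetical; the source only states that a flag in the message headers is required to avoid replication loops.

```python
ORIGIN_HEADER = "x-origin-dc"   # hypothetical header name, not a RocketMQ property


def stamp_origin(headers, local_dc):
    """Return a copy of the headers that records where the message was first written."""
    stamped = dict(headers)
    stamped.setdefault(ORIGIN_HEADER, local_dc)   # never overwrite an existing origin
    return stamped


def should_replicate(headers, local_dc):
    """Replicate only messages that originated locally; skip copies from the peer."""
    return headers.get(ORIGIN_HEADER, local_dc) == local_dc


# A message produced in dc-a is stamped and copied to dc-b.
produced = {"trace-id": "t1"}
copy_sent_to_b = stamp_origin(produced, "dc-a")
replicate_from_a = should_replicate(produced, "dc-a")
# The mirror in dc-b sees the copy and must not send it back to dc-a.
replicate_from_b = should_replicate(copy_sent_to_b, "dc-b")
```

Without the origin stamp, the mirror in each data center would treat the peer's copies as new local writes and bounce them back indefinitely.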

The final choice combines dual‑write isolation for ordered messages (routed to a primary data center) with near‑write, meaning a write to the nearest data center, for unordered messages. The Proxy fetches broker queue information and routes each message accordingly. This approach balances simplicity, high availability, and operational overhead.
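The routing rule described above can be sketched as a small decision function. The function name, the data-center names, and the fallback to another healthy data center for unordered messages are assumptions added for illustration; the source does not describe failover behaviour in this detail.

```python
def choose_data_center(ordered, local_dc, primary_dc, healthy):
    """Pick the data center a proxy should write to (hypothetical routing rule)."""
    if ordered:
        # Ordered messages always go to one primary data center to preserve ordering.
        # Failover of the primary itself is out of scope for this sketch.
        return primary_dc
    if local_dc in healthy:
        return local_dc                # near-write: prefer the proxy's own data center
    for dc in sorted(healthy):
        return dc                      # assumed fallback: first other healthy data center
    raise RuntimeError("no healthy data center available")


healthy = {"dc-a", "dc-b"}
ordered_target = choose_data_center(True, "dc-b", "dc-a", healthy)
unordered_target = choose_data_center(False, "dc-b", "dc-a", healthy)
failover_target = choose_data_center(False, "dc-b", "dc-a", {"dc-a"})
```

Unordered traffic stays local and keeps flowing if one data center fails, while ordered traffic accepts cross-data-center latency in exchange for a single write location.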

Overall, the proxy‑enhanced RocketMQ deployment provides ByteDance with a scalable, low‑latency messaging backbone while supporting robust disaster‑recovery strategies.

Written by

21CTO
