RocketMQ Practices and Disaster‑Recovery Architecture at ByteDance

This article summarizes Shen Hui’s presentation on how ByteDance adopted RocketMQ in a massive micro‑service environment, detailing the business background, reasons for choosing RocketMQ, the proxy‑based deployment, encountered challenges, and the multi‑data‑center disaster‑recovery solutions implemented.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
RocketMQ Practices and Disaster‑Recovery Architecture at ByteDance

This article compiles Shen Hui’s talk at the RocketMQ developer salon, where he shares ByteDance’s practical experience with RocketMQ in a micro‑service architecture and its disaster‑recovery system.

ByteDance runs a huge number of micro‑services, containers, and topics across many languages (Python, Go, C++, Java, JS), making SDK maintenance costly.

Before RocketMQ, they used NSQ (in‑memory, high CPU) and Kafka (high throughput but performance drops with many topics). RocketMQ was chosen for its high reliability, data persistence, multi‑replica storage, append‑only commitlog, and strong performance (up to 140k QPS with latency <2 ms). It also offers features such as retry, concurrent consumption, DLQ, delayed messages, message tracing via headers, and transactional messages.

ByteDance’s deployment introduces a Proxy layer (implemented with gRPC, also possible with Thrift) to keep client SDKs lightweight. Producers send messages to a Proxy, which forwards them to the appropriate Broker cluster; Consumers interact with a Consumer Proxy that handles pull, rebalancing, and caching. This design reduces SDK complexity, enables traffic control, lowers connection counts, improves consumer concurrency, and allows seamless MQ engine swaps, at the cost of some CPU overhead and a slight latency increase (~1 ms).

The Consumer side uses multiple Proxies and assigns queues based on ordered or unordered messages. Ordered messages are routed to a primary data‑center, while unordered messages can be consumed locally, allowing the consumer instance count to exceed the number of queues.

During implementation, several issues were encountered: message size limits (default 4 MB), connection pooling, RPC timeout handling, choosing an appropriate rebalance algorithm for dual‑data‑center deployments, cache design for pulling messages, and end‑to‑end compression (large messages are compressed on the client side to avoid overloading the Proxy).

For disaster recovery, three schemes were evaluated: (1) master‑slave clusters across data centers with traffic routing via the Proxy; (2) single‑master with mirror‑maker style replication; (3) dual‑write with bidirectional replication. The final solution combines dual‑write without a mirror, routing ordered messages to a primary data‑center and unordered messages to the nearest one, balancing simplicity and reliability.

The presentation concludes with a summary of the architecture and acknowledges the author, Shen Hui, a ByteDance infrastructure engineer.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

disaster recoveryMessage QueueRocketMQProxy Architecture
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.