RocketMQ Practices and Disaster‑Recovery Architecture at ByteDance
This article summarizes Shen Hui’s presentation on how ByteDance adopted RocketMQ in a massive micro‑service environment, detailing the business background, reasons for choosing RocketMQ, the proxy‑based deployment, encountered challenges, and the multi‑data‑center disaster‑recovery solutions implemented.
This article compiles Shen Hui’s talk at the RocketMQ developer salon, where he shares ByteDance’s practical experience with RocketMQ in a micro‑service architecture and its disaster‑recovery system.
ByteDance runs a huge number of micro‑services, containers, and topics across many languages (Python, Go, C++, Java, JS), making SDK maintenance costly.
Before RocketMQ, they used NSQ (in‑memory, high CPU) and Kafka (high throughput but performance drops with many topics). RocketMQ was chosen for its high reliability, data persistence, multi‑replica storage, append‑only commitlog, and strong performance (up to 140k QPS with latency <2 ms). It also offers features such as retry, concurrent consumption, DLQ, delayed messages, message tracing via headers, and transactional messages.
ByteDance’s deployment introduces a Proxy layer (implemented with gRPC, also possible with Thrift) to keep client SDKs lightweight. Producers send messages to a Proxy, which forwards them to the appropriate Broker cluster; Consumers interact with a Consumer Proxy that handles pull, rebalancing, and caching. This design reduces SDK complexity, enables traffic control, lowers connection counts, improves consumer concurrency, and allows seamless MQ engine swaps, at the cost of some CPU overhead and a slight latency increase (~1 ms).
The Consumer side uses multiple Proxies and assigns queues based on ordered or unordered messages. Ordered messages are routed to a primary data‑center, while unordered messages can be consumed locally, allowing the consumer instance count to exceed the number of queues.
During implementation, several issues were encountered: message size limits (default 4 MB), connection pooling, RPC timeout handling, choosing an appropriate rebalance algorithm for dual‑data‑center deployments, cache design for pulling messages, and end‑to‑end compression (large messages are compressed on the client side to avoid overloading the Proxy).
For disaster recovery, three schemes were evaluated: (1) master‑slave clusters across data centers with traffic routing via the Proxy; (2) single‑master with mirror‑maker style replication; (3) dual‑write with bidirectional replication. The final solution combines dual‑write without a mirror, routing ordered messages to a primary data‑center and unordered messages to the nearest one, balancing simplicity and reliability.
The presentation concludes with a summary of the architecture and acknowledges the author, Shen Hui, a ByteDance infrastructure engineer.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
