How ByteDance Scales RocketMQ for Microservices and Disaster Recovery
This article details ByteDance’s adoption of RocketMQ within its massive microservice ecosystem, explaining the business challenges, reasons for choosing RocketMQ, the proxy‑based deployment architecture, encountered technical issues, and the multi‑datacenter disaster‑recovery strategies implemented to ensure high availability and performance.
Business Background
ByteDance’s services heavily rely on microservices, with a huge number of containers and topics across many languages (Python, Go, C++, Java, JS). Maintaining SDKs for basic components incurs high cost.
Why Choose RocketMQ
RocketMQ, validated by Alibaba during Double‑11, offers high reliability, data persistence with multi‑replica commitlog, and excellent performance (single‑node QPS up to 140k, latency <2 ms). It supports retries, concurrent consumption, dead‑letter queues, delayed messages, timestamp rollback, message headers for tracing, and transactional messages—features lacking or inefficient in NSQ and Kafka.
RocketMQ Deployment at ByteDance
Architecture includes a lightweight Proxy layer (implemented with gRPC, also compatible with Thrift) that abstracts the client. Producers send messages to the Proxy, which forwards them to the appropriate Broker cluster. Consumers use a Consumer Proxy that handles pull, re‑balance, and caches messages to reduce broker page‑cache pollution. This design mirrors Didi’s MQ architecture and simplifies SDK upgrades.
Why Use a Proxy in Container/Microservice Scenarios
SDK remains lightweight.
Traffic can be controlled centrally (e.g., throttling overloaded brokers or redirecting traffic to another data center).
Reduces connection explosion for languages like Python that spawn many processes.
Enables higher consumer concurrency by decoupling consumer instances from broker queues.
Allows seamless swapping of underlying MQ engines without client changes.
Minimizes rebalance frequency during consumer restarts, reducing message backlog.
The proxy adds CPU overhead for RPC serialization and a slight latency increase (≈1 ms, from 2 ms to 3 ms), which is acceptable for the workload.
Consumer Logic
Consumers are assigned queues by the Proxy. For ordered messages, each Proxy maintains a static queue assignment; for unordered messages, the Proxy aggregates queues and dispatches them to consumers, allowing virtually unlimited consumer scaling. In container environments, multiple consumer instances are launched based on CPU cores (e.g., 8C → 8 threads).
Challenges Encountered
Message size limit: RPC adds request‑size constraints; RocketMQ’s default max message size is 4 MB, requiring broker and producer configuration changes for larger payloads.
Multiple connections: Producers share a single MQ client instance; additional sockets can be created via setInstanceName (added in 4.5.0).
Timeout settings: RPC introduces its own RTT and timeout semantics that must be propagated to the SDK.
Rebalance algorithm selection for cross‑datacenter deployments to avoid assigning empty queues.
Message pull caching to prevent OOM when only counting message numbers.
End‑to‑end compression: Large messages (>4 KB) are compressed on the producer side; performing compression in the Proxy would add undue load.
ByteDance Disaster Recovery System Construction
Three primary schemes were evaluated:
Expanded cluster with master‑slave across data centers, using Proxy for traffic steering.
Single‑master with mirror‑maker style replication, similar to MySQL/Redis master‑slave.
Dual‑write with bidirectional replication, which proved complex and introduced data inconsistency.
The final chosen approach combines dual‑write without a mirror. Ordered messages are routed to a primary data center via Proxy, while unordered messages are written to the nearest data center. The Proxy fetches queue information from NameServer, and consumers pull from the local data center, ensuring simplicity and high availability. Independent clusters can also be deployed for fully isolated workloads.
Overall, the integration of RocketMQ with a Proxy layer and the described disaster‑recovery strategies enable ByteDance to handle massive microservice traffic with high reliability and performance.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
