Why ByteDance Chose RocketMQ: Architecture, Proxy Design, and Disaster Recovery

This article explains ByteDance's shift to RocketMQ, detailing the business drivers, technical advantages, proxy layer implementation for microservices, encountered challenges, and the disaster‑recovery strategies adopted to ensure high availability and performance.

dbaplus Community
dbaplus Community
dbaplus Community
Why ByteDance Chose RocketMQ: Architecture, Proxy Design, and Disaster Recovery

Business Background

ByteDance's services rely heavily on micro‑services with a massive number of containers, topics, and heterogeneous languages (Python, Go, C++, Java, JS). Maintaining SDKs for various message‑queue clients was costly, and the previous queues (NSQ and Kafka) showed limitations in persistence, CPU usage, and latency under high load.

Why RocketMQ Was Chosen

RocketMQ, proven by Alibaba during Double‑11, offers high reliability, data persistence, and multi‑replica storage similar to Kafka but with an append‑only commit‑log that handles massive topics efficiently. Benchmarks showed a single node achieving 140k QPS with latency under 2 ms, outperforming NSQ and Kafka in multi‑topic scenarios. It also supports retries, concurrent consumption, dead‑letter queues, delayed messages, timestamp‑based back‑tracking, message headers for tracing, and transactional messages.

RocketMQ Deployment at ByteDance

The deployment diagram (see image) illustrates a proxy layer placed between producers and brokers. The proxy, built on gRPC (or Thrift), abstracts the underlying MQ, keeping client SDKs lightweight and reducing upgrade burdens. Producers send messages to a proxy instance discovered via service discovery; the proxy forwards them to the appropriate broker cluster. Consumer proxies handle pull logic, rebalance, and cache messages to lessen broker page‑cache pressure. This architecture resembles Didi's MQ design.

Why a Proxy Is Needed in Container & Micro‑service Environments

SDK remains simple and lightweight.

Traffic can be controlled centrally, enabling flow throttling or redirection to other data centers without SDK changes.

Connection explosion is avoided, especially for Python services that spawn many processes; a proxy reduces the number of broker connections.

Higher concurrency is achievable because consumers can be scaled beyond the number of broker queues.

The proxy layer allows seamless migration to other MQ engines without client changes.

Rebalance frequency is reduced since it occurs only between proxy and broker, not directly with consumers.

Drawbacks of Adding a Proxy

Additional CPU overhead for RPC serialization/deserialization.

Latency increase from ~2 ms to ~3 ms under 4K‑msg, 200k TPS load, which remains acceptable.

Problems Encountered During Integration

Message size limit : RPC adds a request‑size ceiling; RocketMQ’s default max message size is 4 MB, requiring configuration changes on both producer and broker.

Multiple connections : By default, producers share a single MQ client socket; isolation can be achieved via setInstanceName or the new 4.5.0 API to create multiple client instances.

Timeout settings : The extra RPC layer introduces an RTT that must be accounted for in SDK timeout semantics.

Choosing a rebalance algorithm : In dual‑data‑center deployments, default average‑distribution can assign consumers to empty queues; a machine‑room‑aware algorithm was adopted to prefer local queues.

Proxy cache for pull : Pulling messages via slaves reduces master I/O, but cache size must be tuned to avoid OOM.

End‑to‑end compression : Messages >4 KB are compressed by the producer; performing compression at the proxy would overload it, so compression stays on the client side.

Disaster‑Recovery System Construction

Four architectural options were evaluated:

Option 1 : Master‑slave across data centers with proxy‑driven traffic scheduling.

Option 2 : Single‑master with MySQL‑style master/slave replication and a Mirror‑Maker‑like component for message duplication.

Option 3 : Dual‑write with bidirectional replication, adding complexity and data inconsistency risks.

Option 4 : Dual‑write without Mirror, chosen as the final solution. Ordered messages are routed to a designated primary data center, while unordered messages are written locally. Consumers pull from the nearest data center, and a simple scheduling logic balances load.

The chosen design simplifies operations, tolerates a single data‑center failure for unordered messages, and leverages the proxy layer for seamless traffic control.

Overall, the integration of RocketMQ with a custom proxy layer enabled ByteDance to achieve high‑throughput, low‑latency messaging across a massive micro‑service ecosystem while providing flexible disaster‑recovery capabilities.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Proxydisaster recoveryRocketMQ
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.