How We Built a Dual‑Center, High‑Availability RocketMQ Platform
This article explains why RocketMQ was chosen, describes its large‑scale usage, details the design and implementation of a same‑city dual‑center architecture with near‑by production and consumption, outlines failover mechanisms, governance practices, lessons learned, and future plans for the messaging platform.
Background
RocketMQ was chosen over other mature MQ solutions (RabbitMQ, ActiveMQ, NSQ, Kafka) because it is a pure Java implementation with no external dependencies, demonstrated high performance and stability during Alibaba’s Double‑11 event, provides practical messaging features (synchronous, asynchronous, one‑way, delayed sending; message reset, retry queues, dead‑letter queues), and benefits from an active community for rapid issue resolution.
Usage
Primarily used for peak‑shaving, decoupling, and asynchronous processing.
Deployed in core services such as train‑ticket, flight‑ticket, and hotel booking systems, handling massive traffic from the WeChat entry point.
Extensively used in payment, order, ticketing, and data‑synchronization workflows.
Processes over 1000 billion messages per day .
Dual‑Center Transformation
A single‑data‑center network failure previously caused significant business impact, prompting the introduction of a same‑city dual‑center architecture to guarantee high availability.
Why Dual‑Center
Ensures business continuity when a single data center fails.
Protects data reliability; a single‑center failure could lead to data loss.
Enables horizontal scaling by distributing traffic across multiple data centers.
Dual‑Center Options
Two schemes were evaluated:
Cold (hot) backup: Two independent RocketMQ clusters; the primary cluster writes data and synchronizes it in real time to a standby cluster using the RocketMQ Replicator. Metadata such as topics, consumer groups, and offsets must be periodically synced.
Active‑active: Two independent clusters receive traffic in their respective data centers without data synchronization. If one data center fails, traffic is switched to the other, and messages continue to be produced there.
Domain‑name handling was a key concern. Using a single domain that resolves dynamically to each data center requires producers and consumers to be co‑located, which is impractical. Assigning separate domains per cluster would require extensive client changes. The final solution adopts a single **Global MQ** cluster spanning both data centers, preserving the original domain name and requiring only a client SDK upgrade.
Near‑by Principle
To minimize latency, producers and consumers should operate in the same Internet Data Center (IDC). Two methods determine the IDC of a node:
IP lookup: At startup, a node obtains its own IP and queries an internal service that maps the IP to an IDC.
Environment awareness: During provisioning, metadata (e.g., a logicIdcUK file) records the IDC identifier. The node reads this file at startup, avoiding external dependencies.
Production & Consumption Logic
Production: Prefer the local IDC broker; if the local broker is unavailable, fall back to the remote IDC broker.
Consumption: An IDC‑aware queue allocation algorithm distributes messages evenly among consumers in the same IDC. If an IDC has no consumers, its share is redistributed to other IDC consumers.
Pseudocode for the allocation logic:
Map<String, Set<MQ>> mqs = classifyMQByIdc(mqAll);
Map<String, Set<ClientId>> cids = classifyCidByIdc(cidAll);
Set<MQ> result = new HashSet<>();
for (String idc : mqs.keySet()) {
result.addAll(allocateMQAveragely(mqs.get(idc), cids.get(idc), currentClientId));
}Two consumption deployment models exist:
Single‑side deployment: Consumers pull all messages from every IDC.
Dual‑side deployment: Consumers pull only messages from their own IDC; capacity planning must ensure sufficient consumer instances per IDC.
Failure Handling
Each broker group follows a “one master, two slaves” pattern: one master and one slave reside in one IDC, the second slave resides in the other IDC. A write is considered successful once a majority of slaves acknowledge the message.
If a master fails, a Nameserver (metadata service) detects the fault and initiates failover using Raft for leader election among three Nameserver nodes (two IDC‑based and one cloud‑based). The elected leader selects the most up‑to‑date slave as the new master, registers it in the metadata system, and blocks the old master from receiving new messages.
Failover Drills
Operational drills shift all traffic to the secondary data center, then return to the dual‑center setup, and finally back to the primary center. This verifies that each center can independently handle the full user load.
MQ Platform Governance
Even a high‑performance, highly available system can incur maintenance overhead and operational risk if used improperly. Governance measures are therefore essential.
Governance Areas
Topic/Consumer‑Group Management: Automatic creation is disabled; usage must be requested and recorded to enable rapid owner identification.
Production Throttling: Flow‑control limits the production speed of each topic to prevent resource contention.
Message Backlog Monitoring: Thresholds for backlog size and age trigger alerts to the responsible team.
Consumer Node Health: Offline or unresponsive consumer nodes generate notifications.
Client Latency Monitoring: Send/consume latency and message size (>10 KB) are tracked; large messages prompt compression or redesign.
Message Traceability: IP, timestamps, and server‑side statistics are recorded to reconstruct a message’s lifecycle via msgId or custom keys.
SDK Version Checks: SDK versions are reported periodically; outdated or risky versions are flagged for upgrade.
Cluster Health Inspection: Periodic checks of node count, write/read TPS, and simulated traffic assess overall health.
Cluster Performance Inspection: Processing‑time distribution is analyzed; prolonged latency triggers hardware‑level alerts (disk I/O, CPU load, etc.).
High Availability: Master failures automatically promote a slave to master, preserving message order and cluster integrity.
Lessons Learned
Compatibility issues arose when new and old consumer versions co‑existed; a unified queue allocation algorithm resolved this.
Large numbers of topics and consumer groups caused registration latency and OOM errors; compression and upstream community fixes mitigated the problem.
Inconsistent topic‑length validation led to message loss on restart; the issue has been fixed upstream.
Broker processes could hang on CentOS 6.6; upgrading the OS eliminated the problem.
Future Outlook
Current message retention is short, hindering troubleshooting and data forecasting. Planned improvements include:
Archiving historical messages for long‑term analysis.
Separating storage from compute to enable scalable analytics.
Leveraging archived data for advanced predictive models.
Upgrading the server to DLedger to guarantee strict message consistency.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
