
Didi's Message Queue Architecture, Migration Strategies, and RocketMQ Operational Practices

At Didi, the team replaced a chaotic mix of Kafka, Redis, and other queues with a custom, RocketMQ‑based service, using dual‑write and dual‑read migration, extensive performance testing, custom failover, batch extensions, and operational tweaks to achieve stable high‑throughput, low‑latency messaging at massive scale.

Didi Tech

This article is based on a talk by Jiang Haiting, the head of Didi's message queue team and an Apache RocketMQ contributor, at the Apache RocketMQ Developer Salon in Beijing.

The talk covered:

1. Thoughts on message-queue technology selection
2. Why RocketMQ was chosen for Didi's business
3. How to build a custom message-queue service
4. Practical extensions and refactoring on RocketMQ
5. Experience of using RocketMQ in production

1. History and Motivation

Initially Didi used many different queue solutions (Kafka, RocketMQ, Redis list, beanstalkd) without a dedicated team, which caused chaos and resource waste.

Kafka 0.8.2 showed severe write-latency spikes and throughput degradation as the topic count grew, mainly due to a bug in that version combined with the use of mechanical disks.

These problems motivated Didi to build its own message‑queue service.

2. Why RocketMQ

After extensive research and testing, RocketMQ was selected for its stability, scalability, and better performance in both throughput and latency tests compared with Kafka.

2.1 Topic‑support test

Test environment: Kafka 0.8.2, RocketMQ 3.4.6, 1 Gbps network, 16 threads.

Results showed that RocketMQ’s throughput remained stable when the number of topics increased, while Kafka’s throughput dropped sharply after 256 topics, especially for large messages (2 KB).

2.2 Latency test

Both systems were tested with different ack settings and message sizes. RocketMQ consistently achieved lower latency (sub‑millisecond for up to 40 k TPS) than Kafka, which exceeded 1 ms after 10 k TPS.
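A latency test of this kind reduces to timing each acknowledged send and reading off percentiles. A minimal, broker-agnostic sketch of the measurement loop, where `send_fn` is a hypothetical stand-in for a real producer's synchronous send call, not part of either client's API:

```python
import time
import random

def measure_send_latency(send_fn, num_messages=1000):
    """Time each synchronous send and return latency percentiles in ms."""
    samples = []
    for _ in range(num_messages):
        start = time.perf_counter()
        send_fn()  # blocks until the broker acks, per the chosen ack setting
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    pct = lambda p: samples[min(len(samples) - 1, int(len(samples) * p / 100))]
    return {"p50": pct(50), "p99": pct(99), "max": samples[-1]}

# Stand-in for a real producer: simulate a 0.2-0.5 ms ack round-trip.
stats = measure_send_latency(
    lambda: time.sleep(random.uniform(0.0002, 0.0005)), num_messages=200
)
```

With a real client, `send_fn` would be the blocking send of the Kafka or RocketMQ producer under the ack setting being compared.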

3. Architecture Evolution

The current architecture consists of multi‑language client SDKs (7 languages), a production proxy, a storage layer (RocketMQ as the main engine, Kafka during migration, and Chronos for delayed messages), and a consumption proxy that supports multiple protocols (HTTP/REST, Groovy scripts for Redis/HBase/HDFS, etc.).

Management consoles (user and ops) allow topic creation, resource quota configuration, and cluster monitoring. All configuration changes are synchronized via ZooKeeper.

3.1 Challenges

Support for PHP, Go, Java, C++ clients.

Only three developers in the team.

Limited familiarity with RocketMQ source code.

Tight rollout schedule while Kafka was still problematic.

High availability requirements.

3.2 Migration Strategies

Two main approaches were implemented:

Dual-write (producer side): the ProducerProxy writes to both Kafka and RocketMQ, ensuring full data replication during migration. Two variants exist: client-side dual-write and proxy-side dual-write.

Dual-read (consumer side): the ConsumerProxy reads from both Kafka and RocketMQ, guaranteeing no duplicate consumption and allowing a smooth switch-over.

Both approaches enable a gradual migration with minimal impact on production services.
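The two proxies above can be sketched as a producer facade that fans every message out to both backends, and a consumer facade that deduplicates by message key before delivery. The classes below are hypothetical in-memory stand-ins for illustration, not the actual Kafka or RocketMQ client APIs:

```python
class InMemoryQueue:
    """Hypothetical stand-in for a Kafka topic or RocketMQ queue."""
    def __init__(self):
        self.messages = []
    def send(self, key, value):
        self.messages.append((key, value))
    def poll(self):
        msgs, self.messages = self.messages, []
        return msgs

class DualWriteProducerProxy:
    """Fan every message out to both backends during migration."""
    def __init__(self, kafka, rocketmq):
        self.backends = (kafka, rocketmq)
    def send(self, key, value):
        for backend in self.backends:
            backend.send(key, value)

class DualReadConsumerProxy:
    """Read from both backends but deliver each message key only once."""
    def __init__(self, kafka, rocketmq):
        self.sources = (kafka, rocketmq)
        self.seen = set()
    def poll(self):
        out = []
        for source in self.sources:
            for key, value in source.poll():
                if key not in self.seen:  # drop the duplicate copy
                    self.seen.add(key)
                    out.append((key, value))
        return out

kafka, rocketmq = InMemoryQueue(), InMemoryQueue()
producer = DualWriteProducerProxy(kafka, rocketmq)
consumer = DualReadConsumerProxy(kafka, rocketmq)
producer.send("m1", "order-created")
producer.send("m2", "order-paid")
delivered = consumer.poll()  # each message arrives exactly once
```

Switch-over then amounts to dropping one backend from the producer's fan-out once consumers have caught up on the other.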

4. RocketMQ Extensions and Refactoring

Automatic master-slave failover: a custom failover mechanism was added, since the open-source broker at the time lacked automatic master-slave switching.

Batch production across topics: RocketMQ's batch API was extended to support batches spanning different topics and consumer queues.
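The open-source batch API requires every message in one batch to share a single topic, so an extension like this effectively groups a mixed message stream per destination and flushes each group as its own batch. A sketch of that grouping step, with a hypothetical `batch_send` callback standing in for the broker call:

```python
from collections import defaultdict

def batch_by_destination(messages, batch_send, max_batch=32):
    """Group (topic, queue_id, payload) messages and send one batch per
    (topic, queue_id) destination, splitting oversized groups."""
    groups = defaultdict(list)
    for topic, queue_id, payload in messages:
        groups[(topic, queue_id)].append(payload)
    batches_sent = 0
    for (topic, queue_id), payloads in groups.items():
        for i in range(0, len(payloads), max_batch):
            batch_send(topic, queue_id, payloads[i:i + max_batch])
            batches_sent += 1
    return batches_sent

sent = []
n = batch_by_destination(
    [("orders", 0, "a"), ("payments", 1, "b"), ("orders", 0, "c")],
    lambda topic, qid, batch: sent.append((topic, qid, batch)),
)
# Two destinations -> two batches: orders gets ["a", "c"], payments gets ["b"]
```

Doing the grouping inside the broker rather than the client is what the custom extension adds over this client-side approximation.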

Metadata management: the metadata layer was optimized so that a single broker can handle up to a million topics.

5. Operational Tips

5.1 Reading old data

Enable slaveReadEnable=true on both master and slave so that consumers can fetch stale data from slaves when the offset gap exceeds the in‑memory ratio.
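As a broker.conf fragment (property name as used by the open-source broker):

```properties
# broker.conf, set identically on master and slave
slaveReadEnable=true   # redirect reads of old (out-of-memory) data to the slave
```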

5.2 Expired‑data deletion

Default retention is 72 hours with deletion at 04:00. To reduce I/O spikes, adjust deleteCommitLogFilesInterval and deleteConsumeQueueFilesInterval (e.g., 100 ms) or spread deletion across multiple hours using the deleteWhen parameter.
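The corresponding broker.conf fragment; the multi-hour deleteWhen value is an illustrative spread, not the default:

```properties
# broker.conf: retention and deletion pacing
fileReservedTime=72                    # keep data for 72 hours (default)
deleteWhen=04                          # single 04:00 deletion window (default)
# deleteWhen=01;09;17                  # or spread deletion across several hours
deleteCommitLogFilesInterval=100       # ms pause between commitlog file deletions
deleteConsumeQueueFilesInterval=100    # ms pause between consumequeue file deletions
```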

5.3 Indexing

Disable messageIndexEnable on masters and enable it only on slaves to offload index queries from the write path.
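Split across the two roles, the fragment looks like:

```properties
# master broker.conf: keep index building off the write path
messageIndexEnable=false

# slave broker.conf: serve message-key index queries
messageIndexEnable=true
```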

These practices help maintain high throughput and low latency in large‑scale production environments.
