Why Didi Chose RocketMQ: Lessons from Building a Scalable Message Queue Service
This article recounts Didi's journey from a chaotic mix of Kafka, RocketMQ, Redis, and other queues to a unified, high‑performance messaging platform built on Apache RocketMQ, covering the reasons for abandoning Kafka, the architecture evolution, migration strategies, performance benchmarks, and operational enhancements.
Background and History
Didi initially operated without a dedicated team for message‑queue services, using a variety of solutions such as Kafka, RocketMQ, Redis lists, and even beanstalkd, which led to maintenance difficulties and resource waste.
Why Kafka Was Dropped
Critical business services experienced severe write jitter and failures on Kafka 0.8.2 due to growing topic volume and a bug that caused excessive replica copying on mechanical disks.
Why RocketMQ Was Chosen
After extensive research and testing, Didi selected RocketMQ for its multi‑language support, better handling of migration challenges, and ability to meet special business requirements.
Architecture Evolution
In the new framework, client applications talk to a proxy layer instead of connecting to brokers directly. The proxy forwards messages to the storage engines behind it (primarily RocketMQ, with some Kafka during migration) and exposes a unified API for producers and consumers across multiple languages and protocols.
With the proxy in place, the remaining work fell into three areas:
Migration of all heterogeneous queues to the new platform.
Feature iteration and cost‑performance optimization.
Self‑service resource provisioning via a web console.
Performance Testing
Topic‑Count Support
Tests comparing Kafka 0.8.2 and RocketMQ 3.4.6 under a 1 Gbps network and 16 threads showed that RocketMQ’s throughput remained stable as the number of topics increased, while Kafka’s throughput degraded sharply.
Latency
Latency measurements under various Ack settings and message sizes demonstrated that RocketMQ consistently achieved sub‑millisecond latency, whereas Kafka exceeded 1 ms once throughput passed 10 k TPS.
Building Our Own Queue Service
Key challenges included supporting multiple client languages (PHP, Go, Java, C++), a small development team, lack of source‑code familiarity, tight release schedules, and high availability requirements.
Solutions involved using Thrift RPC for cross‑language compatibility, simplifying the API to two core calls (send and pull), and delegating advanced features (rate limiting, authentication, filtering, format conversion) to the proxy layer.
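The two-call design can be illustrated with a minimal sketch. It is not Didi's implementation: the class names QueueProxy and TokenBucket, the token-based authentication check, and the in-memory store standing in for the broker are all assumptions made for illustration. It only shows how a thin client API (send and pull) can leave rate limiting and authentication to the proxy.

```python
import time
from collections import defaultdict, deque


class TokenBucket:
    """Token bucket: refills `rate` tokens per second, holds at most `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill according to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class QueueProxy:
    """Hypothetical proxy exposing only two calls: send and pull."""

    def __init__(self, allowed_tokens, limiter):
        self.allowed_tokens = allowed_tokens  # stand-in for real authentication
        self.limiter = limiter
        self.store = defaultdict(deque)       # in-memory stand-in for the broker

    def _authenticate(self, auth_token):
        if auth_token not in self.allowed_tokens:
            raise PermissionError("unknown client")

    def send(self, auth_token, topic, body):
        self._authenticate(auth_token)
        if not self.limiter.try_acquire():
            return False                      # rejected: caller should back off
        self.store[topic].append(body)
        return True

    def pull(self, auth_token, topic, max_messages=1):
        self._authenticate(auth_token)
        queue = self.store[topic]
        return [queue.popleft() for _ in range(min(max_messages, len(queue)))]
```

Because clients only ever see send and pull, features such as filtering or format conversion could be added inside the proxy without shipping new client libraries for each language.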
Migration Strategies
Dual‑Write
Producers write simultaneously to Kafka and RocketMQ via a proxy, ensuring full data parity during migration; after verification, the Kafka side can be decommissioned.
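A dual-write proxy along these lines might look like the sketch below. The backend class, the parity check, and the method names are hypothetical; real Kafka and RocketMQ clients are replaced with in-memory lists so the example stays self-contained.

```python
class InMemoryBackend:
    """Stand-in for a Kafka or RocketMQ producer client."""

    def __init__(self, name):
        self.name = name
        self.messages = []

    def send(self, topic, body):
        self.messages.append((topic, body))


class DualWriteProxy:
    """Hypothetical proxy that mirrors every write to both backends."""

    def __init__(self, old_backend, new_backend):
        self.old = old_backend
        self.new = new_backend
        self.dual_write = True

    def send(self, topic, body):
        self.new.send(topic, body)
        if self.dual_write:
            self.old.send(topic, body)

    def in_parity(self):
        # Simplified verification: both sides hold exactly the same messages.
        return self.old.messages == self.new.messages

    def decommission_old(self):
        # Only stop writing to the old queue once parity has been verified.
        if not self.in_parity():
            raise RuntimeError("backends differ; keep dual-writing")
        self.dual_write = False
```

A production version would also need to decide what to do when one of the two writes fails, which this sketch ignores.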
Dual‑Read
Consumers read from both Kafka and RocketMQ through a proxy, guaranteeing no duplicate consumption and allowing a seamless switch to RocketMQ once data is fully replicated.
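One way to avoid duplicate consumption while reading from two sources is to deduplicate on a message ID, as in the sketch below. This is an assumption, not a description of Didi's proxy: it presumes every message carries a unique ID that is identical on both queues, and the DualReadProxy class and its parameters are hypothetical.

```python
from collections import OrderedDict


class DualReadProxy:
    """Hypothetical consumer-side proxy that merges two sources and drops repeats."""

    def __init__(self, sources, handler, max_tracked=100_000):
        self.sources = sources          # callables returning lists of (msg_id, body)
        self.handler = handler          # business callback, invoked once per message
        self.seen = OrderedDict()       # recently delivered IDs, oldest first
        self.max_tracked = max_tracked

    def poll_once(self):
        delivered = 0
        for pull in self.sources:
            for msg_id, body in pull():
                if msg_id in self.seen:
                    continue            # already delivered from the other queue
                self.seen[msg_id] = True
                if len(self.seen) > self.max_tracked:
                    self.seen.popitem(last=False)  # evict the oldest tracked ID
                self.handler(body)
                delivered += 1
        return delivered
```

The bounded ID window keeps memory use fixed, at the cost of missing a duplicate that arrives after its ID has been evicted.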
RocketMQ Extensions and Operational Experience
Automatic Master‑Slave Failover – Didi added a custom failover mechanism because the open‑source broker lacks automatic role switching.
Batch Production Support – Extended RocketMQ’s batch API to handle multiple topics and consume queues, enabling efficient bulk publishing.
Metadata Management – Refactored the metadata layer to allow a single broker to manage up to a million topics, far beyond the default tens‑of‑thousands.
Reading Old Data – Enabled slaveReadEnable so consumers are redirected to a slave when the data they request lags the newest message by more than a configurable share of physical memory, keeping cold reads from adding disk‑I/O pressure on the master.
Expired Data Deletion – Adjusted fileReservedTime, deleteWhen, and deletion intervals to spread I/O load and avoid spikes during nightly cleanup.
Index Management – Disabled indexing on masters and enabled it only on slaves to reduce write‑side I/O while keeping query capability.
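The slave-read rule in the Reading Old Data item above can be sketched as a small decision function. This follows my understanding of the open-source broker, where the threshold is controlled by accessMessageInMemoryMaxRatio (a percentage, 40 by default); treat that parameter name, the default, and the function below as assumptions rather than a description of Didi's modified broker.

```python
def suggest_pull_from_slave(max_commitlog_offset, requested_offset,
                            total_physical_memory,
                            access_in_memory_max_ratio=40,
                            slave_read_enable=True):
    """Return True when a pull request should be redirected to a slave.

    All offsets and sizes are in bytes. The idea: if the requested data is so
    far behind the newest message that it is unlikely to still be in the page
    cache, serving it from the master would cause disk reads there.
    """
    if not slave_read_enable:
        return False
    lag_bytes = max_commitlog_offset - requested_offset
    memory_budget = total_physical_memory * access_in_memory_max_ratio / 100.0
    return lag_bytes > memory_budget


GIB = 1024 ** 3
# 64 GiB host, 40% budget (25.6 GiB): a consumer 30 GiB behind goes to the slave.
far_behind = suggest_pull_from_slave(100 * GIB, 70 * GIB, 64 * GIB)
# A consumer only 10 GiB behind keeps reading from the master.
near_head = suggest_pull_from_slave(100 * GIB, 90 * GIB, 64 * GIB)
```

Under these assumptions, far_behind is True and near_head is False.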
