Why Didi Chose RocketMQ: Lessons from Building a Scalable Message Queue Service

This article recounts Didi's journey from a chaotic mix of Kafka, RocketMQ, Redis, and other queues to a unified, high‑performance messaging platform built on Apache RocketMQ, covering the reasons for abandoning Kafka, the architecture evolution, migration strategies, performance benchmarks, and operational enhancements.

Java Backend Technology

Background and History

Didi initially operated without a dedicated team for message‑queue services, using a variety of solutions such as Kafka, RocketMQ, Redis lists, and even beanstalkd, which led to maintenance difficulties and resource waste.

Why Kafka Was Dropped

Critical business services experienced severe write jitter and failures on Kafka 0.8.2 due to growing topic volume and a bug that caused excessive replica copying on mechanical disks.

Why RocketMQ Was Chosen

After extensive research and testing, Didi selected RocketMQ for its multi‑language support, better handling of migration challenges, and ability to meet special business requirements.

Architecture Evolution

In the new framework, client applications talk to a proxy layer instead of connecting to the brokers directly. The proxy forwards messages to the underlying storage (primarily RocketMQ, with some Kafka during migration) and exposes unified APIs for producers and consumers across multiple languages and protocols.

The evolution covered three areas of work: migrating all heterogeneous queues to the new platform, iterating on features while optimizing cost‑performance, and offering self‑service resource provisioning through a web console.

Performance Testing

Topic‑Count Support

Tests comparing Kafka 0.8.2 and RocketMQ 3.4.6 under a 1 Gbps network and 16 threads showed that RocketMQ’s throughput remained stable as the number of topics increased, while Kafka’s throughput degraded sharply.

Latency

Latency measurements under various Ack settings and message sizes demonstrated that RocketMQ consistently achieved sub‑millisecond latency, whereas Kafka exceeded 1 ms once throughput passed 10 k TPS.

Building Our Own Queue Service

Key challenges included supporting multiple client languages (PHP, Go, Java, C++), a small development team, lack of source‑code familiarity, tight release schedules, and high availability requirements.

Solutions involved using Thrift RPC for cross‑language compatibility, simplifying the API to two core calls (send and pull), and delegating advanced features (rate limiting, authentication, filtering, format conversion) to the proxy layer.
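As a rough illustration of how small a two-call surface can be, here is a minimal Java sketch. It is not Didi's Thrift interface, which the source does not show: the class name QueueProxySketch, the method signatures, and the in-memory list standing in for the storage engine are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a two-call proxy surface (send and pull). An in-memory
// list stands in for the storage engine that sits behind the proxy.
class QueueProxySketch {
    // topic -> ordered message log
    private final Map<String, List<String>> logs = new HashMap<>();
    // "group@topic" -> next offset to hand out to that consumer group
    private final Map<String, Integer> offsets = new HashMap<>();

    // send: append a message and return its offset within the topic.
    synchronized int send(String topic, String body) {
        List<String> log = logs.computeIfAbsent(topic, t -> new ArrayList<>());
        log.add(body);
        return log.size() - 1;
    }

    // pull: return up to maxCount unread messages for the group and advance its offset.
    synchronized List<String> pull(String topic, String group, int maxCount) {
        List<String> log = logs.getOrDefault(topic, new ArrayList<>());
        String key = group + "@" + topic;
        int from = offsets.getOrDefault(key, 0);
        int to = Math.min(log.size(), from + Math.max(0, maxCount));
        offsets.put(key, to);
        return new ArrayList<>(log.subList(from, to));
    }

    public static void main(String[] args) {
        QueueProxySketch proxy = new QueueProxySketch();
        proxy.send("orders", "order-1");
        proxy.send("orders", "order-2");
        System.out.println(proxy.pull("orders", "billing", 10));
    }
}
```

Because clients only ever see these two calls, concerns such as rate limiting or authentication can be added inside the proxy without touching the client libraries in each language.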

Migration Strategies

Dual‑Write

Producers write simultaneously to Kafka and RocketMQ via a proxy, ensuring full data parity during migration; after verification, the Kafka side can be decommissioned.
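The dual-write idea can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the proxy's actual code: the Sink interface is a hypothetical stand-in for real Kafka and RocketMQ producer clients, and recording one-sided failures in a list is just one simple way to make parity checkable before the Kafka side is retired.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical dual-write wrapper: every message goes to both the legacy queue
// and the new one, and any one-sided failure is recorded for reconciliation.
class DualWriteSketch {
    // Minimal stand-in for a real producer client (an assumption, not a real API).
    interface Sink {
        void write(String topic, String body) throws Exception;
    }

    private final Sink legacy; // e.g. the Kafka side
    private final Sink target; // e.g. the RocketMQ side
    private final List<String> parityGaps = new ArrayList<>();

    DualWriteSketch(Sink legacy, Sink target) {
        this.legacy = legacy;
        this.target = target;
    }

    // Attempts both writes; returns true only when both succeeded.
    synchronized boolean send(String topic, String body) {
        boolean legacyOk = tryWrite(legacy, topic, body);
        boolean targetOk = tryWrite(target, topic, body);
        if (legacyOk != targetOk) {
            parityGaps.add(topic + ":" + body); // one side is missing this message
        }
        return legacyOk && targetOk;
    }

    private static boolean tryWrite(Sink sink, String topic, String body) {
        try {
            sink.write(topic, body);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    // Messages that reached only one side; this should be empty before cut-over.
    synchronized List<String> parityGaps() {
        return new ArrayList<>(parityGaps);
    }

    public static void main(String[] args) {
        List<String> kafka = new ArrayList<>();
        List<String> rocketmq = new ArrayList<>();
        DualWriteSketch producer =
            new DualWriteSketch((t, b) -> kafka.add(b), (t, b) -> rocketmq.add(b));
        producer.send("orders", "order-1");
        System.out.println(kafka.equals(rocketmq) && producer.parityGaps().isEmpty());
    }
}
```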

Dual‑Read

Consumers read from both Kafka and RocketMQ through a proxy, guaranteeing no duplicate consumption and allowing a seamless switch to RocketMQ once data is fully replicated.
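The source does not say how the proxy avoids handing the same message to a consumer twice. One common approach, shown below purely as an assumption, is to deduplicate on a message ID that is identical on both queues and to remember only a bounded window of recent IDs. The Source interface, the Msg class, and the window size are hypothetical names chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical dual-read consumer: polls both queues and drops any message
// whose ID has already been delivered. Assumes IDs match across both queues.
class DualReadSketch {
    static final class Msg {
        final String id;
        final String body;

        Msg(String id, String body) {
            this.id = id;
            this.body = body;
        }
    }

    // Minimal stand-in for a real consumer client (an assumption, not a real API).
    interface Source {
        List<Msg> poll();
    }

    private final Source legacy; // e.g. the Kafka side
    private final Source target; // e.g. the RocketMQ side
    private final Map<String, Boolean> seen; // bounded memory of delivered IDs

    DualReadSketch(Source legacy, Source target, final int maxRemembered) {
        this.legacy = legacy;
        this.target = target;
        // Insertion-ordered map that evicts the oldest ID once the window is full.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxRemembered;
            }
        };
    }

    // Polls both sides once and returns only bodies not delivered before.
    synchronized List<String> pollOnce() {
        List<String> out = new ArrayList<>();
        for (Source source : new Source[] {legacy, target}) {
            for (Msg m : source.poll()) {
                if (seen.put(m.id, Boolean.TRUE) == null) {
                    out.add(m.body);
                }
            }
        }
        return out;
    }
}
```

The bounded window is a trade-off: a duplicate that arrives after its ID has been evicted would slip through, so the window has to be sized for the largest expected lag between the two queues.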

RocketMQ Extensions and Operational Experience

Automatic Master‑Slave Failover – Didi added a custom failover mechanism because the open‑source broker it was running did not switch master and slave roles automatically.

Batch Production Support – Extended RocketMQ’s batch API to handle multiple topics and consume queues, enabling efficient bulk publishing.
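The source does not describe how the batch extension is implemented. To show the kind of bookkeeping a multi-topic batch involves, the sketch below groups a mixed batch by topic and queue so that each group could be written in one pass. The Msg class, the "topic#queueId" key format, and the method name are assumptions made for illustration, and no RocketMQ API is used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that splits a mixed batch into per-(topic, queue) groups.
class BatchGroupingSketch {
    static final class Msg {
        final String topic;
        final int queueId;
        final String body;

        Msg(String topic, int queueId, String body) {
            this.topic = topic;
            this.queueId = queueId;
            this.body = body;
        }
    }

    // Groups bodies under "topic#queueId", preserving arrival order within
    // each group and the order in which groups first appear.
    static Map<String, List<String>> groupByTopicAndQueue(List<Msg> batch) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Msg m : batch) {
            String key = m.topic + "#" + m.queueId;
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(m.body);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Msg> batch = new ArrayList<>();
        batch.add(new Msg("orders", 0, "o1"));
        batch.add(new Msg("payments", 1, "p1"));
        batch.add(new Msg("orders", 0, "o2"));
        System.out.println(groupByTopicAndQueue(batch));
    }
}
```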

Metadata Management – Refactored the metadata layer to allow a single broker to manage up to a million topics, far beyond the default tens‑of‑thousands.

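Reading Old Data – Enabled slaveReadEnable so that, when a consumer requests data that lags the newest message by more than a configurable fraction of physical memory (and is therefore unlikely to still be in the page cache), it is directed to fetch from a slave, which relieves disk‑I/O pressure on the master.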
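The decision rule can be written as a small pure function. The sketch below is a paraphrase of the idea rather than the broker's source code: the method name and parameters are hypothetical, and the ratio is expressed as a percentage of physical memory, which is how the open-source broker's accessMessageInMemoryMaxRatio setting is commonly described.

```java
// Hypothetical restatement of the "read old data from a slave" rule.
class SlaveReadSketch {
    // Returns true when the requested data is so far behind the newest data
    // that it is unlikely to be in memory, so a slave should serve it instead.
    static boolean suggestPullFromSlave(boolean slaveReadEnable,
                                        long newestOffsetBytes,
                                        long requestedOffsetBytes,
                                        long physicalMemoryBytes,
                                        int inMemoryMaxRatioPercent) {
        if (!slaveReadEnable) {
            return false; // feature switched off: always read from the master
        }
        long lagBytes = newestOffsetBytes - requestedOffsetBytes;
        long memoryBudget = (long) (physicalMemoryBytes * (inMemoryMaxRatioPercent / 100.0));
        return lagBytes > memoryBudget;
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        // 64 GiB of RAM and a 40% ratio: compare a 10 GiB lag with a 30 GiB lag.
        System.out.println(suggestPullFromSlave(true, 100 * gib, 90 * gib, 64 * gib, 40));
        System.out.println(suggestPullFromSlave(true, 100 * gib, 70 * gib, 64 * gib, 40));
    }
}
```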

Expired Data Deletion – Adjusted fileReservedTime, deleteWhen, and deletion intervals to spread I/O load and avoid spikes during nightly cleanup.

Index Management – Disabled indexing on masters and enabled it only on slaves to reduce write‑side I/O while keeping query capability.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
