Backend Development 14 min read

How Didi Scaled Ride‑Hailing: LBS, Long‑Connection, and Real‑Time Data Solutions

Facing explosive traffic growth in 2014, Didi’s ride‑hailing platform tackled critical challenges by redesigning its LBS architecture, replacing unstable long‑connection services with an AIO‑based framework, partitioning databases, adopting Dubbo and RocketMQ for distributed processing, and building a real‑time monitoring and data center using Storm, HBase, and custom SQL‑to‑HBase translation.

21CTO

Jan 8, 2016

How Didi Scaled Ride‑Hailing: LBS, Long‑Connection, and Real‑Time Data Solutions

LBS Bottleneck and Solution

Drivers report their GPS coordinates every few seconds, stored in MongoDB; passenger orders query nearby drivers from MongoDB; orders are pushed to drivers via a long‑connection service; drivers accept orders and start service.

MongoDB’s primary‑secondary replica set faced severe write (40k+/s) and read (10k+/s) loads, causing CPU spikes, query latency over 800 ms, reduced throughput, and replication lag. The root cause was MongoDB 2.6.4’s global write lock and LBS queries splitting into many sub‑queries, increasing lock contention.

The solution split the nation into four regions, deploying independent MongoDB clusters per region, isolating load.

Long‑Connection Service Stability

The original long‑connection service used Socket for heartbeats and message pushes but suffered instability during peak periods.

Hardware issue: a single‑queue NIC bound all I/O interrupts to one CPU core, causing 100% usage on that core while others stayed idle, leading to packet loss and LVS disconnections. Replacing it with a multi‑queue NIC resolved the problem.

Software issue: the service was built on Mina, which had coarse memory‑pool control, difficult GC, inefficient idle‑connection checks, and high CPU usage under load. It was rewritten using AIO, adding custom features such as ByteBuffer pooling, broadcast buffer reuse, TimeWheel idle‑connection detection, and priority‑based data sending.

System Distributed Refactoring

The original system consisted of a Web HTTP service and a TCP long‑connection service, with massive monolithic code causing long build times and poor scalability.

The codebase was split into three layers: business, service, and data. Strong dependencies were handled with Dubbo for RPC and service governance; weak dependencies used RocketMQ. Both Dubbo and RocketMQ are Alibaba open‑source projects, well‑understood by the team.

R&D processes, code standards, and SQL guidelines were introduced, along with service degradation mechanisms.

Wireless Open Platform (KOP)

KOP addressed client‑server communication issues such as per‑business request code changes, lack of request/response standards, uncontrolled request volume, scattered business logic, and outdated documentation.

Access control with client identifiers and secret keys for request signing.

Traffic allocation and degradation per client, city, or platform, with AB‑testing and core API protection.

Traffic analysis across client, API, IP, and user dimensions to detect malicious requests.

Real‑time API publishing without redeploying KOP, with audit mechanisms.

Real‑time monitoring of API call counts, success/failure rates, latency, with alerting on thresholds.

Real‑Time Computation and Monitoring

Built on Storm and HBase, the monitoring platform processes logs in real time, aggregates metrics per minute, and stores results in HBase. Data flow: logs → Storm bolts (parse KV) → RocketMQ → HBase (insert‑only) and MetaQ for buffering, ensuring stability under spikes.

Key components include core calculation (sum, avg, group), Storm bolts for parsing and KV processing, HBase insert‑only storage to avoid row‑lock contention, and RocketMQ buffering to handle traffic spikes.

Data Layer Refactoring

To meet growing performance demands, the system moved from single‑database tables to client‑side sharding with a custom framework. Challenges included synchronizing multiple front‑end databases to a single back‑end database and handling large‑scale offline data extraction with Sqoop, which caused instability and duplicate extraction.

Data Synchronization Platform

Implemented using open‑source Canal to capture MySQL binlog in row mode, converting it to MQ messages for downstream consumption. Features: multi‑consumer support, global/local ordering, replay at specific timestamps, link monitoring and alerts, and automated node deployment via a management console.

Real‑Time Data Center

Front‑end MySQL shards are synchronized to HBase, enabling existing applications to query via an SQL‑to‑HBase translation layer that supports secondary indexes, rowkey‑based sorting, and key hashing to distribute load across regions.

Secondary indexes are implemented client‑side (no Coprocessor) to allow bulk inserts, improving write performance. Index configuration also influences multi‑table unification, requiring unique constraints on concatenated fields.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Monitoring Real-time Processing Scalable Architecture database sharding Ride Hailing

Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.