How Didi Engineered a Full‑Link Load‑Testing Platform to Safeguard Millions of Daily Rides
This article details Didi's 2016 full‑link load‑testing initiative, covering data‑isolation strategies, virtual driver/passenger tooling, trace‑based traffic marking, staged deployment tactics, and the operational insights gained from stress‑testing a massive ride‑hailing platform.
Background
Didi Chuxing, founded in 2012, grew into a leading one‑stop ride‑hailing platform handling more than ten million orders a day. That growth brought rapid expansion of its engineering systems and steadily increasing system complexity.
Why Full‑Link Load Testing?
By 2016, daily orders had jumped from one million to ten million, and frequent production failures made stability a priority. Didi therefore launched a full‑link load‑testing project to validate the entire business chain under realistic conditions.
Load‑Testing Plan
Tests run in the online (production) environment, with data isolation applied across the core business chain.
Data Isolation
Isolation keeps virtual orders from mixing with real orders. Without it, load‑test traffic can cause problems such as:
Fake orders corrupt driver points and coupons.
Real passengers are assigned virtual drivers.
BI reports show unexpected order spikes.
Capacity estimates become erratic.
Real data is accidentally deleted during cleanup.
The solution uses ID or flag fields to distinguish virtual from real entities.
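As a minimal sketch of ID‑ or flag‑based isolation, the snippet below treats an order as virtual if it carries an explicit flag or if either party's ID falls in a reserved range. The `Order` fields, the `VIRTUAL_ID_MIN` threshold, and the `orders_for_bi` helper are hypothetical names chosen for illustration; the article does not describe Didi's actual schema.

```python
from dataclasses import dataclass

# Hypothetical reserved ID range for virtual users (not from the article).
VIRTUAL_ID_MIN = 9_000_000_000


@dataclass(frozen=True)
class Order:
    order_id: int
    passenger_id: int
    driver_id: int
    is_test: bool = False  # explicit flag field, assumed for illustration


def is_virtual_id(user_id: int) -> bool:
    """A user is virtual if its ID falls in the reserved range."""
    return user_id >= VIRTUAL_ID_MIN


def is_virtual_order(order: Order) -> bool:
    """An order is virtual if it is flagged or involves any virtual user."""
    return (
        order.is_test
        or is_virtual_id(order.passenger_id)
        or is_virtual_id(order.driver_id)
    )


def orders_for_bi(orders):
    """Keep only real orders so reports and capacity estimates stay clean."""
    return [o for o in orders if not is_virtual_order(o)]
```

Checking both the flag and the ID range is deliberately redundant here: a consumer that sees only IDs, or only the flag, can still filter virtual traffic.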
Virtual Order Scheme
Virtual drivers and passengers are created per city, with orders marked by special identifiers. This avoids invasive changes to the dispatch logic and isolates virtual traffic from real traffic across order history, notifications, and statistics.
Virtual City Concept
To isolate traffic further, Didi can create entire virtual cities (or even a virtual country) in which all coordinates are shifted into the Pacific Ocean. This separates virtual traffic from production completely while preserving realistic routing.
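One way to picture the coordinate shift is a fixed longitude offset, sketched below. The offset value and function names are assumptions for illustration, not details from the article. A longitude‑only shift is used because it is a rotation about the Earth's axis, so great‑circle distances between points are unchanged.

```python
import math

# Hypothetical offset: moves Beijing (about 116.4 E) to about 143.6 W,
# which lies in the North Pacific.
LON_OFFSET_DEG = 100.0


def _wrap_lon(lon: float) -> float:
    """Normalize a longitude to the range [-180, 180)."""
    return (lon + 180.0) % 360.0 - 180.0


def to_virtual(lat: float, lon: float):
    """Map a real coordinate into the virtual city."""
    return lat, _wrap_lon(lon + LON_OFFSET_DEG)


def to_real(lat: float, lon: float):
    """Inverse mapping, e.g. for replaying real traces or debugging."""
    return lat, _wrap_lon(lon - LON_OFFSET_DEG)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a spherical Earth."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```

Because trip distances are preserved, distance‑based logic such as pricing or ETA estimation would see the same inputs in the virtual city as in the real one, under the assumptions above.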
Traffic Marking via Trace
Didi's internal trace system (similar to Google Dapper) tracks call chains. Two marking options were considered: using business IDs in each service or extending the trace channel to carry a load‑testing flag. The trace‑based approach was chosen for full decoupling.
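The trace‑based option can be illustrated with a small propagation sketch: the flag is read from inbound request metadata into request‑scoped context, then copied onto every outbound call, so business code never handles it directly. The header name `x-load-test` and all function names are hypothetical; the article does not describe the internal trace system's API.

```python
import contextvars

# Hypothetical header name carried alongside the trace context.
LOAD_TEST_HEADER = "x-load-test"

_load_test = contextvars.ContextVar("load_test", default=False)


def extract(headers: dict):
    """Inbound side: read the flag from request headers into the context."""
    return _load_test.set(headers.get(LOAD_TEST_HEADER) == "1")


def inject(headers: dict) -> dict:
    """Outbound side: copy the flag onto headers for the next hop."""
    out = dict(headers)
    if _load_test.get():
        out[LOAD_TEST_HEADER] = "1"
    return out


def is_load_test() -> bool:
    """Lets storage or messaging layers route virtual traffic separately."""
    return _load_test.get()


def handle_request(headers: dict, downstream):
    """Middleware-style wrapper: extract, call downstream, then restore."""
    token = extract(headers)
    try:
        return downstream(inject({}))
    finally:
        _load_test.reset(token)
```

With this shape, only the RPC middleware touches the flag, which is the decoupling the trace‑based approach aims for.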
Tool‑Side Implementation
Virtual driver and passenger clients simulate massive users. Drivers use TCP long connections for order dispatch; passengers use HTTP/HTTPS. Distributed agents fetch user profiles from a data center to avoid duplicate logins.
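To show how distributed agents might avoid logging in the same virtual user twice, the sketch below has a central pool hand out disjoint batches of accounts. It assumes the "data center" acts as a central account store; `AccountPool` and its methods are hypothetical, and a real deployment would expose this over a network service rather than an in‑process object.

```python
import threading


class AccountPool:
    """Central pool that leases disjoint batches of virtual accounts."""

    def __init__(self, account_ids):
        self._free = list(account_ids)
        self._leased = {}  # agent_id -> list of leased account IDs
        self._lock = threading.Lock()

    def lease(self, agent_id: str, count: int):
        """Give an agent up to `count` accounts that no other agent holds."""
        with self._lock:
            batch, self._free = self._free[:count], self._free[count:]
            self._leased.setdefault(agent_id, []).extend(batch)
            return batch

    def release(self, agent_id: str):
        """Return an agent's accounts to the pool, e.g. when it shuts down."""
        with self._lock:
            self._free.extend(self._leased.pop(agent_id, []))
```

For example, two agents that each call `lease` with a count of 4 on a pool of ten accounts receive non‑overlapping sets, and a third agent receives only the two that remain.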
Dynamic Business Model
The model allows on‑the‑fly adjustment of order type ratios (e.g., local vs. inter‑city rides) without code changes, supporting diverse scenarios such as peak‑hour versus holiday traffic.
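A dynamic order mix can be sketched as a weighted random choice whose weights are replaceable at runtime. The class name, order‑type labels, and ratios below are illustrative assumptions, not values from the article.

```python
import random
import threading


class OrderMix:
    """Order-type ratios that can be swapped while a test is running."""

    def __init__(self, ratios: dict):
        self._lock = threading.Lock()
        self._ratios = {}
        self.update(ratios)

    def update(self, ratios: dict):
        """Replace the ratios, e.g. when an operator changes the scenario."""
        if not ratios or any(w < 0 for w in ratios.values()):
            raise ValueError("ratios must be non-empty and non-negative")
        if sum(ratios.values()) <= 0:
            raise ValueError("at least one ratio must be positive")
        with self._lock:
            self._ratios = dict(ratios)

    def next_order_type(self, rng=random) -> str:
        """Pick the type of the next virtual order according to the ratios."""
        with self._lock:
            types = list(self._ratios)
            weights = list(self._ratios.values())
        return rng.choices(types, weights=weights, k=1)[0]
```

A peak‑hour scenario might start with `OrderMix({"local": 0.9, "inter_city": 0.1})` and later call `update({"local": 0.6, "inter_city": 0.4})` to approximate holiday traffic, with no redeploy of the load generator.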
Staged Deployment Strategy
To ensure order‑matching rates, drivers and passengers are gradually injected into a virtual city, starting with a hotspot (e.g., Beijing's Dongdan) to guarantee a proportional increase in successful matches.
Load‑Testing Execution
From early 2016, Didi conducted multiple staged tests during low‑traffic windows, gradually increasing pressure while monitoring system health. Issues uncovered included API latency spikes, misconfigured long‑connection parameters, Codis timeouts, and excessive logging causing dispatch delays.
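The staged approach of raising pressure while watching system health can be sketched as a simple ramp loop that falls back to the last healthy level when a check fails. The `apply_load` and `is_healthy` callbacks are hypothetical hooks into the load generator and monitoring, and the stage values are made up for illustration.

```python
def ramp(stages, apply_load, is_healthy):
    """Apply each target load in order; roll back on the first unhealthy stage.

    Returns the highest load level that passed the health check
    (0 if even the first stage failed).
    """
    last_ok = 0
    for target in stages:
        apply_load(target)
        if not is_healthy():
            apply_load(last_ok)  # fall back to the last known-good level
            return last_ok
        last_ok = target
    return last_ok
```

In practice each stage would also hold for a soak period before the health check, so that slow‑building issues such as timeouts or log backlogs have time to surface.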
Benefits and Future Outlook
The project drove convergence of language‑specific component libraries, expanded trace coverage, and established a fully isolated online environment for future correctness verification, fault injection, gray‑release validation, and capacity planning.
Source: https://dwz.cn/suEerYg8 – Author: Zhang Xiaoqing
