
DevOps Practices and Challenges at Didi Ride‑Hailing: From Development to Operations

Didi's ride-hailing R&D team tackles the efficiency and stability challenges of a large microservice ecosystem on three fronts. In development, it unifies the Go stack, a common framework, and data models. In testing, it records production traffic with eBPF for automated regression testing. In operations, it applies AIOps alert filtering, knowledge-graph root-cause analysis, and a fault-localization robot for rapid recovery. The longer-term target is full CI/CD automation backed by static analysis, a service mesh, observability, and chaos engineering.

Didi Tech

Efficiency and system stability are perpetual concerns for any R&D team. Efficiency determines the speed of business iteration, while stability determines delivery quality. Didi has accumulated extensive practice in improving both, and the recent Gopher China talk by Wei Jingwu, head of ride‑hailing R&D efficiency and stability, shares these experiences.

The talk begins by describing the tension between handling incidents under heavy pressure and meeting new business demands. The speaker argues that DevOps is the breakthrough, as it covers the entire delivery pipeline.

Key challenges are divided into business and technical layers. Business challenges stem from a complex product portfolio (e.g., various ride‑hailing services) that translates into a massive micro‑service architecture. Technical challenges arise from large‑scale cloud migration, multi‑region active‑active deployments, and the exponential growth of services (from dozens to thousands), which amplifies any small issue.

To cope, Didi adopts three unifications in the development stage:

Unified Go stack – Go was chosen for its performance and existing internal adoption, easing migration from PHP.

Unified framework – Non‑business logic is encapsulated in a shared framework, with Thrift IDL extensions to support HTTP and gRPC while preserving compatibility.

Unified data handling – Standardized data models reduce duplicated effort and improve governance.

In the testing stage, Didi tackles two major problems: building a scalable test environment for thousands of services, and achieving high-coverage regression testing. The solution is traffic recording and replay. Kernel-level eBPF hooks (Cilium) record Go and C++ services, while CGO plus LD_PRELOAD covers PHP. Each recorded session captures the full request/response context, including downstream RPC, MySQL, and Redis calls, so a request can be replayed precisely and its result diffed against what production returned.
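
The replay side of this idea can be sketched in a few lines. The example below is a simplified illustration, not Didi's recorder: it assumes a session has already been captured, and the types `Session`, `Call`, `Downstream`, and `Replay` are hypothetical. Outbound calls are answered from the recording instead of the network, and the new response is compared with the recorded one.

```go
package main

import "fmt"

// Call is one recorded outbound interaction (RPC, MySQL, Redis, ...).
type Call struct{ Protocol, Request, Response string }

// Session is one recorded inbound request plus everything it triggered.
type Session struct {
	Request  string
	Response string
	Outbound []Call
}

// Downstream is how the service under test reaches its dependencies.
type Downstream interface {
	Do(protocol, request string) (string, error)
}

// replayer answers outbound calls from the recording.
type replayer struct {
	recorded map[string]string
	missed   []string // calls the new code made that were never recorded
}

func newReplayer(s Session) *replayer {
	r := &replayer{recorded: make(map[string]string)}
	for _, c := range s.Outbound {
		r.recorded[c.Protocol+"|"+c.Request] = c.Response
	}
	return r
}

func (r *replayer) Do(protocol, request string) (string, error) {
	key := protocol + "|" + request
	resp, ok := r.recorded[key]
	if !ok {
		r.missed = append(r.missed, key)
		return "", fmt.Errorf("no recorded call for %s", key)
	}
	return resp, nil
}

// Service is the code under test.
type Service func(req string, d Downstream) (string, error)

// Result is the outcome of replaying one session.
type Result struct {
	Match  bool
	Got    string
	Missed []string
}

// Replay runs svc against the recording and diffs the response.
func Replay(s Session, svc Service) Result {
	r := newReplayer(s)
	got, err := svc(s.Request, r)
	if err != nil {
		got = "error: " + err.Error()
	}
	return Result{Match: got == s.Response && len(r.missed) == 0, Got: got, Missed: r.missed}
}

func main() {
	s := Session{
		Request:  "GET /fare?trip=42",
		Response: "fare=18.50",
		Outbound: []Call{{"mysql", "SELECT base FROM fares WHERE trip=42", "18.50"}},
	}
	current := func(req string, d Downstream) (string, error) {
		base, err := d.Do("mysql", "SELECT base FROM fares WHERE trip=42")
		if err != nil {
			return "", err
		}
		return "fare=" + base, nil
	}
	fmt.Printf("%+v\n", Replay(s, current))
}
```

This sketch only flags outbound calls that are missing from the recording. A fuller version would also report recorded calls that the new code no longer makes.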

Recorded traffic is filtered, indexed in Elasticsearch, and used to generate test cases automatically. Engineers can query for specific scenarios and replay them, and the resulting coverage differs from that of live traffic by only 2-3%.
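
Scenario lookup can be pictured as a tag query over recorded sessions. The sketch below uses a small in-memory inverted index as a stand-in for Elasticsearch; the `Index` type and the tag names are assumptions made for illustration, not the production schema.

```go
package main

import (
	"fmt"
	"sort"
)

// Index maps a scenario tag to the set of session IDs that carry it.
type Index struct {
	postings map[string]map[string]bool
}

func NewIndex() *Index {
	return &Index{postings: make(map[string]map[string]bool)}
}

// Add registers a recorded session under each of its tags.
func (ix *Index) Add(sessionID string, tags ...string) {
	for _, t := range tags {
		if ix.postings[t] == nil {
			ix.postings[t] = make(map[string]bool)
		}
		ix.postings[t][sessionID] = true
	}
}

// Query returns, in sorted order, the sessions that carry every given tag.
func (ix *Index) Query(tags ...string) []string {
	if len(tags) == 0 {
		return nil
	}
	var ids []string
	for id := range ix.postings[tags[0]] {
		all := true
		for _, t := range tags[1:] {
			if !ix.postings[t][id] {
				all = false
				break
			}
		}
		if all {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	ix := NewIndex()
	ix.Add("s1", "product:express", "city:beijing")
	ix.Add("s2", "product:express", "city:shanghai")
	ix.Add("s3", "product:carpool", "city:beijing")
	// Sessions selected here would be handed to the replayer as test cases.
	fmt.Println(ix.Query("product:express", "city:beijing"))
}
```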

In operations, Didi faces alarm overload and slow root-cause localization. The company applies AIOps, using machine learning for anomaly detection, metric decomposition, and automated alert generation. Root-cause analysis combines knowledge-graph construction, trace data, metric correlation, and change-event tracking. Automated alert suppression merges similar alerts and filters out noise based on context.

Fault recovery is supported by a fault-localization robot named Donghai Longwang. It quickly retrieves the relevant information (error codes, traces, RPC details) and can trigger automated traffic diversion when a specific region is identified as faulty.
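
The traffic-diversion step can be thought of as reweighting regions once one of them is marked faulty. The sketch below assumes a proportional-redistribution policy and a hypothetical `Divert` function; the talk does not describe the actual policy, so treat this as one plausible rule.

```go
package main

import (
	"errors"
	"fmt"
)

// Divert returns new traffic weights in which the faulty region receives
// nothing and its share is spread over the remaining regions in proportion
// to their current weights. The total weight is preserved.
func Divert(weights map[string]float64, faulty string) (map[string]float64, error) {
	if _, ok := weights[faulty]; !ok {
		return nil, fmt.Errorf("unknown region %q", faulty)
	}
	var total, healthy float64
	for region, w := range weights {
		total += w
		if region != faulty {
			healthy += w
		}
	}
	if healthy <= 0 {
		return nil, errors.New("no healthy region can absorb the traffic")
	}
	out := make(map[string]float64, len(weights))
	for region, w := range weights {
		if region == faulty {
			out[region] = 0
			continue
		}
		out[region] = w / healthy * total
	}
	return out, nil
}

func main() {
	// Region names and weights are made up for the example.
	weights := map[string]float64{"region-a": 0.5, "region-b": 0.25, "region-c": 0.25}
	out, err := Divert(weights, "region-a")
	if err != nil {
		fmt.Println("diversion refused:", err)
		return
	}
	fmt.Println(out)
}
```

Refusing to divert when no healthy capacity remains is a deliberate choice in this sketch: an automated action that sends traffic nowhere would make the incident worse.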

Looking forward, Didi aims to fully automate the CI/CD pipeline, from code review through deployment and rollback, by integrating static analysis, compile-time checks, framework constraints, and AIOps-driven monitoring. This automation, combined with cloud-native capabilities (service mesh, observability, chaos engineering), is expected to raise the baseline of both efficiency and stability.
