
DiDi Machine Learning Platform: From Workshop‑Style Production to Cloud‑Native Architecture

Since 2016, DiDi has evolved its machine‑learning platform from isolated, workshop‑style GPU servers to a cloud‑native, Kubernetes‑driven architecture. Along the way it unified resource management, built custom parameter‑server and serving frameworks, added GPU autotuning, and launched external SaaS offerings such as Elastic Inference and JianShu; the next step is a 3.0 stage with a unified internal‑external AI marketplace.

Didi Tech

DiDi started building its machine learning platform in 2016. Early on, each algorithm team operated in a small‑scale, workshop‑style environment using expensive GPU servers for deep‑learning workloads. While this model offered flexibility, it suffered from poor resource coordination, duplicated effort, and limited scalability.

Version 1.0 of the platform focused on unifying resource management, both offline (GPU server selection, testing, and deployment) and online (dynamic resource allocation, isolation, and scheduling). Docker provided lightweight ("weak") isolation and environment management, while Yarn was extended with GPU‑aware scheduling. The platform also introduced a shared storage layer: a high‑bandwidth parallel file system (PFS) over RoCE‑enabled Ethernet.
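To illustrate what GPU awareness adds over a CPU‑only scheduler, here is a minimal best‑fit placement sketch. The `Node` fields and the best‑fit policy are illustrative assumptions, not DiDi's actual Yarn extension:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    name: str
    free_gpus: int
    free_mem_gb: int

def schedule(task_gpus: int, task_mem_gb: int, nodes: List[Node]) -> Optional[Node]:
    """Best-fit placement: pick the node that leaves the fewest idle GPUs
    after the task lands, so large jobs can still find whole machines later."""
    fits = [n for n in nodes if n.free_gpus >= task_gpus and n.free_mem_gb >= task_mem_gb]
    if not fits:
        return None  # no capacity anywhere: the task waits in the queue
    best = min(fits, key=lambda n: n.free_gpus - task_gpus)
    best.free_gpus -= task_gpus
    best.free_mem_gb -= task_mem_gb
    return best
```

A scheduler like this treats GPUs as a first-class countable resource, which is exactly what plain CPU/memory schedulers of the era lacked.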

As the platform matured, the workshop‑style approach gave way to a centralized production model (Version 2.0). The team migrated from Yarn to Kubernetes to gain richer multi‑resource scheduling, mature container orchestration, and tighter integration with cloud services. A custom parameter server with a ring topology and an optimized RDMA All‑Reduce algorithm outperformed OpenMPI and NVIDIA NCCL2 on a 40 Gbps RoCE v2 network.
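The communication pattern behind a ring All‑Reduce can be simulated in a few lines. This sketch captures only the algorithm's structure (n−1 reduce‑scatter steps followed by n−1 all‑gather steps, each rank talking only to its ring neighbor), not the RDMA transport or DiDi's actual implementation:

```python
def ring_allreduce(values):
    """Simulate ring all-reduce over n ranks; values[r] is rank r's vector,
    pre-split into n chunks (one scalar per chunk here for simplicity)."""
    n = len(values)
    data = [list(v) for v in values]
    # Phase 1, reduce-scatter: in step s, every rank r sends chunk (r - s) mod n
    # to its ring successor, which adds it in. Sends are buffered so that each
    # step behaves as if all transfers happened simultaneously.
    for s in range(n - 1):
        sends = [((r + 1) % n, (r - s) % n, data[r][(r - s) % n]) for r in range(n)]
        for dst, c, val in sends:
            data[dst][c] += val
    # After phase 1, rank r owns the fully reduced chunk (r + 1) mod n.
    # Phase 2, all-gather: circulate the reduced chunks for n - 1 more steps.
    for s in range(n - 1):
        sends = [((r + 1) % n, (r + 1 - s) % n, data[r][(r + 1 - s) % n]) for r in range(n)]
        for dst, c, val in sends:
            data[dst][c] = val
    return data
```

The appeal for gradient aggregation is that per‑rank traffic stays constant as the ring grows, which is why the pattern maps well onto a flat high‑bandwidth fabric like RoCE.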

DiDi also built the DDL (DiDi Deep Learning) Serving framework, whose adaptive batching and RPC optimizations delivered up to three‑fold latency improvements for lightweight models. To handle the diversity of hardware, the team created the IFX deep‑learning framework, offering GPU‑native concurrency, extensive operator optimizations, and a mobile variant that outperforms TensorFlow and TensorRT on its target chips.
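The core idea of adaptive batching can be sketched as follows; `max_batch` and `max_wait_ms` are illustrative knobs, not DDL Serving's actual API:

```python
import time
from queue import Queue, Empty

def adaptive_batch(requests: Queue, max_batch: int, max_wait_ms: float):
    """Block for one request, then keep filling the batch until it is full
    or max_wait_ms has elapsed since the first arrival: quiet queues pay a
    tiny delay, busy queues fill a full batch almost immediately."""
    batch = [requests.get()]
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break
    return batch
```

Batching this way trades a bounded queueing delay for much better GPU utilization, since one large inference call amortizes kernel-launch and RPC overhead across many requests.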

For low‑level performance tuning, an autotuning toolchain supporting Kepler, Pascal, and Volta assembly was released, allowing users to submit GPU binaries and receive near‑optimal compiled code for their hardware.
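In spirit, an autotuner benchmarks candidate code variants and keeps the fastest. This is a toy exhaustive‑search version; real GPU tuners like DiDi's operate on generated assembly for specific architectures and prune the search space rather than brute‑forcing it:

```python
import time

def autotune(run_kernel, configs, warmup=2, reps=5):
    """Benchmark every candidate config and keep the fastest average runtime.
    Production tuners search tile sizes, unroll factors, register budgets,
    and instruction schedules rather than opaque config labels."""
    best_cfg, best_time = None, float("inf")
    for cfg in configs:
        for _ in range(warmup):        # warm caches before timing
            run_kernel(cfg)
        start = time.perf_counter()
        for _ in range(reps):
            run_kernel(cfg)
        avg = (time.perf_counter() - start) / reps
        if avg < best_time:
            best_cfg, best_time = cfg, avg
    return best_cfg, best_time
```

The submit‑binary, receive‑optimized‑code workflow described above wraps a loop like this behind a service boundary, so users never touch the per‑architecture details themselves.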

Beyond internal use, DiDi Cloud now offers the platform as a service. GPU resources are exposed via KVM‑based GPU passthrough for stronger isolation. The Elastic Inference Service (EIS) builds on DDL Serving to provide a turnkey model‑deployment solution that abstracts hardware details, reduces repetitive optimization work, and offers fine‑grained QPS‑based billing.
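Fine‑grained QPS billing could, for example, meter each hour at its observed peak QPS above a free allowance. This is purely a toy illustration of the idea; EIS's actual price model is not described in the article:

```python
def qps_bill(hourly_peak_qps, rate_per_qps_hour, free_qps=0):
    """Toy metered bill: charge each hour at its peak observed QPS minus a
    free allowance. Function name, parameters, and policy are hypothetical."""
    return sum(max(q - free_qps, 0) * rate_per_qps_hour for q in hourly_peak_qps)
```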

To further lower the barrier for external customers, DiDi introduced “JianShu”, a packaged solution that bundles weak‑isolation resource management, task scheduling, monitoring, and rapid service deployment, enabling other companies to adopt platform capabilities without reinventing the wheel.

Looking ahead, DiDi aims for a 3.0 stage in which the platform offers a unified internal‑external architecture; AI marketplaces for algorithms, models, and data; GUI tooling; and domain‑specific services such as face verification, speech recognition, and translation, while continuing to improve compute efficiency and cost effectiveness.

Tags: Machine Learning, Platform Engineering, Kubernetes, Resource Management, AI Infrastructure, GPU Computing
Written by Didi Tech, the official Didi technology account.