
DiDi Machine Learning Platform: From Workshop‑Style Production to Cloud‑Native Architecture

Since 2016, DiDi has evolved its machine‑learning platform from isolated, workshop‑style GPU servers to a cloud‑native, Kubernetes‑driven architecture. Along the way it unified resource management, built custom parameter‑server and serving frameworks, added GPU autotuning, and launched external SaaS offerings such as Elastic Inference and JianShu; the next step is a 3.0 stage with a unified internal‑external AI marketplace.

Didi Tech

DiDi started building its machine learning platform in 2016. Early on, each algorithm team operated in a small‑scale, workshop‑style environment using expensive GPU servers for deep‑learning workloads. While this model offered flexibility, it suffered from poor resource coordination, duplicated effort, and limited scalability.

Version 1.0 of the platform focused on unifying resource management, both offline (GPU server selection, testing, and deployment) and online (dynamic resource allocation, isolation, and scheduling). Docker provided lightweight ("weak") isolation and environment management, while Yarn was extended with GPU‑aware scheduling. The platform also introduced a shared storage layer: a high‑bandwidth parallel file system (PFS) over RoCE‑enabled Ethernet.
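To illustrate what GPU awareness adds over a CPU‑only scheduler, here is a minimal best‑fit placement sketch. The `Node` fields and the best‑fit policy are illustrative assumptions, not DiDi's actual Yarn extension:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    name: str
    free_gpus: int
    free_mem_gb: int

def schedule(task_gpus: int, task_mem_gb: int, nodes: List[Node]) -> Optional[Node]:
    """Best-fit placement: pick the node that leaves the fewest idle GPUs
    after the task lands, so large jobs can still find whole machines later."""
    fits = [n for n in nodes if n.free_gpus >= task_gpus and n.free_mem_gb >= task_mem_gb]
    if not fits:
        return None  # no capacity anywhere: the task waits in the queue
    best = min(fits, key=lambda n: n.free_gpus - task_gpus)
    best.free_gpus -= task_gpus
    best.free_mem_gb -= task_mem_gb
    return best
```

A scheduler like this treats GPUs as a first-class countable resource, which is exactly what plain CPU/memory schedulers of the era lacked.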

As the platform matured, the workshop‑style approach gave way to a centralized production model (Version 2.0). The team migrated from Yarn to Kubernetes to gain richer multi‑resource scheduling, mature container orchestration, and tighter integration with cloud services. A custom parameter server with a ring topology and an optimized RDMA All‑Reduce algorithm outperformed OpenMPI and NVIDIA NCCL2 on a 40 Gbps RoCE v2 network.
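The communication pattern behind a ring All‑Reduce can be simulated in a few lines. This sketch captures only the algorithm's structure (n−1 reduce‑scatter steps followed by n−1 all‑gather steps, each rank talking only to its ring neighbor), not the RDMA transport or DiDi's actual implementation:

```python
def ring_allreduce(values):
    """Simulate ring all-reduce over n ranks; values[r] is rank r's vector,
    pre-split into n chunks (one scalar per chunk here for simplicity)."""
    n = len(values)
    data = [list(v) for v in values]
    # Phase 1, reduce-scatter: in step s, every rank r sends chunk (r - s) mod n
    # to its ring successor, which adds it in. Sends are buffered so that each
    # step behaves as if all transfers happened simultaneously.
    for s in range(n - 1):
        sends = [((r + 1) % n, (r - s) % n, data[r][(r - s) % n]) for r in range(n)]
        for dst, c, val in sends:
            data[dst][c] += val
    # After phase 1, rank r owns the fully reduced chunk (r + 1) mod n.
    # Phase 2, all-gather: circulate the reduced chunks for n - 1 more steps.
    for s in range(n - 1):
        sends = [((r + 1) % n, (r + 1 - s) % n, data[r][(r + 1 - s) % n]) for r in range(n)]
        for dst, c, val in sends:
            data[dst][c] = val
    return data
```

The appeal for gradient aggregation is that per‑rank traffic stays constant as the ring grows, which is why the pattern maps well onto a flat high‑bandwidth fabric like RoCE.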

DiDi also built the DDL (DiDi Deep Learning) Serving framework, whose adaptive batching and RPC optimizations delivered up to three‑fold latency improvements for lightweight models. To handle the diversity of hardware, the team created the IFX deep‑learning framework, offering GPU‑native concurrency, extensive operator optimizations, and a mobile variant that outperforms TensorFlow and TensorRT on its target chips.
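The core idea of adaptive batching can be sketched as follows; `max_batch` and `max_wait_ms` are illustrative knobs, not DDL Serving's actual API:

```python
import time
from queue import Queue, Empty

def adaptive_batch(requests: Queue, max_batch: int, max_wait_ms: float):
    """Block for one request, then keep filling the batch until it is full
    or max_wait_ms has elapsed since the first arrival: quiet queues pay a
    tiny delay, busy queues fill a full batch almost immediately."""
    batch = [requests.get()]
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break
    return batch
```

Batching this way trades a bounded queueing delay for much better GPU utilization, since one large inference call amortizes kernel-launch and RPC overhead across many requests.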

For low‑level performance tuning, an autotuning toolchain supporting Kepler, Pascal, and Volta assembly was released, allowing users to submit GPU binaries and receive near‑optimal compiled code for their hardware.
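In spirit, an autotuner benchmarks candidate code variants and keeps the fastest. This is a toy exhaustive‑search version; real GPU tuners like DiDi's operate on generated assembly for specific architectures and prune the search space rather than brute‑forcing it:

```python
import time

def autotune(run_kernel, configs, warmup=2, reps=5):
    """Benchmark every candidate config and keep the fastest average runtime.
    Production tuners search tile sizes, unroll factors, register budgets,
    and instruction schedules rather than opaque config labels."""
    best_cfg, best_time = None, float("inf")
    for cfg in configs:
        for _ in range(warmup):        # warm caches before timing
            run_kernel(cfg)
        start = time.perf_counter()
        for _ in range(reps):
            run_kernel(cfg)
        avg = (time.perf_counter() - start) / reps
        if avg < best_time:
            best_cfg, best_time = cfg, avg
    return best_cfg, best_time
```

The submit‑binary, receive‑optimized‑code workflow described above wraps a loop like this behind a service boundary, so users never touch the per‑architecture details themselves.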

Beyond internal use, DiDi Cloud now offers the platform as a service. GPU resources are exposed via KVM‑based GPU passthrough for stronger isolation. The Elastic Inference Service (EIS) builds on DDL Serving to provide a turnkey model‑deployment solution that abstracts hardware details, reduces repetitive optimization work, and offers fine‑grained QPS‑based billing.
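Fine‑grained QPS billing could, for example, meter each hour at its observed peak QPS above a free allowance. This is purely a toy illustration of the idea; EIS's actual price model is not described in the article:

```python
def qps_bill(hourly_peak_qps, rate_per_qps_hour, free_qps=0):
    """Toy metered bill: charge each hour at its peak observed QPS minus a
    free allowance. Function name, parameters, and policy are hypothetical."""
    return sum(max(q - free_qps, 0) * rate_per_qps_hour for q in hourly_peak_qps)
```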

To further lower the barrier for external customers, DiDi introduced “JianShu”, a packaged solution that bundles weak‑isolation resource management, task scheduling, monitoring, and rapid service deployment, enabling other companies to adopt platform capabilities without reinventing the wheel.

Looking ahead, DiDi aims for a 3.0 stage in which the platform offers a unified internal‑external architecture; AI marketplaces for algorithms, models, and data; GUI tooling; and domain‑specific services such as face verification, speech recognition, and translation, while continuing to improve compute efficiency and cost effectiveness.

Tags: Machine Learning, Platform Engineering, Kubernetes, Resource Management, AI Infrastructure, GPU Computing
Written by Didi Tech, the official Didi technology account.