How Didi’s Jianshu Machine Learning Platform Boosts AI Development Efficiency

This article walks through Didi’s Jianshu Machine Learning Platform and its end‑to‑end AI workflow, from experiment environments and batch training to high‑availability online serving. It covers resource‑efficient Kubernetes scheduling, Docker‑based reproducible environments, a custom parameter server, and the IFX inference runtime, which together accelerate development, training, and deployment.

Didi Tech

Introduction

Didi’s Jianshu Machine Learning Platform is the AI‑foundation component of the “Group Goose” initiative. It provides a unified environment that covers the full lifecycle of deep‑learning projects: model research, batch training, and online inference for intelligent‑mobility services.

Platform Architecture

The system is organized into three functional modules:

Experiment Environment – Interactive Jupyter, SSH, and VNC sessions for rapid prototyping.

Offline Training – Scalable batch training on GPU clusters managed by Kubernetes.

Online Service – High‑availability model serving with built‑in load balancing and storage integration.

The supporting infrastructure consists of a network file system for hot data, an image repository for environment snapshots, and a load balancer that keeps online services reliable. The platform interoperates with existing HDFS, Spark, and other big‑data tools.

<img src="https://mmbiz.qpic.cn/mmbiz_png/jE5bOw22iaBvzo0M912TxP4UJEnxUDticDb80iaiaFyicSwrgJI88Jzxibtvj8qqnyccpnaGLkYPUo5RdXQicAwjS6rmQ/640" alt="Platform Overview Diagram"/>

Key Technical Capabilities

Resource‑Efficient Utilization

Kubernetes is used for dynamic GPU allocation. Resources are provisioned on demand and released automatically, turning a large GPU pool into an elastic shared resource pool.
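The on‑demand allocate and release cycle can be illustrated with a small bookkeeping model. This is a minimal sketch, not Didi's scheduler: the `GpuPool` class, its method names, and the job names are hypothetical. On a real cluster this accounting is done by the Kubernetes scheduler, typically through an extended resource such as `nvidia.com/gpu` requested in the container spec.

```python
from dataclasses import dataclass, field


@dataclass
class GpuPool:
    """Toy model of a shared GPU pool (hypothetical, for illustration only)."""
    total: int
    leases: dict = field(default_factory=dict)  # job name -> GPUs held

    @property
    def free(self) -> int:
        return self.total - sum(self.leases.values())

    def allocate(self, job: str, gpus: int) -> bool:
        """Grant GPUs on demand; refuse if the pool cannot cover the request."""
        if gpus <= 0 or job in self.leases or gpus > self.free:
            return False
        self.leases[job] = gpus
        return True

    def release(self, job: str) -> int:
        """Return a finished job's GPUs to the pool."""
        return self.leases.pop(job, 0)


pool = GpuPool(total=8)
pool.allocate("train-a", 4)
pool.allocate("train-b", 4)
rejected = not pool.allocate("train-c", 2)  # pool is full, request is refused
pool.release("train-a")                     # job finishes, its GPUs come back
accepted = pool.allocate("train-c", 2)      # the same request now fits
```

The point of the sketch is that no GPU is tied to a user or team: capacity freed by one job is immediately available to the next request.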

Accelerated Development

Docker images encapsulate the full development stack (OS, CUDA, Python, TensorFlow, PyTorch, Caffe, Jupyter, etc.). Users build a custom image once and reuse it across experiments, reducing environment‑setup time from days to under a minute. The platform provides SSH, VNC, and Jupyter interfaces for immediate access.
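One way a platform can make "build once, reuse everywhere" work is to key each image on the content of its environment specification. The source does not describe how Jianshu names or caches images, so the sketch below is purely an assumption: `image_tag`, `ImageCache`, and the spec fields are hypothetical, and the "build" is a counter standing in for a real image build.

```python
import hashlib
import json


def image_tag(spec: dict) -> str:
    """Derive a stable tag from an environment spec (hypothetical scheme)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return "env-" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


class ImageCache:
    """Remember which specs were already built so they are built only once."""

    def __init__(self):
        self.built = {}   # tag -> spec
        self.builds = 0   # number of (simulated) slow builds performed

    def get_or_build(self, spec: dict) -> str:
        tag = image_tag(spec)
        if tag not in self.built:
            self.builds += 1          # stand-in for the one-time image build
            self.built[tag] = spec
        return tag


cache = ImageCache()
# Placeholder spec values; key order differs but the content is identical.
spec_a = {"python": "3.x", "framework": "tensorflow", "tools": ["jupyter"]}
spec_b = {"tools": ["jupyter"], "framework": "tensorflow", "python": "3.x"}
tag_a = cache.get_or_build(spec_a)
tag_b = cache.get_or_build(spec_b)   # same content, so the build is reused
```

Because the tag depends only on the spec's content, a second experiment with the same stack starts from the existing image instead of repeating the setup.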

Fast Offline Training

The platform parses user Python code to identify hyper‑parameters automatically, generates a full set of training jobs, and launches them on the GPU cluster. For distributed training, a custom Didi parameter server implements a ring‑based Allreduce algorithm over RDMA‑optimized communication. The ring topology removes the central bottleneck of a single aggregation node, overlaps computation with data transfer, and takes GPU topology and CPU affinity into account. Reported benchmarks on a 40 Gbps RoCE v2 network show higher throughput and lower latency than OpenMPI and NVIDIA NCCL2.
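The hyper‑parameter step described above can be sketched with Python's standard `ast` module. The source does not say how the platform recognizes hyper‑parameters, so this example assumes a hypothetical convention: the user script assigns a literal dict named `HYPERPARAMS` at module level, and one job is generated per combination of candidate values. `USER_SCRIPT`, `find_hyperparams`, and `expand_jobs` are illustrative names.

```python
import ast
import itertools

# Hypothetical user script; it is parsed as text and never executed.
USER_SCRIPT = '''
HYPERPARAMS = {"lr": [0.1, 0.01], "batch_size": [32, 64]}

def train(lr, batch_size):
    pass
'''


def find_hyperparams(source: str, name: str = "HYPERPARAMS") -> dict:
    """Return the literal dict assigned to `name` at module level, if any."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == name:
                    return ast.literal_eval(node.value)
    return {}


def expand_jobs(space: dict) -> list:
    """Build one job config per combination of candidate values (full grid)."""
    keys = list(space)
    value_lists = [space[k] for k in keys]
    return [dict(zip(keys, combo)) for combo in itertools.product(*value_lists)]


jobs = expand_jobs(find_hyperparams(USER_SCRIPT))
```

Each entry in `jobs` would then be submitted as a separate training job. A full grid is only one possible expansion strategy; random or adaptive search would plug into the same place.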

High‑Performance Online Inference

Two components accelerate serving:

Elastic Inference Service (EIS) – Adaptive batching and an optimized RPC protocol. In an MNIST benchmark, EIS achieves 2.5× higher QPS and one‑third the latency of TensorFlow Serving on identical hardware.

IFX Runtime – A custom deep‑learning runtime that manages GPU contexts, reorders instructions, and optimizes memory access. IFX outperforms TensorFlow and TensorRT on both server‑grade GPUs and mobile devices.
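Adaptive batching is commonly implemented as "dispatch when the batch is full or when the oldest request has waited long enough". Whether EIS uses exactly this policy is not stated in the source, so the sketch below is an assumption. It replays request arrival times instead of using threads, and `form_batches`, `max_batch`, and `max_wait_ms` are hypothetical names and parameters.

```python
def form_batches(arrivals_ms, max_batch=4, max_wait_ms=5.0):
    """Group request arrival times (in ms) into batches.

    A batch is dispatched when it is full, or when the next request arrives
    after the oldest queued request has already waited more than max_wait_ms.
    """
    batches, current = [], []
    for t in sorted(arrivals_ms):
        if current and t - current[0] > max_wait_ms:
            batches.append(current)   # deadline passed before this arrival
            current = []
        current.append(t)
        if len(current) == max_batch:
            batches.append(current)   # batch is full, dispatch immediately
            current = []
    if current:
        batches.append(current)       # flush whatever is left at the end
    return batches


burst = form_batches([0, 1, 2, 3, 4, 5, 6, 7])      # heavy load: full batches
sparse = form_batches([0, 20, 40])                  # light load: no waiting
mixed = form_batches([0, 1, 10, 11, 12, 13, 30])    # batch size follows load
```

Under a burst the batches fill up, which raises GPU utilization and throughput; under light traffic requests go out alone, so latency stays bounded by `max_wait_ms`.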

Autotuning

An autotuning toolchain automatically generates near‑optimal GPU binaries for a given model, handling BLAS tuning and hardware‑specific adjustments without user intervention.
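At its core, autotuning is a search over candidate kernel configurations guided by measured cost. The source does not describe Didi's search strategy or tuning parameters, so the following is a minimal sketch under assumptions: exhaustive search, a hypothetical search space of `tile` and `unroll`, and a synthetic cost function standing in for compiling and timing a real GPU kernel.

```python
import itertools


def autotune(search_space: dict, measure) -> tuple:
    """Try every configuration and return (best_config, best_cost)."""
    keys = list(search_space)
    best_cfg, best_cost = None, float("inf")
    for combo in itertools.product(*(search_space[k] for k in keys)):
        cfg = dict(zip(keys, combo))
        cost = measure(cfg)           # in practice: build and time the kernel
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


def synthetic_cost(cfg: dict) -> int:
    """Made-up cost model, cheapest at tile=32 and unroll=4."""
    return abs(cfg["tile"] - 32) + (0 if cfg["unroll"] == 4 else 3)


space = {"tile": [8, 16, 32, 64], "unroll": [1, 2, 4, 8]}
best_cfg, best_cost = autotune(space, synthetic_cost)
```

Replacing `synthetic_cost` with a function that benchmarks a kernel on the target hardware turns the same loop into a hardware‑specific tuner, which is why the result differs per GPU model.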

Deployment Details

Experiment environments are launched via Jupyter, SSH, or VNC; Docker images can be user‑defined or shared across teams. Offline training jobs are scheduled by Kubernetes, while the parameter server’s ring Allreduce leverages RDMA to avoid bandwidth contention. Storage relies on a parallel file system with high‑throughput Ethernet and RoCE support, providing the I/O bandwidth required by GPU‑intensive workloads.
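The reason a ring Allreduce avoids bandwidth contention is visible in a small simulation. The code below is the textbook two‑phase algorithm (reduce‑scatter, then all‑gather) written in plain Python over lists; it is not Didi's implementation and involves no RDMA or GPUs. The function name and the `sent` counter are illustrative.

```python
def ring_allreduce(vectors):
    """Sum equal-length vectors across n simulated workers arranged in a ring.

    Returns (per-worker results, elements sent by each worker).
    """
    n = len(vectors)
    size = len(vectors[0])
    if any(len(v) != size for v in vectors) or size % n != 0:
        raise ValueError("need equal-length vectors divisible into n chunks")
    chunk = size // n
    data = [list(v) for v in vectors]   # each worker's local buffer
    sent = [0] * n

    def exchange(chunk_of, combine):
        # Snapshot every payload first so all sends in a step are simultaneous.
        payloads = []
        for w in range(n):
            start = chunk_of(w) * chunk
            payloads.append((start, data[w][start:start + chunk]))
        for w, (start, payload) in enumerate(payloads):
            dst = (w + 1) % n           # each worker talks only to its neighbor
            for offset, value in enumerate(payload):
                i = start + offset
                data[dst][i] = combine(data[dst][i], value)
            sent[w] += chunk

    # Phase 1, reduce-scatter: worker w ends up owning the full sum of chunk w+1.
    for step in range(n - 1):
        exchange(lambda w: (w - step) % n, lambda old, new: old + new)
    # Phase 2, all-gather: completed chunks circulate until everyone has all.
    for step in range(n - 1):
        exchange(lambda w: (w + 1 - step) % n, lambda old, new: new)
    return data, sent


vectors = [[10 * w + i for i in range(8)] for w in range(4)]
reduced, sent = ring_allreduce(vectors)
expected = [sum(col) for col in zip(*vectors)]
```

Every worker sends `2 * (n - 1) * size / n` elements, always to a single neighbor, so per‑link traffic stays roughly constant as workers are added instead of concentrating on one aggregation node.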

<img src="https://mmbiz.qpic.cn/mmbiz_png/yeVCGwzSKtBUsaRlx5EjupJXEtBO65qBLN9zpHYKictNCF6DKk9JfmLEpDZUYPvswuwibEhludFYnzEHzVPou5Ig/640" alt="Training Architecture Diagram"/>

Future Directions

The roadmap expands the platform with data and algorithm marketplaces, a model marketplace, and additional AI services such as face comparison, speech recognition, and translation. Continued focus will be on scaling GPU resources, improving network‑SDN integration, and enriching reusable assets to further boost AI productivity.

<img src="https://mmbiz.qpic.cn/mmbiz_png/jE5bOw22iaBsljlfO1AYh3BHWjNlasjdtJWAdwFxudUqO5vxokeLbS4mV9meEho837iaiaQWSUBGfI79ts16BO7ag/640" alt="Future Outlook Diagram"/>