
DiDi PS: High-Performance RDMA-Based Parameter Server for Distributed Deep Learning

DiDi PS is a custom RDMA‑based parameter server that uses a ring topology and optimized ibverbs communication to dramatically accelerate distributed deep‑learning training. In our benchmarks it consistently outperforms OpenMPI, NCCL2, TensorFlow's built‑in RDMA support, and Horovod, while providing more stable and scalable synchronization for massive data workloads.

Didi Tech

In the field of machine learning, distributed training is a persistent challenge. Single‑node hardware improvements often cannot keep pace with the ever‑increasing compute demands of the business, so distributed training aims for horizontal scaling: more overall compute to process more data or run more experiments in less time, ultimately yielding higher‑accuracy models. At DiDi's machine learning platform, distributed training ensures timeliness and efficiency both in experimental model tuning and in production model updates, and it also underpins techniques such as automated hyper‑parameter search.

Many machine‑learning frameworks already offer distributed training solutions, but DiDi faces greater challenges than most industry peers. The platform serves not only traditional image, speech, and video workloads but also massive traffic, map, and road‑condition data. For example, a single day of ETA (estimated time of arrival) model training in Beijing exceeds the size of all public image datasets, and the platform processes over 30 million orders per day, generating an enormous volume of traffic data.

In distributed training, network traffic is large, making parameter synchronization performance critical. Parameter servers (PS) handle this synchronization, and their performance directly impacts overall training speed. While academia and industry often focus on gradient compression or asynchronous SGD, they pay less attention to the underlying synchronization mechanisms. Improving PS synchronization can benefit a wide range of algorithms and techniques.
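To make the synchronization step concrete, here is a minimal single-process sketch of the classic centralized parameter-server pattern (all names here are hypothetical and illustrative, not DiDi PS code): workers push gradients, the server averages them and applies the update, and workers pull the new parameters.

```python
# Sketch of one centralized parameter-server sync step (illustrative only):
# a real PS moves these vectors over the network, e.g. via RDMA.

class ParameterServer:
    def __init__(self, params):
        self.params = list(params)
        self.pending = []          # gradients pushed during this step

    def push(self, grad):
        self.pending.append(grad)

    def apply(self, lr=0.1):
        """Average pending gradients element-wise and take one SGD step."""
        n = len(self.pending)
        for i in range(len(self.params)):
            avg = sum(g[i] for g in self.pending) / n
            self.params[i] -= lr * avg
        self.pending.clear()

    def pull(self):
        return list(self.params)

ps = ParameterServer([1.0, 1.0])
for grad in ([0.5, 1.0], [1.5, 1.0]):   # two workers push their gradients
    ps.push(grad)
ps.apply()
print(ps.pull())  # [0.9, 0.9]
```

Every optimization discussed below targets exactly this exchange: getting the gradients to the reducer and the results back as fast as the network allows.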

Current industry synchronization solutions typically use Allreduce via frameworks such as OpenMPI or NVIDIA’s NCCL2. OpenMPI is an open‑source implementation of the MPI standard, supporting TCP, RDMA, and other transports. NCCL provides high‑performance collective communication for GPUs. Many PS implementations, like Uber’s Horovod, rely on these Allreduce APIs. However, our tests show that both OpenMPI and NCCL2 fall short of ideal performance, leaving substantial optimization space.
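The Allreduce contract itself is simple: every rank contributes a buffer, and every rank receives the element-wise reduction of all ranks' buffers. The toy single-process simulation below captures only that semantics (real implementations such as OpenMPI's `MPI_Allreduce` or NCCL's `ncclAllReduce` perform it over the network or NVLink):

```python
# Toy simulation of the Allreduce contract: after the call, every rank
# holds the element-wise reduction (here: sum) of all ranks' inputs.
# Semantics only -- no actual communication happens.

def allreduce_sum(buffers):
    reduced = [sum(col) for col in zip(*buffers)]
    return [reduced[:] for _ in buffers]   # each rank gets a full copy

ranks = [[1, 2], [10, 20], [100, 200]]    # one input buffer per rank
out = allreduce_sum(ranks)
assert all(r == [111, 222] for r in out)
```

How efficiently an implementation honors this contract over a real network is precisely where OpenMPI and NCCL2 leave room for improvement.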

DiDi primarily uses Google’s TensorFlow for deep learning. Although TensorFlow offers a distributed training solution, it relies on gRPC over TCP, which delivers sub‑optimal performance and usability. TensorFlow’s RDMA support is an afterthought, requiring dynamic tensor memory allocation that forces frequent memory registration, introduces extra metadata caching, and restricts message pipelining, all of which degrade performance.

There are many other smaller inefficiencies; interested readers can consult TensorFlow's official RDMA documentation or source code. Moreover, even the latest TensorFlow 1.8 binaries do not enable RDMA by default, so using it requires compiling from source.

Among open‑source PS solutions, Uber’s Horovod is popular but still depends on MPI or NCCL for the low‑level Allreduce, which we consider not the most efficient. Consequently, DiDi developed its own RDMA‑based parameter server, DiDi PS, emphasizing performance and efficiency.

DiDi PS adopts a ring topology and implements an efficient Allreduce algorithm using the ibverbs API. Unlike a centralized server‑client model, the ring reduces communication volume and avoids bandwidth contention and network congestion. Our custom RDMA library minimizes ibverbs overhead, overlaps computation with data transfer, reduces unnecessary memory copies, and accounts for GPU topology and CPU affinity.
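Ring Allreduce splits each buffer into N chunks and runs two phases: a reduce-scatter phase, after which each node owns the fully reduced copy of one chunk, followed by an allgather phase that circulates those reduced chunks until every node has the complete result. The single-process sketch below illustrates the chunk movement (it is not DiDi's ibverbs implementation; each "send" here is just a list update):

```python
def ring_allreduce(buffers):
    """Single-process simulation of ring Allreduce (sum reduction).

    n nodes, each buffer split into n chunks (one element per chunk
    here for clarity). Real implementations move the chunks over the
    network; this sketch only models who sends what, when.
    """
    n = len(buffers)
    bufs = [list(b) for b in buffers]

    # Phase 1: reduce-scatter. At step s, node i sends chunk (i - s) % n
    # to its ring successor, which accumulates it. After n-1 steps,
    # node i owns the fully reduced chunk (i + 1) % n.
    for s in range(n - 1):
        in_flight = [bufs[i][(i - s) % n] for i in range(n)]  # snapshot sends
        for i in range(n):
            bufs[(i + 1) % n][(i - s) % n] += in_flight[i]

    # Phase 2: allgather. The reduced chunks circulate around the ring
    # for n-1 more steps until every node holds the complete result.
    for s in range(n - 1):
        in_flight = [bufs[i][(i + 1 - s) % n] for i in range(n)]
        for i in range(n):
            bufs[(i + 1) % n][(i + 1 - s) % n] = in_flight[i]
    return bufs

# Each "node" contributes one row; all nodes end with the element-wise sum.
result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(result[0])  # [12, 15, 18]
```

Because every node only ever talks to its two ring neighbors, link utilization is uniform and no single node becomes a bandwidth hotspot, which is the congestion-avoidance property described above.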

We benchmarked DiDi PS against OpenMPI and NCCL2 on the Allreduce primitive across various cluster sizes (2, 3, 4, and 5 machines, i.e., rings of 8, 12, 16, and 20 nodes respectively). On a 256 MiB single‑precision averaging operation, DiDi PS consistently outperformed both alternatives.
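For context, the standard ring-Allreduce analysis (a textbook property, not a DiDi PS measurement) says each node transfers 2(N−1)/N times the buffer size per operation, so per-node traffic stays nearly flat as the ring grows:

```python
# Per-node bytes moved by ring Allreduce: 2 * (n - 1) / n * buffer size.
# Standard bandwidth analysis applied to the 256 MiB benchmark buffer.
SIZE_MIB = 256  # single-precision buffer size used in the benchmark

def per_node_traffic_mib(n_nodes, size_mib=SIZE_MIB):
    return 2 * (n_nodes - 1) / n_nodes * size_mib

for n in (8, 12, 16, 20):   # ring sizes from the benchmark clusters
    print(n, round(per_node_traffic_mib(n), 1))
```

The per-node load grows from 448 MiB at 8 nodes to only 486.4 MiB at 20 nodes, which is why ring-based synchronization scales so much better than a centralized server whose inbound traffic grows linearly with the number of workers.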

To facilitate adoption, we wrapped DiDi PS as a TensorFlow operation, allowing it to be used like native TensorFlow ops. We further evaluated end‑to‑end training performance using the Inception‑v3 model on ImageNet, comparing DiDi PS, TensorFlow’s built‑in distributed training, and Uber’s Horovod. DiDi PS again showed superior performance and more stable behavior; Horovod and TensorFlow’s RDMA implementations exhibited large performance fluctuations and occasional crashes.

Although DiDi PS already delivers impressive performance, ongoing optimizations continue to push its capabilities further to meet future business needs.

Tags: Performance, Deep Learning, TensorFlow, Distributed Training, Allreduce, Parameter Server, RDMA
Written by Didi Tech

Official Didi technology account