How Alibaba’s PAISoar Accelerates Deep Learning: 101× Speedup on 128 GPUs

Alibaba engineers detail the PAISoar distributed training framework, showing how RDMA‑optimized hardware, a Ring AllReduce algorithm, and user‑friendly APIs bring deep‑learning models such as the GreenNet CNN to a 101‑fold speedup on 128 GPUs, cutting training time from days to under a day.

Alibaba Cloud Developer

Jun 12, 2019

1. Overview

In recent years deep learning has advanced rapidly in image processing and speech recognition, with network architectures evolving from AlexNet to Inception‑ResNet and SENet, reducing ImageNet top‑5 error to 2.25%.

As model depth and data volume grow, efficient distributed training becomes critical. TensorFlow’s default parameter‑server mode faces challenges such as unbalanced variable placement, limited PS bandwidth, and the need for extensive hyper‑parameter tuning.

2. PAISoar: A Distributed Training Framework Based on PAI TensorFlow

2.1 PAISoar Overview

PAISoar provides an end‑to‑end solution from hardware to software for high‑performance distributed training.

2.1.1 Hardware Layer

Built the first large‑scale RoCE‑based RDMA cluster within Alibaba Group, using Mellanox 25 GbE NICs for low‑latency, high‑throughput lossless transmission.

Deployed 8×100 Gb optical links between access and aggregation switches, achieving 1:1 convergence.

Implemented a multi‑level TCP/RDMA flow‑control strategy to mitigate traffic interference.

2.1.2 Software Layer

Integrated RDMA drivers into PAI TensorFlow and tuned verbs‑based communication.

Optimized the critical path of RDMA communication by accelerating memory copies, sending data asynchronously, and improving the state machine.

Developed a highly optimized Ring AllReduce algorithm tailored for RDMA networks, greatly increasing multi‑node training performance.

2.1.3 API Layer

Provided ReplicatedVarsOptimizer to simplify conversion of single‑node models to distributed training.

Introduced smooth_exponential_decay for learning‑rate warm‑up followed by exponential decay, easing convergence.
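The article does not give the signature or exact formula of smooth_exponential_decay, so the following is only a minimal sketch of the general idea (a warm‑up ramp followed by exponential decay). The function name, its parameters, and the choice of a linear ramp are assumptions made for illustration and are not PAISoar's actual API.

```python
def warmup_then_exponential_decay(step, base_lr, warmup_steps,
                                  decay_steps, decay_rate):
    """Hypothetical schedule: linear warm-up, then exponential decay.

    Not the PAISoar implementation; names and formula are assumptions.
    """
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    # Exponential decay, counted from the end of the warm-up phase.
    return base_lr * decay_rate ** ((step - warmup_steps) / decay_steps)


if __name__ == "__main__":
    for s in (0, 50, 99, 100, 1100):
        print(s, warmup_then_exponential_decay(s, 0.1, 100, 1000, 0.5))
```

Starting the decay where the ramp ends keeps the schedule continuous, which is one plausible reading of "smooth" in the function's name.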

3. Performance Results

On TensorFlow official benchmarks (Inception v3, ResNet‑50, ResNet‑152, VGG16) PAISoar achieved significant speedups. Replacing the default gRPC communication with RDMA increased performance by up to 44.83% on 64 GPUs. Using Ring AllReduce further boosted speedups, e.g., Inception v3 improved by 84.77% on 64 GPUs.

Compared with Horovod, PAISoar delivered better or comparable gains across the four models.

Figure: performance scaling.

4. RDMA Technology

RDMA enables kernel‑bypass and zero‑copy data transfer, reducing latency to 2‑3 µs and increasing throughput. Alibaba’s data centers adopt RoCE v2 with end‑to‑end QoS (DSCP, PFC, ECN, DCQCN) to guarantee lossless Ethernet.

Figure: stable bandwidth in a 32‑node cluster.

5. Ring AllReduce Technique

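Ring AllReduce places devices in a logical ring so that each device sends data to one neighbor and receives from the other at the same time. Because every device only ever exchanges shards with its neighbors, the amount of data each device transfers is nearly independent of the number of workers.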
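A short calculation makes the scaling behavior concrete. With N devices and a tensor split into N equal shards, each device sends one shard per round over 2(N‑1) rounds, so it sends 2(N‑1)/N times the tensor size in total. The helper name below is hypothetical, and the 100 MB tensor is an illustrative value, not a figure from the article.

```python
def ring_allreduce_bytes_sent_per_device(tensor_bytes, num_devices):
    """Bytes one device sends during a full Ring AllReduce.

    Each device sends one shard (tensor_bytes / num_devices) in each of the
    (num_devices - 1) ScatterReduce rounds and (num_devices - 1) AllGather
    rounds.
    """
    if num_devices < 2:
        return 0.0  # a single device has nothing to exchange
    shard_bytes = tensor_bytes / num_devices
    return 2 * (num_devices - 1) * shard_bytes


if __name__ == "__main__":
    size_mb = 100  # illustrative gradient size in MB (assumption)
    for n in (2, 8, 32, 128):
        print(n, ring_allreduce_bytes_sent_per_device(size_mb, n))
```

The per‑device volume approaches, but never exceeds, twice the tensor size as the ring grows. The number of rounds still increases with N, so latency per round matters more on large rings.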

Algorithm steps:

Split each gradient tensor into num_devices equal shards.

ScatterReduce: num_devices‑1 rounds of communication and addition to compute partial sums.

AllGather: num_devices‑1 rounds to broadcast each shard’s sum to all devices.

Combine shards and divide by num_devices to obtain the averaged gradient.

Figures illustrate ScatterReduce and AllGather phases.
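The four steps above can be sketched as a single‑process simulation. This is an illustration of the textbook algorithm on plain Python lists, not PAISoar's RDMA implementation. The function name and the shard‑indexing convention are assumptions, and the sketch requires the gradient length to be divisible by the number of devices.

```python
def ring_allreduce_mean(gradients):
    """Simulate Ring AllReduce; gradients[d] is device d's gradient list."""
    n = len(gradients)
    length = len(gradients[0])
    if length % n != 0:
        raise ValueError("gradient length must be divisible by num_devices")
    shard = length // n

    # Step 1: split each device's gradient into n equal shards.
    chunks = [[list(g[i * shard:(i + 1) * shard]) for i in range(n)]
              for g in gradients]

    # Step 2: ScatterReduce. In round r, device d sends shard (d - r) % n to
    # its right neighbour, which adds it to its own copy of that shard.
    for r in range(n - 1):
        # Snapshot outgoing shards first to model simultaneous sends.
        outgoing = [(d, (d - r) % n, list(chunks[d][(d - r) % n]))
                    for d in range(n)]
        for d, idx, data in outgoing:
            dst = (d + 1) % n
            chunks[dst][idx] = [a + b for a, b in zip(chunks[dst][idx], data)]
    # Device d now holds the complete sum for shard (d + 1) % n.

    # Step 3: AllGather. In round r, device d forwards shard (d + 1 - r) % n;
    # the receiver overwrites its copy with the complete sum.
    for r in range(n - 1):
        outgoing = [(d, (d + 1 - r) % n, list(chunks[d][(d + 1 - r) % n]))
                    for d in range(n)]
        for d, idx, data in outgoing:
            chunks[(d + 1) % n][idx] = data

    # Step 4: combine shards and divide by n to get the averaged gradient.
    return [[v / n for part in device for v in part] for device in chunks]


if __name__ == "__main__":
    grads = [[1, 2, 3, 4, 5, 6],
             [10, 20, 30, 40, 50, 60],
             [100, 200, 300, 400, 500, 600]]
    for device_result in ring_allreduce_mean(grads):
        print(device_result)
```

Every simulated device ends with the same averaged gradient, which is the property that lets each replica apply an identical update.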

6. GreenNet Model in Security Department

The GreenNet model, originally built in 2013 for pornographic content detection, evolved from a Bag‑of‑Words classifier to a multi‑layer CNN. It serves billions of requests daily with ~80 ms latency.

Training on a single 2‑GPU machine takes 12 days; using PAISoar on 128 GPUs reduces convergence time to under one day, achieving a 101× compute acceleration.

Distributed hyper‑parameter tuning includes data sharding, learning‑rate warm‑up, and scaling the batch size proportionally to the number of workers.
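Two of these adjustments, data sharding and proportional batch‑size scaling, can be sketched in a few lines. The helper names, the round‑robin sharding rule, and the per‑worker batch size of 32 are assumptions for illustration; the article does not state how GreenNet's data was actually partitioned or which batch sizes were used.

```python
def shard_indices(num_samples, num_workers, worker_index):
    """Round-robin sharding: worker k reads samples k, k + W, k + 2W, ..."""
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index must be in [0, num_workers)")
    return range(worker_index, num_samples, num_workers)


def scaled_global_batch_size(per_worker_batch_size, num_workers):
    """Global batch size grows in proportion to the number of workers."""
    return per_worker_batch_size * num_workers


if __name__ == "__main__":
    print(list(shard_indices(10, 4, 1)))
    print(scaled_global_batch_size(32, 128))  # 32 per worker is hypothetical
```

Sharding this way gives each worker a disjoint slice of the dataset, so no sample is processed twice within an epoch. A larger global batch is one common reason to pair scaling with learning‑rate warm‑up.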

Training on 32 GPUs (16 machines) converged in 20 hours, 14.4× faster than the single‑node baseline, while maintaining identical ROC performance.
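The 14.4× figure follows directly from the two training times quoted above. The scaling‑efficiency value computed below is derived here from those numbers and does not appear in the source; it assumes the baseline is the single 2‑GPU machine and that speedup is measured as time to convergence.

```python
baseline_hours = 12 * 24      # 12 days on a single 2-GPU machine
distributed_hours = 20        # 32 GPUs across 16 machines

# Time-to-convergence speedup over the single-node baseline.
speedup = baseline_hours / distributed_hours

# Derived (not stated in the article): speedup relative to the 16x
# increase in GPU count, i.e. 32 GPUs versus the 2-GPU baseline.
gpu_ratio = 32 / 2
scaling_efficiency = speedup / gpu_ratio

if __name__ == "__main__":
    print(f"speedup: {speedup:.1f}x")
    print(f"scaling efficiency: {scaling_efficiency:.0%}")
```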

Figure: ROC improvement after upgrading to Inception v4.

7. Summary and Outlook

Collaboration between the AIS network team, RDMA project team, and the security department successfully launched PAISoar, delivering up to 101× speedup on 128 GPUs for the GreenNet model. Future work includes exploring new network topologies, parameter‑sparse communication, and further simplifying user adoption.
