Tagged articles

Ring AllReduce

4 articles

Feb 15, 2022 · Artificial Intelligence

Horovod Distributed Deep Learning Training: Architecture, Performance, and Kubernetes Deployment

This article provides a comprehensive overview of Horovod, Uber's open-source distributed deep learning framework, covering its architecture, communication mechanisms, performance benchmarks, and deployment on Kubernetes and Spark for accelerated multi-GPU training.

Deep Learning · GPU Acceleration · Horovod

17 min read

Tencent Cloud Developer

Sep 1, 2021 · Artificial Intelligence

Why Distributed Machine Learning Accelerates AI Training at Scale

This article reviews how distributed machine learning tackles massive data and compute demands by partitioning models and data across workers, optimizing communication with communication primitives, parameter servers, and Ring AllReduce, reducing I/O overhead, and applying optimizers such as LARS and LAMB to achieve faster, more scalable training.

LAMB optimizer · LARS optimizer · Parameter Server

31 min read

Alibaba Cloud Developer

Jun 12, 2019 · Artificial Intelligence

How Alibaba’s PAISoar Accelerates Deep Learning: 101× Speedup on 128 GPUs

Alibaba engineers detail the PAISoar distributed training framework, showing how RDMA‑optimized hardware, Ring AllReduce algorithms, and user‑friendly APIs speed up training of deep‑learning models such as the GreenNet CNN by 101× on 128 GPUs, reducing training time from days to under a day.

AI Infrastructure · Deep Learning · GPU Acceleration

17 min read

Alibaba Cloud Developer

Jun 5, 2017 · Artificial Intelligence

Alibaba’s Distributed Training Boosts Neural Machine Translation Speed

Since its debut in 2013, Neural Machine Translation (NMT) has approached human quality, but it remains expensive to train. In 2017, Alibaba’s team built a distributed NMT system that combines data parallelism, model averaging, BMUF, Downpour SGD, and Ring AllReduce to cut training time from over 20 days to a few days while maintaining translation quality.

BMUF · Downpour SGD · Model Averaging

18 min read
