Artificial Intelligence 15 min read

Bagua: An Open‑Source Distributed Training Framework for Deep Learning

Bagua is a distributed training framework co‑developed by Kuaishou and ETH Zürich that combines algorithmic and system‑level optimizations—such as decentralized, asynchronous, and compressed communication—to achieve up to 60% higher performance than existing frameworks like PyTorch‑DDP, Horovod, and BytePS across various AI workloads.

Kuaishou Tech

Recently, Kuaishou and ETH Zürich announced the open‑source distributed training framework Bagua, which goes beyond system‑only optimizations of existing deep‑learning frameworks (e.g., PyTorch, TensorFlow) by jointly optimizing algorithms and system layers, delivering up to 60% performance improvement over peers.

Background: As Moore's law stalls, single‑device compute can no longer keep pace with exponential data growth (for example, more than 10 million videos are uploaded to Kuaishou every day). Training a model like ResNet on a single GPU would take over 100 days, making multi‑node, multi‑GPU parallel training essential; yet communication overhead often limits the achievable speedup.
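To see why communication overhead caps the benefit of adding GPUs, a back‑of‑envelope model helps: if compute divides across workers but each step pays a roughly fixed communication cost, speedup saturates well below the ideal. This is an illustrative sketch, not a model or numbers from the article:

```python
def speedup(n_gpus, compute_s, comm_s):
    """Speedup vs. one GPU when per-step compute divides across n_gpus
    but each step also pays a fixed communication cost."""
    single = compute_s
    parallel = compute_s / n_gpus + comm_s
    return single / parallel

# With 1.0 s of compute and 0.25 s of communication per step,
# 8 GPUs give about 2.67x, far below the ideal 8x,
# and no GPU count can exceed the 1/0.25 = 4x ceiling.
print(round(speedup(8, 1.0, 0.25), 2))
```

Shrinking or hiding the communication term (the `comm_s` above) is exactly what Bagua's compressed, decentralized, and overlapped communication targets.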

Bagua addresses this by designing specific optimization algorithms for distributed scenarios, offering a set of communication options: centralized vs. decentralized, synchronous vs. asynchronous, and full‑precision vs. low‑precision (quantization or sparsification). These options can be combined flexibly, and the framework guarantees convergence and efficiency comparable to traditional methods.
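The low‑precision option can be pictured with a toy 8‑bit quantization round‑trip: each worker compresses its gradient to one byte per element before communicating, and the receiver restores it with a per‑tensor scale. This is an illustrative NumPy sketch of the idea, not Bagua's actual implementation or API:

```python
import numpy as np

def quantize(grad):
    # One scale per tensor maps values into the signed 8-bit range [-127, 127].
    scale = max(float(np.abs(grad).max()), 1e-12) / 127.0
    q = np.clip(np.rint(grad / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Simulate averaging compressed gradients from 4 workers: each sends
# 1 byte per element instead of 4, at the cost of small quantization error.
rng = np.random.default_rng(0)
grads = [rng.standard_normal(1000).astype(np.float32) for _ in range(4)]
recovered = [dequantize(*quantize(g)) for g in grads]
avg = np.mean(recovered, axis=0)
exact = np.mean(grads, axis=0)
print(float(np.abs(avg - exact).max()))  # small quantization error
```

In practice such schemes need error compensation to preserve convergence, which is part of what Bagua's algorithm‑level design guarantees.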

Key system optimizations include hiding communication time within computation, parameter bucketing with contiguous memory management, and hierarchical communication that distinguishes intra‑node and inter‑node traffic, allowing the most suitable algorithm for each physical link.
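The bucketing idea can be sketched in a few lines: many small gradient tensors are packed into one contiguous buffer so that a single collective call replaces many small, latency‑bound ones. A minimal illustration (not Bagua's internal data structures):

```python
import numpy as np

def make_bucket(tensors):
    # Flatten and concatenate tensors into one contiguous buffer.
    sizes = [t.size for t in tensors]
    bucket = np.concatenate([t.ravel() for t in tensors])
    return bucket, sizes

def unpack_bucket(bucket, sizes, shapes):
    # Slice the buffer back into the original tensor shapes.
    out, offset = [], 0
    for n, shape in zip(sizes, shapes):
        out.append(bucket[offset:offset + n].reshape(shape))
        offset += n
    return out

grads = [np.ones((2, 3)), np.full((4,), 2.0), np.zeros((1, 5))]
bucket, sizes = make_bucket(grads)
# ... a single allreduce would operate on `bucket` here ...
restored = unpack_bucket(bucket, sizes, [g.shape for g in grads])
print(all(np.array_equal(a, b) for a, b in zip(grads, restored)))  # True
```

Keeping the bucket contiguous in memory is also what lets communication be launched on finished buckets while later layers are still computing gradients.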

Experimental results show that on 128‑GPU clusters Bagua reaches the same accuracy as PyTorch‑DDP, Horovod, and BytePS in only about 60% of the time, is more robust to low‑bandwidth, high‑latency networks, and offers one‑click integration for existing PyTorch models.

Bagua also supports Kubernetes‑native deployment via a custom operator, offering fault tolerance and dynamic scaling, and has been validated in Kuaishou's production workloads, achieving 20‑30% speedups for large‑scale image, speech, and recommendation tasks, and over 100% improvement for trillion‑parameter recommendation models.

Usage example (Python):

import torch
import bagua.torch_api as bagua
from bagua.torch_api.algorithms import gradient_allreduce

# Bind this process to its local GPU and join the Bagua process group
torch.cuda.set_device(bagua.get_local_rank())
bagua.init_process_group()

# Dataset and DataLoader setup (unchanged from plain PyTorch)
train_dataset = ...
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=bagua.get_world_size(), rank=bagua.get_rank())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, sampler=train_sampler)

# Model and optimizer, then wrap the model with a Bagua algorithm
model = model.cuda()
optimizer = ...
algorithm = gradient_allreduce.GradientAllReduceAlgorithm()
model = model.with_bagua([optimizer], algorithm)

For more details, see the Bagua GitHub repository and the paper https://arxiv.org/abs/2107.01499.

Deep Learning · PyTorch · distributed training · Bagua · communication optimization · GPU scaling
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
