
Parallel Strategies for Distributed Deep Learning Training

This article reviews distributed training techniques for large deep‑learning models, covering data parallelism, model parallelism (including pipeline and tensor parallelism), gradient bucketing and accumulation, 3D parallelism, and practical implementations such as Megatron‑LM and 360AI platform optimizations.

360 Smart Cloud

1. Background

With the rapid growth of deep learning, data volumes and model sizes have exploded, making single‑GPU training infeasible. Distributed training has become a key research direction, especially after the 2023 surge of large language models, which brought new challenges in resource utilization and network bandwidth.

Early deep‑learning models fit on a single GPU, but as datasets grew exponentially, data parallelism emerged to split the dataset across multiple nodes, reducing training time roughly linearly with the number of devices. In parallel, model parallelism was introduced to split oversized models across GPUs, enabling training of models that cannot fit on a single device.

This article focuses on parallel strategies for distributed training, including data parallelism and model parallelism (pipeline and tensor parallelism), all of which are implemented in Megatron‑LM.

2. Data Parallelism

2.1 Basic Principle

Training can be expressed as linear‑algebra operations. A toy model with a 32‑float weight matrix (64 bytes in fp16) fits on one GPU. The training loop performs matrix multiplication, loss computation, back‑propagation, and weight update. When the dataset grows, the same model is replicated on multiple GPUs, each processing a distinct data shard. After each iteration, gradients from all GPUs are averaged via an All‑Reduce operation and broadcast back, ensuring model consistency.
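The loop described above can be sketched in a few lines. The following is a minimal numpy stand‑in for two GPU replicas (devices, shapes, and data are illustrative, not from any real training run): each replica computes a gradient on its own data shard, and averaging the shard gradients reproduces the full‑batch gradient, which is exactly what All‑Reduce guarantees.

```python
import numpy as np

# Toy linear model y = x @ W with squared-error loss; two "GPUs" are
# simulated as numpy arrays. All shapes are illustrative.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 2)).astype(np.float32)
X = rng.standard_normal((8, 4)).astype(np.float32)
Y = rng.standard_normal((8, 2)).astype(np.float32)

def grad(Xs, Ys, W):
    # dL/dW for L = mean over samples of ||x @ W - y||^2
    return 2.0 * Xs.T @ (Xs @ W - Ys) / len(Xs)

# Data parallelism: each replica processes a distinct shard of the batch.
X_shards, Y_shards = np.split(X, 2), np.split(Y, 2)
local_grads = [grad(xs, ys, W) for xs, ys in zip(X_shards, Y_shards)]

# All-Reduce (average) gives every replica the same gradient, equal to
# the full-batch gradient because the shards are equally sized.
avg = sum(local_grads) / len(local_grads)
assert np.allclose(avg, grad(X, Y, W), atol=1e-5)
```

After the averaged gradient is broadcast back, every replica applies the same weight update, so the model copies stay bit‑for‑bit consistent across iterations.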

2.2 All‑Reduce

All‑Reduce aggregates gradients across N nodes (often N GPUs). Ring All‑Reduce, popularized by Baidu in 2016, splits the operation into a Reduce‑Scatter phase followed by an All‑Gather phase. Its key property is that the per‑node communication volume, roughly twice the model size, is independent of N, so bandwidth cost does not grow as more nodes are added (the number of latency‑bound steps still grows linearly with N). Real‑world factors such as slow nodes and network latency therefore still limit scalability.
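The two phases can be simulated directly. The sketch below (function name and rank bookkeeping are illustrative, not a real collective library) runs Reduce‑Scatter for N−1 steps, after which each rank owns one fully reduced chunk, then All‑Gather for N−1 steps to circulate the reduced chunks; each rank sends 2(N−1)/N of the data in total.

```python
import numpy as np

def ring_all_reduce(tensors):
    """Simulate Ring All-Reduce over len(tensors) 'ranks' (sum reduction)."""
    n = len(tensors)
    chunks = [list(np.array_split(t.astype(np.float64), n)) for t in tensors]
    # Reduce-Scatter: at step s, rank r receives chunk (r-s-1) mod n from
    # rank (r-1) mod n and accumulates it. Snapshot first so all "sends"
    # within a step use values from the start of the step.
    for s in range(n - 1):
        incoming = [chunks[(r - 1) % n][(r - s - 1) % n].copy() for r in range(n)]
        for r in range(n):
            chunks[r][(r - s - 1) % n] += incoming[r]
    # Now rank r owns the fully reduced chunk (r+1) mod n.
    # All-Gather: the reduced chunks circulate around the ring.
    for s in range(n - 1):
        incoming = [chunks[(r - 1) % n][(r - s) % n].copy() for r in range(n)]
        for r in range(n):
            chunks[r][(r - s) % n] = incoming[r]
    return [np.concatenate(c) for c in chunks]

data = [np.arange(6, dtype=np.float64) * (r + 1) for r in range(3)]
out = ring_all_reduce(data)
assert all(np.allclose(o, sum(data)) for o in out)  # every rank sees the sum
```

Each rank sends n−1 chunks in each phase, i.e. 2(n−1)/n times the tensor size, matching the "roughly twice the model size" figure above.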

2.3 Gradient Bucketing and Accumulation

PyTorch Distributed Data Parallel (DDP) introduces gradient bucketing: parameters are grouped into buckets; once a bucket’s gradients are ready, an All‑Reduce is performed, overlapping communication with computation. Gradient accumulation further increases effective batch size by accumulating micro‑batch gradients locally before a single synchronization step, reducing communication frequency and alleviating GPU memory pressure.
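The effect of gradient accumulation on communication can be shown with a small numpy simulation (replica counts, shapes, and data are synthetic): summing k micro‑batch gradients locally and synchronizing once yields the same effective gradient as synchronizing after every micro‑batch, with 1/k the number of All‑Reduce rounds.

```python
import numpy as np

# Gradient accumulation sketch: 2 replicas, k micro-batches each.
rng = np.random.default_rng(1)
k, replicas, dim = 4, 2, 3
grads = rng.standard_normal((replicas, k, dim))

# Baseline: All-Reduce after every micro-batch (k sync rounds).
baseline = np.zeros(dim)
for step in range(k):
    baseline += grads[:, step].mean(axis=0)   # one All-Reduce per step
baseline /= k

# Accumulated: local sums only, then a single All-Reduce (1 sync round).
local = grads.sum(axis=1) / k                 # no communication
accumulated = local.mean(axis=0)              # the only All-Reduce

assert np.allclose(baseline, accumulated)
```

In PyTorch DDP the same pattern is expressed by skipping gradient synchronization on all but the last micro‑batch of each accumulation window, so the effective batch size grows without extra per‑step communication.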

2.4 Communication Analysis in Megatron‑LM

For a 1.8 B‑parameter GPT model (seq‑length = 1024, 24 layers, hidden size = 2048) the model occupies ~3535 MB. On a 2‑GPU node (A100s with NVLink), the data‑parallel size is 2, resulting in ~3535 MB of gradient traffic per GPU per iteration. Scaling to 8 GPUs raises per‑GPU traffic to ~6186 MB. Multi‑NIC setups form multiple communication rings, achieving near‑linear speedup.
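These figures are consistent with the ring All‑Reduce cost formula from Section 2.2: per‑GPU traffic is 2(N−1)/N times the gradient size. A quick check (function name is illustrative):

```python
# Per-GPU All-Reduce traffic in a ring of N devices: each device sends
# (and receives) 2 * (N - 1) / N times the gradient size per iteration.
def ring_traffic_mb(model_mb: float, n_gpus: int) -> float:
    return 2 * (n_gpus - 1) / n_gpus * model_mb

print(round(ring_traffic_mb(3535, 2)))  # 3535
print(round(ring_traffic_mb(3535, 8)))  # 6186
```

For N = 2 the factor is exactly 1, and for N = 8 it approaches 2, which is why traffic grows toward twice the model size but never beyond it.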

3. Model Parallelism

3.1 Pipeline Parallelism

Naïve Pipeline and Micro‑Batch Pipeline

Pipeline parallelism splits model layers across GPUs, forming stages. The naïve approach suffers from “bubble” idle time because only one stage computes at a time. GPipe mitigates this by dividing each mini‑batch into micro‑batches, increasing GPU utilization.
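The benefit of micro‑batching can be quantified with the standard bubble‑fraction formula: with p pipeline stages and m micro‑batches per mini‑batch, (p − 1) of the (m + p − 1) pipeline slots are idle. A small sketch (function name is illustrative):

```python
# GPipe-style bubble fraction: with p pipeline stages and m micro-batches
# per mini-batch, (p - 1) of the (m + p - 1) pipeline slots are idle.
def bubble_fraction(p_stages: int, m_micro: int) -> float:
    return (p_stages - 1) / (m_micro + p_stages - 1)

# Naive pipeline (m = 1): most of the time is bubble for deep pipelines.
print(round(bubble_fraction(4, 1), 3))   # 0.75
# More micro-batches shrink the bubble.
print(round(bubble_fraction(4, 16), 3))  # 0.158
```

This is why pipeline parallelism is usually run with many more micro‑batches than stages: the bubble shrinks as m grows, at the cost of storing more in‑flight activations.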

1F1B Strategy

PipeDream introduces the 1F1B (One Forward, One Backward) schedule, which interleaves forward and backward passes to reduce peak activation memory. Megatron‑LM supports both the non‑interleaved and interleaved variants via forward_backward_pipelining_without_interleaving and forward_backward_pipelining_with_interleaving, respectively.

3.2 Tensor Parallelism

Basic Principle

Tensor parallelism splits weight matrices either row‑wise or column‑wise across GPUs. Row‑wise splits require corresponding data splits; column‑wise splits keep the data unchanged. After local computation, an All‑Gather (or All‑Reduce) merges partial results.
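Both split schemes are plain linear algebra and can be verified in a few lines of numpy (the two "GPUs" here are just array halves; all shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 6))
W = rng.standard_normal((6, 8))

# Column-wise: X is replicated, W split along its output dimension.
# Concatenating the partial outputs (an All-Gather) recovers X @ W.
W_cols = np.split(W, 2, axis=1)
col_out = np.concatenate([X @ w for w in W_cols], axis=1)
assert np.allclose(col_out, X @ W)

# Row-wise: W split along its input dimension, so X's columns must be
# split to match; summing the partial products (an All-Reduce) recovers
# the full result.
W_rows = np.split(W, 2, axis=0)
X_cols = np.split(X, 2, axis=1)
row_out = sum(x @ w for x, w in zip(X_cols, W_rows))
assert np.allclose(row_out, X @ W)
```

The choice between the two is driven by which collective (All‑Gather vs. All‑Reduce) and which data layout is cheaper at each point in the network.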

Splitting MLP

The MLP in a Transformer layer consists of two linear layers with a non‑linear activation between them. Row‑wise splitting of the first linear layer would require synchronization before the non‑linearity, so the first layer is split column‑wise, letting each GPU apply the activation locally. The second linear layer is then split row‑wise, followed by a single All‑Reduce to combine the partial results.

Splitting Self‑Attention

Multi‑head attention naturally fits tensor parallelism: the Q/K/V projections are split column‑wise so that each GPU holds a subset of the attention heads and computes their attention locally, while the output projection is split row‑wise. Together with the MLP, this results in four All‑Reduce operations per Transformer layer (two in the forward pass, two in the backward pass), with communication volume proportional to batch size and sequence length.

4. 3D Parallelism in Practice

4.1 Overview

Real‑world training often combines Data Parallelism, Tensor Parallelism, and Pipeline Parallelism (3D parallelism). An example from the BLOOM 176 B model trained on 384 A100 GPUs uses 8 DP groups, tensor‑parallel size = 4, and pipeline‑parallel size = 12.
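The three degrees multiply to the world size, and each GPU rank maps to a coordinate in the 3D grid. The sketch below checks the BLOOM sizing; the grouping order (tensor‑parallel ranks fastest‑varying) is a common convention but illustrative here, not BLOOM's exact rank layout:

```python
# 3D-parallel sizing: world size must equal dp * pp * tp.
dp, tp, pp = 8, 4, 12
world = dp * tp * pp
print(world)  # 384

def coords(rank: int) -> tuple:
    # (dp_index, pp_index, tp_index), with tp fastest-varying. This
    # keeps tensor-parallel peers on adjacent ranks, which typically
    # land on the same node where NVLink bandwidth is highest.
    return (rank // (tp * pp), (rank // tp) % pp, rank % tp)

assert coords(0) == (0, 0, 0)
assert coords(world - 1) == (dp - 1, pp - 1, tp - 1)
```

Placing the most communication‑hungry dimension (tensor parallelism) on the fastest interconnect, pipeline parallelism across nodes, and data parallelism at the outermost level is the usual rationale behind such layouts.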

4.2 360AI Platform Support

The 360AI platform streamlines large‑scale 3D parallel training by offering multi‑framework support (Megatron‑LM, DeepSpeed, etc.), rapid resource allocation via Kubernetes, comprehensive monitoring, fault detection and auto‑recovery, heterogeneous hardware compatibility, and network/system optimizations such as GPU‑Direct RDMA.

5. Conclusion

The article outlines the fundamentals of data and model parallelism, analyzes their communication costs, and describes how 3D parallelism is deployed in the 360 ecosystem. Future work includes better support for Mixture‑of‑Experts, longer context lengths, and tighter integration with emerging hardware and communication technologies.

Tags: AI, deep learning, distributed training, Megatron-LM, model parallelism, data parallelism
Written by 360 Smart Cloud

Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.
