Master Distributed Training for Massive AI Models on Multi‑GPU Clusters
This guide walks you through the fundamentals of distributed training for large AI models, explaining data, model, and pipeline parallelism, GPU communication primitives, and advanced techniques like Megatron 3‑D parallelism and DeepSpeed ZeRO stages, with practical examples and visual illustrations to help you design efficient multi‑GPU training pipelines.
Background
When you have two 8‑GPU A100 nodes, distributed training is required because a single GPU’s memory is insufficient for large models and high‑bandwidth memory is costly.
Definition
Distributed training spreads a model or its data across multiple GPUs to overcome memory limits and accelerate training.
Methods
Data Parallelism : Replicate the whole model on each GPU and split the input batch.
Model Parallelism : Split a model that does not fit on a single GPU. Includes pipeline parallelism (layer‑wise split) and tensor parallelism (tensor‑wise split).
DeepSpeed ZeRO : Zero‑redundancy optimization that further compresses memory usage, allowing training of even larger models.
Essential concepts
Mixed‑precision training workflow
The typical mixed‑precision loop consists of five steps:
Optimizer creates an FP32 copy of the weights and initializes first‑ and second‑order moments.
A buffer converts the FP32 weights to FP16 for forward and gradient computation.
Forward and backward passes store activations and gradients in FP16.
The optimizer uses FP16 gradients together with FP32 moments to update the FP32 weight copy.
Steps 2‑4 repeat until convergence.
Memory is mainly consumed by four components:
Model weights (FP32 + FP16)
Gradients (FP16)
Optimizer states (FP32)
Activations (FP16)
GPU communication primitives
Within a node GPUs communicate via PCIe or NVLink (NVIDIA only); across nodes they use InfiniBand. NVIDIA’s NCCL library implements the following primitives:
Broadcast : Sends data from one GPU to all others.
Scatter : Splits data on one GPU and distributes chunks to others.
Reduce : Aggregates data from multiple GPUs to a designated GPU (e.g., sum).
Gather : Collects distributed chunks from all GPUs into a single buffer.
AllReduce : Each GPU sends data to all others and receives the reduced result.
AllGather : Each GPU gathers data from all others, concatenating the results.
ReduceScatter : Combines Reduce and Scatter in one step.
Data parallelism
When a model fits comfortably on a single GPU, data parallelism accelerates training by replicating the model on each GPU and increasing the batch size. The classic parameter‑server view treats one GPU as a master that aggregates gradients, averages them, and broadcasts the updated parameters.
Communication volume for N GPUs can be expressed as:
Ring AllReduce replaces the centralized parameter‑server with a decentralized ring topology, balancing the communication load across all GPUs. PyTorch’s Distributed Data Parallel (DDP) implementation uses Ring AllReduce.
Model parallelism
For models with tens or hundreds of billions of parameters, a single GPU cannot hold the entire model. Model parallelism splits the model across GPUs, either layer‑wise (pipeline parallelism) or within layers (tensor parallelism).
Pipeline parallelism
Divides the model layer‑wise. The naïve pipeline processes each layer sequentially in the forward pass and then the backward pass, transmitting only activations between GPUs. This leads to two main inefficiencies:
Low GPU utilization due to “bubble” periods where GPUs idle while waiting for data (bubble ratio = (G‑1)/G for G GPUs).
No overlap between computation and communication.
Micro‑batch pipelines mitigate these issues:
F‑then‑B : Split the batch into multiple micro‑batches; earlier stages start processing the next micro‑batch while later stages are still computing, reducing bubbles.
1F1B : Interleave forward and backward passes on a per‑micro‑batch basis, releasing activations sooner and further decreasing bubble time while also saving memory.
Tensor parallelism
Splits the internal tensors of linear layers. For a linear layer Y = X·A, the weight matrix A can be column‑split (vertical) or row‑split (horizontal). The split choice determines where communication occurs.
Typical patterns:
Feed‑Forward Network (FFN) : Weight A is column‑split, weight B is row‑split, minimizing communication before the non‑linear activation.
Multi‑Head Attention (MHA) : Q, K, V matrices are column‑split; the output matrix O is row‑split, requiring two AllReduce steps.
Embedding layer : Split along the vocabulary dimension (V) because it is orders of magnitude larger than the hidden dimension (H). Input embeddings are column‑split, output embeddings are row‑split.
Hybrid (3‑D) parallelism
Real‑world training often combines data, pipeline, and tensor parallelism. Megatron‑LM exemplifies 3‑D parallelism (Tensor Parallel + Pipeline + Data Parallel). DeepSpeed’s ZeRO optimizations further reduce memory by partitioning optimizer states, gradients, and parameters.
Megatron 3‑D parallelism example
With two 8‑GPU machines (16 GPUs total) training a 4‑layer transformer:
Tensor Parallelism (TP) : Each linear layer is split into two halves.
Pipeline Parallelism (PP) : Layers are distributed across the two nodes.
Data Parallelism (DP) : The TP‑PP block is replicated on both nodes.
DeepSpeed ZeRO stages
Zero‑1 (optimizer‑state partitioning) : Optimizer states (e.g., Adam moments) are sharded across GPUs, reducing per‑GPU memory. Gradients are reduced with Ring AllReduce; updated weights are gathered with AllGather.
Zero‑2 (gradient partitioning) : Gradients are partitioned using ReduceScatter, so each GPU holds only a slice during the update step; AllGather reconstructs the full weight.
Zero‑3 (parameter partitioning) : Model parameters themselves are sharded. The training loop interleaves AllGather and ReduceScatter with forward/backward passes, dramatically lowering peak memory.
A typical ZeRO training pipeline:
Forward + backward to obtain gradients.
ReduceScatter (Zero‑2) or AllReduce (baseline) to average gradients.
Optimizer step using local optimizer states (Zero‑1) and the averaged gradients.
AllGather to synchronize updated parameters (Zero‑3).
Summary
The full stack of distributed training techniques—from basic data parallelism to sophisticated 3‑D parallelism and ZeRO memory optimizations—provides the tools needed to train trillion‑parameter models efficiently on multi‑GPU clusters.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
