GPU Memory Analysis and Distributed Training Strategies
This article explains how GPU memory is allocated during model fine‑tuning, describes collective communication primitives, and compares data parallel, model parallel, ZeRO, pipeline parallel, mixed‑precision, and checkpointing techniques for reducing memory consumption in large‑scale AI training.
GPU Memory Analysis
During fine‑tuning, GPU memory consumption consists of four parts: model parameters, parameter gradients, optimizer state, and intermediate activations.
For a 6‑billion‑parameter model stored in FP32 (4 bytes per value), the parameters occupy roughly 24 GB; this serves as the baseline. The gradients, one value per parameter, occupy the same amount.
The optimizer is typically Adam, which stores two additional buffers of the same size as the gradients (the first and second moment estimates, m and v), so the optimizer state requires twice the memory of the gradients.
Intermediate activations, shaped [Batch, SeqLen, Dim], must also be kept in GPU memory for back‑propagation.
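The static part of this budget (everything except activations) can be estimated with simple arithmetic. The sketch below assumes FP32 training with Adam, matching the breakdown above; the function name is illustrative.

```python
# Back-of-envelope estimate of static training memory (parameters,
# gradients, and Adam state), excluding activations.

def training_memory_gb(n_params, bytes_per_value=4):
    """Estimate static training memory in GiB, assuming FP32 + Adam."""
    params = n_params * bytes_per_value   # model weights
    grads = params                        # one gradient per weight
    optimizer = 2 * params                # Adam's m and v buffers
    total_bytes = params + grads + optimizer
    return total_bytes / 1024**3

# a 6B-parameter model needs ~89 GiB before any activation is stored
print(round(training_memory_gb(6e9), 1))
```

This is why a 6B model cannot even begin full fine‑tuning on a single 80 GB GPU without the memory‑reduction techniques discussed below.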
Collective Operations
To save memory, models or data can be distributed across multiple GPUs using various collective communication primitives.
Broadcast
The Broadcast operation copies an N‑element buffer from the root rank to all other ranks.
AllReduce, Reduce, ReduceScatter
AllReduce performs a reduction (e.g., sum, min, max) across devices and writes the result to every rank.
Reduce performs the same reduction but writes the result only to a designated root rank.
ReduceScatter performs the reduction and then scatters equal chunks of the result to each rank.
AllGather
AllGather collects N values from k ranks into an output of size k × N and distributes that result to all ranks.
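The four primitives above can be illustrated with a single-process simulation, using plain Python lists as per-rank buffers. This is a sketch of the semantics only, not an NCCL/MPI implementation; all function names are illustrative.

```python
# Single-process simulations of the collective operations, where
# `buffers[r]` is the local buffer held by rank r.

def broadcast(buffers, root=0):
    # every rank receives a copy of the root rank's buffer
    src = list(buffers[root])
    return [list(src) for _ in buffers]

def all_reduce(buffers, op=sum):
    # elementwise reduction; every rank gets the full result
    reduced = [op(vals) for vals in zip(*buffers)]
    return [list(reduced) for _ in buffers]

def reduce_scatter(buffers, op=sum):
    # same reduction, but each rank keeps only its chunk of the result
    reduced = [op(vals) for vals in zip(*buffers)]
    chunk = len(reduced) // len(buffers)
    return [reduced[r * chunk:(r + 1) * chunk] for r in range(len(buffers))]

def all_gather(shards):
    # concatenate per-rank shards; every rank gets the full result
    gathered = [x for shard in shards for x in shard]
    return [list(gathered) for _ in shards]

ranks = [[1, 2], [3, 4]]                 # 2 ranks, 2 elements each
print(broadcast([[9, 9], [0, 0]]))       # every rank gets root's [9, 9]
print(all_reduce(ranks))                 # every rank gets [4, 6]
print(reduce_scatter(ranks))             # rank 0 gets [4], rank 1 gets [6]
print(all_gather([[1], [2]]))            # every rank gets [1, 2]
```

Note that ReduceScatter followed by AllGather is equivalent to AllReduce; this identity is what ZeRO exploits later in the article.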
Data Parallelism
Data parallelism splits the training data across multiple nodes, each holding a replica of the model.
A parameter server stores the global model parameters.
The parameters are copied to each device, forming replicas that each process a subset of the data.
Each device computes gradients locally, and a Reduce operation combines them into the final gradient used to update the parameter server.
During back‑propagation, the Reduce for each layer can be issued as soon as that layer's gradient is ready, overlapping communication with the remaining computation.
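One training round of this flow can be sketched in a few lines. The toy "model" below is a scalar with a squared‑error loss, standing in for a real network; the function names are illustrative.

```python
# A toy parameter-server round: copy weights to devices, compute per-device
# gradients on data shards, Reduce (average) them at the server, update.

def local_gradient(w, data):
    # gradient of the mean squared error 0.5*(w - x)^2 over this shard
    return sum(w - x for x in data) / len(data)

def parameter_server_step(w, shards, lr=0.1):
    replicas = [w for _ in shards]                  # copy params to devices
    grads = [local_gradient(wr, s) for wr, s in zip(replicas, shards)]
    global_grad = sum(grads) / len(grads)           # Reduce at the server
    return w - lr * global_grad                     # server updates params

w = 0.0
for _ in range(100):
    w = parameter_server_step(w, shards=[[1.0, 2.0], [3.0, 4.0]])
print(round(w, 2))  # converges toward the data mean, 2.5
```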
Distributed Data Parallel (DDP)
DDP removes the parameter server; each replica holds a full copy of the model.
Each replica processes a portion of the data, performing forward and backward passes.
After computing gradients, an AllReduce synchronizes them across all replicas, and each replica updates its local parameters.
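The key invariant of DDP is that because every replica applies the same averaged gradient, the full copies never drift apart. A minimal single-process sketch, with a toy scalar model standing in for a real network:

```python
# One DDP step: each replica computes a local gradient on its data shard,
# a simulated AllReduce averages the gradients, and every replica applies
# the same update, keeping the full copies in sync.

def ddp_step(weights, shards, lr=0.1):
    # local backward: gradient of 0.5*(w - x)^2 averaged over the shard
    grads = [sum(w - x for x in shard) / len(shard)
             for w, shard in zip(weights, shards)]
    avg = sum(grads) / len(grads)           # AllReduce: every rank gets the mean
    return [w - lr * avg for w in weights]  # identical local updates

weights = [0.0, 0.0]                        # two replicas, same initialization
weights = ddp_step(weights, [[1.0], [3.0]])
print(weights)                              # replicas remain identical
```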
Model Parallelism
When models become too large for a single device, the parameter matrix can be partitioned into sub‑matrices that are processed on different GPUs, reducing per‑GPU memory usage.
The parameter matrix is split into several sub‑matrices that are distributed to different devices; each device computes on its own sub‑matrix, and the partial results are gathered to form the full output.
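For a linear layer y = x @ W, one common scheme is to split W column-wise: each device multiplies the input by its column slice, and concatenating the partial outputs (an AllGather in a real system) reproduces the full result. A sketch with nested lists, no GPU involved:

```python
# Column-wise model parallelism for y = x @ W, simulated with lists.

def matvec(x, W_cols):
    # multiply row vector x by a column slice of W
    return [sum(xi * col[i] for i, xi in enumerate(x)) for col in zip(*W_cols)]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],    # 2 x 4 weight matrix
     [5.0, 6.0, 7.0, 8.0]]

W0 = [row[:2] for row in W]   # columns 0-1 live on "device 0"
W1 = [row[2:] for row in W]   # columns 2-3 live on "device 1"

partial = matvec(x, W0) + matvec(x, W1)   # gather the partial results
full = matvec(x, W)                       # single-device reference
print(partial == full)  # True: the split computation matches
```

Each device stores only half of W, which is the per-GPU memory saving the section describes.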
ZeRO
ZeRO (Zero Redundancy Optimizer) reduces memory redundancy in distributed data parallel training.
ZeRO‑1 shards optimizer states across devices.
Each replica processes a portion of the input.
The forward and backward passes are performed independently on each replica.
After obtaining the full gradient, a ReduceScatter distributes gradient shards to the corresponding replicas.
Each replica updates the parameters corresponding to its gradient shard.
AllGather synchronizes the updated parameters across all replicas.
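The steps above can be sketched in a single process. Momentum buffers stand in for the sharded optimizer state (Adam's m and v would work the same way); the constants and function names are illustrative.

```python
# A ZeRO-1 step for 2 replicas and 4 parameters: ReduceScatter the
# gradients, update only the local parameter shard with the local
# optimizer state, then AllGather the updated parameters.

N_RANKS, SHARD = 2, 2
params = [1.0, 2.0, 3.0, 4.0]            # full copy on every replica
momentum = [[0.0, 0.0], [0.0, 0.0]]      # each rank holds state for its shard only

def zero1_step(params, momentum, grads_per_rank, lr=0.1, beta=0.9):
    # ReduceScatter: rank r receives the summed gradient for its shard
    summed = [sum(g) for g in zip(*grads_per_rank)]
    shards = [summed[r * SHARD:(r + 1) * SHARD] for r in range(N_RANKS)]
    new_shards = []
    for r in range(N_RANKS):
        # each rank updates only the parameters in its shard
        shard_params = params[r * SHARD:(r + 1) * SHARD]
        for i, g in enumerate(shards[r]):
            momentum[r][i] = beta * momentum[r][i] + g
            shard_params[i] -= lr * momentum[r][i]
        new_shards.append(shard_params)
    # AllGather: every replica receives the full updated parameter vector
    return [p for shard in new_shards for p in shard]

grads = [[0.1] * 4, [0.3] * 4]           # per-replica gradients from backward
params = zero1_step(params, momentum, grads)
print(params)
```

Because only one rank holds the optimizer state for any given parameter, the Adam buffers that dominated the memory analysis earlier shrink by a factor equal to the number of replicas.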
ZeRO‑2 extends ZeRO‑1 by performing ReduceScatter after each layer’s gradient computation, allowing each replica to keep only a subset of gradients.
ZeRO‑3 further shards the model parameters themselves, requiring AllGather to fetch remote parameters during forward passes and ReduceScatter to distribute gradients during backward passes.
Pipeline Parallelism
Pipeline parallelism partitions the model layer‑wise across GPUs; each GPU processes a consecutive set of layers, and subsequent layers must wait for the previous ones to finish before proceeding.
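The dependency between stages can be shown with a tiny simulation: each "GPU" is a function over the previous stage's activations, so with a single batch the devices run strictly one at a time (the stage functions here are toy stand-ins).

```python
# A simulation of layer-wise pipelining: stage i may start only after
# stage i-1 hands off its output.

def run_pipeline(x, stages):
    order = []
    for i, stage in enumerate(stages):
        x = stage(x)            # stage i waits for stage i-1's output
        order.append(i)
    return x, order

stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
out, order = run_pipeline(5, stages)
print(out, order)   # ((5+1)*2)-3 = 9, stages executed strictly in order
```

This strict serialization is the "pipeline bubble"; production systems reduce it by feeding many small micro-batches through the stages so that different GPUs work on different micro-batches concurrently.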
Mixed Precision Training
Using FP16 instead of FP32 speeds up computation and reduces memory usage, but the reduced dynamic range can cause underflow, especially when multiplying gradients by the learning rate.
Mixed‑precision training keeps an FP32 master copy of the parameters inside the optimizer while storing the model weights and gradients in FP16; the weight update is applied to the master copy at full precision, so tiny steps are not lost.
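The underflow problem and the master-copy fix can be illustrated without real half-precision hardware. The sketch below fakes FP16 by flushing magnitudes below roughly 6e-8 (about the smallest FP16 subnormal) to zero; the threshold and function names are illustrative.

```python
# "fp16" simulated by underflowing tiny magnitudes to zero.
FP16_TINY = 6e-8

def to_fp16(x):
    # crude stand-in for half precision: small values underflow to zero
    return 0.0 if abs(x) < FP16_TINY else x

lr = 1e-4
step = to_fp16(lr * 1e-3)       # 1e-7: survives in "fp16"
tiny_step = to_fp16(lr * 1e-4)  # 1e-8: underflows to zero in "fp16"

master = 1.0                     # full-precision master weight
master -= lr * 1e-4              # the same tiny update survives here
print(step, tiny_step, master)
```

A low-precision weight updated with `tiny_step` would simply stop learning, while the master copy keeps accumulating the small steps; the FP16 weights are refreshed from the master copy each iteration.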
Checkpointing
Since back‑propagation requires intermediate activations, checkpointing saves only a subset of hidden states as checkpoints and releases the rest. When needed, the missing activations are recomputed by re‑executing the forward pass from the nearest checkpoint.
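A sketch of this trade for a chain of layers: the forward pass keeps only every k-th activation, and an activation needed during the backward pass is recomputed from the nearest saved checkpoint (the layers here are toy stand-ins; names are illustrative).

```python
# Activation checkpointing: save every k-th activation in the forward
# pass, recompute the rest on demand from the nearest checkpoint.

def forward_with_checkpoints(x, layers, k=2):
    saved = {0: x}                      # always checkpoint the input
    for i, layer in enumerate(layers):
        x = layer(x)
        if (i + 1) % k == 0:
            saved[i + 1] = x            # keep only every k-th activation
    return x, saved

def activation_at(i, layers, saved):
    # recompute layer i's input starting from the nearest earlier checkpoint
    start = max(j for j in saved if j <= i)
    x = saved[start]
    for layer in layers[start:i]:
        x = layer(x)
    return x

layers = [lambda v: v + 1] * 4
out, saved = forward_with_checkpoints(0, layers, k=2)
print(out, sorted(saved))               # full output; checkpoints at 0, 2, 4
print(activation_at(3, layers, saved))  # recomputed input to layer 3
```

Memory for activations drops by roughly a factor of k, at the cost of one extra partial forward pass during back-propagation.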
Rare Earth Juejin Tech Community
Juejin, a tech community that helps developers grow.