GPU Memory Analysis and Distributed Training Strategies
This article explains how GPU memory is allocated during model fine‑tuning, describes collective communication primitives, and compares data parallel, model parallel, ZeRO, pipeline parallel, mixed‑precision, and checkpointing techniques for reducing memory consumption in large‑scale AI training.
GPU Memory Analysis
During fine‑tuning, GPU memory consumption consists of four parts: model parameters, parameter gradients, optimizer state, and intermediate activations.
For a 6‑billion‑parameter model stored in FP32 (4 bytes per value), the parameters occupy roughly 24 GB; this serves as the baseline. The gradients, one value per parameter, occupy the same amount.
The optimizer is typically Adam, which stores two additional buffers of the same size as the gradients (the first and second moment estimates, m and v), so the optimizer state requires twice the memory of the gradients.
Intermediate activations, shaped [Batch, SeqLen, Dim], must also be kept in GPU memory for back‑propagation.
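The static part of this budget (everything except activations) can be estimated with simple arithmetic. The sketch below assumes FP32 training with Adam, matching the breakdown above; the function name is illustrative.

```python
# Back-of-envelope estimate of static training memory (parameters,
# gradients, and Adam state), excluding activations.

def training_memory_gb(n_params, bytes_per_value=4):
    """Estimate static training memory in GiB, assuming FP32 + Adam."""
    params = n_params * bytes_per_value   # model weights
    grads = params                        # one gradient per weight
    optimizer = 2 * params                # Adam's m and v buffers
    total_bytes = params + grads + optimizer
    return total_bytes / 1024**3

# a 6B-parameter model needs ~89 GiB before any activation is stored
print(round(training_memory_gb(6e9), 1))
```

This is why a 6B model cannot even begin full fine‑tuning on a single 80 GB GPU without the memory‑reduction techniques discussed below.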
Collective Operations
To save memory, models or data can be distributed across multiple GPUs using various collective communication primitives.
Broadcast
The Broadcast operation copies an N‑element buffer from the root rank to all other ranks.
AllReduce, Reduce, ReduceScatter
AllReduce performs a reduction (e.g., sum, min, max) across devices and writes the result to every rank.
Reduce performs the same reduction but writes the result only to a designated root rank.
ReduceScatter performs the reduction and then scatters equal chunks of the result to each rank.
AllGather
AllGather collects N values from k ranks into an output of size k × N and distributes that result to all ranks.
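The four primitives above can be illustrated with a single-process simulation, using plain Python lists as per-rank buffers. This is a sketch of the semantics only, not an NCCL/MPI implementation; all function names are illustrative.

```python
# Single-process simulations of the collective operations, where
# `buffers[r]` is the local buffer held by rank r.

def broadcast(buffers, root=0):
    # every rank receives a copy of the root rank's buffer
    src = list(buffers[root])
    return [list(src) for _ in buffers]

def all_reduce(buffers, op=sum):
    # elementwise reduction; every rank gets the full result
    reduced = [op(vals) for vals in zip(*buffers)]
    return [list(reduced) for _ in buffers]

def reduce_scatter(buffers, op=sum):
    # same reduction, but each rank keeps only its chunk of the result
    reduced = [op(vals) for vals in zip(*buffers)]
    chunk = len(reduced) // len(buffers)
    return [reduced[r * chunk:(r + 1) * chunk] for r in range(len(buffers))]

def all_gather(shards):
    # concatenate per-rank shards; every rank gets the full result
    gathered = [x for shard in shards for x in shard]
    return [list(gathered) for _ in shards]

ranks = [[1, 2], [3, 4]]                 # 2 ranks, 2 elements each
print(broadcast([[9, 9], [0, 0]]))       # every rank gets root's [9, 9]
print(all_reduce(ranks))                 # every rank gets [4, 6]
print(reduce_scatter(ranks))             # rank 0 gets [4], rank 1 gets [6]
print(all_gather([[1], [2]]))            # every rank gets [1, 2]
```

Note that ReduceScatter followed by AllGather is equivalent to AllReduce; this identity is what ZeRO exploits later in the article.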
Data Parallelism
Data parallelism splits the training data across multiple nodes, each holding a replica of the model.
A parameter server stores the global model parameters.
The parameters are copied to each device, forming replicas that each process a subset of the data.
Each device computes gradients locally, and a Reduce operation combines them into the final gradient used to update the parameter server.
During back‑propagation, the Reduce for each layer can be issued as soon as that layer's gradient is ready, overlapping communication with the remaining computation.
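One training round of this flow can be sketched in a few lines. The toy "model" below is a scalar with a squared‑error loss, standing in for a real network; the function names are illustrative.

```python
# A toy parameter-server round: copy weights to devices, compute per-device
# gradients on data shards, Reduce (average) them at the server, update.

def local_gradient(w, data):
    # gradient of the mean squared error 0.5*(w - x)^2 over this shard
    return sum(w - x for x in data) / len(data)

def parameter_server_step(w, shards, lr=0.1):
    replicas = [w for _ in shards]                  # copy params to devices
    grads = [local_gradient(wr, s) for wr, s in zip(replicas, shards)]
    global_grad = sum(grads) / len(grads)           # Reduce at the server
    return w - lr * global_grad                     # server updates params

w = 0.0
for _ in range(100):
    w = parameter_server_step(w, shards=[[1.0, 2.0], [3.0, 4.0]])
print(round(w, 2))  # converges toward the data mean, 2.5
```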
Distributed Data Parallel (DDP)
DDP removes the parameter server; each replica holds a full copy of the model.
Each replica processes a portion of the data, performing forward and backward passes.
After computing gradients, an AllReduce synchronizes them across all replicas, and each replica updates its local parameters.
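The key invariant of DDP is that because every replica applies the same averaged gradient, the full copies never drift apart. A minimal single-process sketch, with a toy scalar model standing in for a real network:

```python
# One DDP step: each replica computes a local gradient on its data shard,
# a simulated AllReduce averages the gradients, and every replica applies
# the same update, keeping the full copies in sync.

def ddp_step(weights, shards, lr=0.1):
    # local backward: gradient of 0.5*(w - x)^2 averaged over the shard
    grads = [sum(w - x for x in shard) / len(shard)
             for w, shard in zip(weights, shards)]
    avg = sum(grads) / len(grads)           # AllReduce: every rank gets the mean
    return [w - lr * avg for w in weights]  # identical local updates

weights = [0.0, 0.0]                        # two replicas, same initialization
weights = ddp_step(weights, [[1.0], [3.0]])
print(weights)                              # replicas remain identical
```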
Model Parallelism
When models become too large for a single device, the parameter matrix can be partitioned into sub‑matrices that are processed on different GPUs, reducing per‑GPU memory usage.
The parameter matrix is split into several sub‑matrices that are distributed to different devices; each device computes on its own sub‑matrix, and the partial results are gathered to form the full output.
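For a linear layer y = x @ W, one common scheme is to split W column-wise: each device multiplies the input by its column slice, and concatenating the partial outputs (an AllGather in a real system) reproduces the full result. A sketch with nested lists, no GPU involved:

```python
# Column-wise model parallelism for y = x @ W, simulated with lists.

def matvec(x, W_cols):
    # multiply row vector x by a column slice of W
    return [sum(xi * col[i] for i, xi in enumerate(x)) for col in zip(*W_cols)]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],    # 2 x 4 weight matrix
     [5.0, 6.0, 7.0, 8.0]]

W0 = [row[:2] for row in W]   # columns 0-1 live on "device 0"
W1 = [row[2:] for row in W]   # columns 2-3 live on "device 1"

partial = matvec(x, W0) + matvec(x, W1)   # gather the partial results
full = matvec(x, W)                       # single-device reference
print(partial == full)  # True: the split computation matches
```

Each device stores only half of W, which is the per-GPU memory saving the section describes.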
ZeRO
ZeRO (Zero Redundancy Optimizer) reduces memory redundancy in distributed data parallel training.
ZeRO‑1 shards optimizer states across devices.
Each replica processes a portion of the input.
The forward and backward passes are performed independently on each replica.
After obtaining the full gradient, a ReduceScatter distributes gradient shards to the corresponding replicas.
Each replica updates the parameters corresponding to its gradient shard.
AllGather synchronizes the updated parameters across all replicas.
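The steps above can be sketched in a single process. Momentum buffers stand in for the sharded optimizer state (Adam's m and v would work the same way); the constants and function names are illustrative.

```python
# A ZeRO-1 step for 2 replicas and 4 parameters: ReduceScatter the
# gradients, update only the local parameter shard with the local
# optimizer state, then AllGather the updated parameters.

N_RANKS, SHARD = 2, 2
params = [1.0, 2.0, 3.0, 4.0]            # full copy on every replica
momentum = [[0.0, 0.0], [0.0, 0.0]]      # each rank holds state for its shard only

def zero1_step(params, momentum, grads_per_rank, lr=0.1, beta=0.9):
    # ReduceScatter: rank r receives the summed gradient for its shard
    summed = [sum(g) for g in zip(*grads_per_rank)]
    shards = [summed[r * SHARD:(r + 1) * SHARD] for r in range(N_RANKS)]
    new_shards = []
    for r in range(N_RANKS):
        # each rank updates only the parameters in its shard
        shard_params = params[r * SHARD:(r + 1) * SHARD]
        for i, g in enumerate(shards[r]):
            momentum[r][i] = beta * momentum[r][i] + g
            shard_params[i] -= lr * momentum[r][i]
        new_shards.append(shard_params)
    # AllGather: every replica receives the full updated parameter vector
    return [p for shard in new_shards for p in shard]

grads = [[0.1] * 4, [0.3] * 4]           # per-replica gradients from backward
params = zero1_step(params, momentum, grads)
print(params)
```

Because only one rank holds the optimizer state for any given parameter, the Adam buffers that dominated the memory analysis earlier shrink by a factor equal to the number of replicas.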
ZeRO‑2 extends ZeRO‑1 by performing ReduceScatter after each layer’s gradient computation, allowing each replica to keep only a subset of gradients.
ZeRO‑3 further shards the model parameters themselves, requiring AllGather to fetch remote parameters during forward passes and ReduceScatter to distribute gradients during backward passes.
Pipeline Parallelism
Pipeline parallelism partitions the model layer‑wise across GPUs; each GPU processes a consecutive set of layers, and subsequent layers must wait for the previous ones to finish before proceeding.
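The dependency between stages can be shown with a tiny simulation: each "GPU" is a function over the previous stage's activations, so with a single batch the devices run strictly one at a time (the stage functions here are toy stand-ins).

```python
# A simulation of layer-wise pipelining: stage i may start only after
# stage i-1 hands off its output.

def run_pipeline(x, stages):
    order = []
    for i, stage in enumerate(stages):
        x = stage(x)            # stage i waits for stage i-1's output
        order.append(i)
    return x, order

stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
out, order = run_pipeline(5, stages)
print(out, order)   # ((5+1)*2)-3 = 9, stages executed strictly in order
```

This strict serialization is the "pipeline bubble"; production systems reduce it by feeding many small micro-batches through the stages so that different GPUs work on different micro-batches concurrently.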
Mixed Precision Training
Using FP16 instead of FP32 speeds up computation and reduces memory usage, but the reduced dynamic range can cause underflow, especially when multiplying gradients by the learning rate.
Mixed‑precision training keeps an FP32 master copy of the parameters inside the optimizer while storing the model weights and gradients in FP16; the weight update is applied to the master copy at full precision, so tiny steps are not lost.
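The underflow problem and the master-copy fix can be illustrated without real half-precision hardware. The sketch below fakes FP16 by flushing magnitudes below roughly 6e-8 (about the smallest FP16 subnormal) to zero; the threshold and function names are illustrative.

```python
# "fp16" simulated by underflowing tiny magnitudes to zero.
FP16_TINY = 6e-8

def to_fp16(x):
    # crude stand-in for half precision: small values underflow to zero
    return 0.0 if abs(x) < FP16_TINY else x

lr = 1e-4
step = to_fp16(lr * 1e-3)       # 1e-7: survives in "fp16"
tiny_step = to_fp16(lr * 1e-4)  # 1e-8: underflows to zero in "fp16"

master = 1.0                     # full-precision master weight
master -= lr * 1e-4              # the same tiny update survives here
print(step, tiny_step, master)
```

A low-precision weight updated with `tiny_step` would simply stop learning, while the master copy keeps accumulating the small steps; the FP16 weights are refreshed from the master copy each iteration.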
Checkpointing
Since back‑propagation requires intermediate activations, checkpointing saves only a subset of hidden states as checkpoints and releases the rest. When needed, the missing activations are recomputed by re‑executing the forward pass from the nearest checkpoint.
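A sketch of this trade for a chain of layers: the forward pass keeps only every k-th activation, and an activation needed during the backward pass is recomputed from the nearest saved checkpoint (the layers here are toy stand-ins; names are illustrative).

```python
# Activation checkpointing: save every k-th activation in the forward
# pass, recompute the rest on demand from the nearest checkpoint.

def forward_with_checkpoints(x, layers, k=2):
    saved = {0: x}                      # always checkpoint the input
    for i, layer in enumerate(layers):
        x = layer(x)
        if (i + 1) % k == 0:
            saved[i + 1] = x            # keep only every k-th activation
    return x, saved

def activation_at(i, layers, saved):
    # recompute layer i's input starting from the nearest earlier checkpoint
    start = max(j for j in saved if j <= i)
    x = saved[start]
    for layer in layers[start:i]:
        x = layer(x)
    return x

layers = [lambda v: v + 1] * 4
out, saved = forward_with_checkpoints(0, layers, k=2)
print(out, sorted(saved))               # full output; checkpoints at 0, 2, 4
print(activation_at(3, layers, saved))  # recomputed input to layer 3
```

Memory for activations drops by roughly a factor of k, at the cost of one extra partial forward pass during back-propagation.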
Rare Earth Juejin Tech Community
Juejin, a tech community that helps developers grow.