How Sequence Parallelism Slashes Activation Memory in Megatron Training
This article provides a detailed technical walkthrough of sequence parallelism (SP) for Megatron models, covering tensor parallelism basics, precise activation memory calculations for MLP and attention layers, the SP implementation that splits activations across GPUs, and selective activation recomputation strategies that further reduce memory while preserving training speed.
1. Tensor Parallelism (TP) Overview
Tensor parallelism splits the weight matrices of the attention and MLP sub‑layers across multiple GPUs. In Megatron, each GPU holds a full copy of the input, computes its assigned slice of the matrix multiplication, and then synchronises outputs with an AllReduce so that every GPU has the complete result for the next block.
1.1 MLP Layer TP Details
The MLP consists of two linear projections, A (shape h × 4h) and B (shape 4h × h). TP splits A by columns and B by rows, so the GELU between them acts on each GPU's column slice with no communication. In the forward pass, f is an identity (every GPU already holds the full input) and g is an AllReduce that sums the per-GPU partial outputs into the full activation Z. In the backward pass the roles swap: g is an identity and f is an AllReduce that aggregates the input gradients.
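The column/row split can be checked numerically. The following is a minimal sketch in plain Python that simulates the ranks in a single process (no real collectives); the helper names `matmul`, `gelu`, `mlp_reference`, and `mlp_tensor_parallel` and the toy sizes are illustrative assumptions, not Megatron code.

```python
import math

def matmul(x, w):
    """Multiply two matrices stored as nested lists (x: n x k, w: k x m)."""
    return [[sum(xv * wv for xv, wv in zip(row, col)) for col in zip(*w)] for row in x]

def gelu(m):
    """Element-wise exact GELU, 0.5 * v * (1 + erf(v / sqrt(2)))."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]

def mlp_reference(x, a, b):
    """Unpartitioned MLP: Z = GELU(X A) B."""
    return matmul(gelu(matmul(x, a)), b)

def mlp_tensor_parallel(x, a, b, t):
    """Simulate t ranks, each holding a column slice of A and a row slice of B."""
    width = len(a[0]) // t
    partials = []
    for rank in range(t):
        lo, hi = rank * width, (rank + 1) * width
        a_slice = [row[lo:hi] for row in a]   # columns lo..hi of A
        b_slice = b[lo:hi]                    # matching rows of B
        partials.append(matmul(gelu(matmul(x, a_slice)), b_slice))
    # Stand-in for the AllReduce: element-wise sum of every rank's partial output.
    return [[sum(p[i][j] for p in partials) for j in range(len(b[0]))]
            for i in range(len(x))]

# Toy sizes (assumed for illustration): h = 2, so A is 2 x 8 and B is 8 x 2.
h, t = 2, 2
x = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0]]
a = [[0.1 * (i + 1) * (j + 1) * (-1) ** j for j in range(4 * h)] for i in range(h)]
b = [[0.05 * (i - j) for j in range(h)] for i in range(4 * h)]

z_ref = mlp_reference(x, a, b)
z_tp = mlp_tensor_parallel(x, a, b, t)
max_err = max(abs(u - v) for r_ref, r_tp in zip(z_ref, z_tp) for u, v in zip(r_ref, r_tp))
```

Because GELU is element-wise, applying it to a column slice of XA gives the same values as slicing GELU(XA), which is why only one sum at the end is needed; `max_err` should be at floating-point rounding level.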
1.2 Attention Layer TP Details
For the attention sub‑layer, the query, key, and value matrices (Q, K, V) are column‑split so that each GPU handles one or several heads, while the output projection B is row‑split. The forward and backward flows mirror those of the MLP, using AllReduce to share intermediate results.
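As a rough illustration of the head-wise split, the sketch below computes which heads each GPU would own and the matching column range of the Q/K/V weights (and row range of the output projection). The function name `head_partition` and the contiguous head-to-rank assignment are assumptions made for this example.

```python
def head_partition(num_heads, hidden, t):
    """Assign contiguous heads to each of t ranks and report the weight slices.

    Returns one dict per rank with the heads it owns, the column range of each
    of Wq/Wk/Wv it holds, and the row range of the output projection B.
    """
    if num_heads % t != 0 or hidden % num_heads != 0:
        raise ValueError("num_heads must divide evenly across ranks and hidden size")
    head_dim = hidden // num_heads
    heads_per_rank = num_heads // t
    layout = []
    for rank in range(t):
        heads = list(range(rank * heads_per_rank, (rank + 1) * heads_per_rank))
        span = (heads[0] * head_dim, (heads[-1] + 1) * head_dim)  # half-open range
        layout.append({"rank": rank, "heads": heads, "qkv_cols": span, "proj_rows": span})
    return layout

layout = head_partition(num_heads=8, hidden=64, t=4)
```

Each head's attention is independent of the others, so no communication is needed until the row-split output projection produces partial sums.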
2. Activation Memory Analysis
Activations dominate GPU memory because they must be stored for the backward pass. Recomputing them saves memory but adds extra forward work, which can hurt throughput. The goal is to fit the activations in GPU memory while recomputing as little as possible.
2.1 MLP Activation Size
Assuming fp16 (2 bytes per element) and variables b (batch size), s (sequence length), h (hidden size):
Input LayerNorm activation: 2 b s h bytes
A’s input activation: 2 b s h bytes
B’s input activation: 8 b s h bytes
GELU input activation: 8 b s h bytes
Dropout mask (1 byte per element): b s h bytes
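Total MLP activation size = 19 b s h bytes for the four MLP terms (2 + 8 + 8 + 1); counting the input LayerNorm's 2 b s h as well gives 21 b s h bytes.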
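The itemised terms above can be turned into a small calculator. This is a sketch only; the function name and the example shape (b = 1, s = 2048, h = 12288) are assumptions chosen for illustration.

```python
def mlp_activation_terms(b, s, h):
    """Per-term fp16 activation sizes in bytes, following the list above."""
    bsh = b * s * h
    return {
        "layernorm_input": 2 * bsh,
        "linear_a_input": 2 * bsh,
        "linear_b_input": 8 * bsh,   # 4h-wide tensor, 2 bytes per element
        "gelu_input": 8 * bsh,       # 4h-wide tensor, 2 bytes per element
        "dropout_mask": 1 * bsh,     # 1 byte per element
    }

terms = mlp_activation_terms(b=1, s=2048, h=12288)
mlp_only = sum(v for name, v in terms.items() if name != "layernorm_input")
with_layernorm = sum(terms.values())
mib = 1024 * 1024
mlp_only_mib = mlp_only / mib
with_layernorm_mib = with_layernorm / mib
```

For this shape, b s h is exactly 24 MiB, so the 19 b s h figure corresponds to 456 MiB per layer and the LayerNorm input adds another 48 MiB.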
2.2 Attention Activation Size
A similar breakdown yields:
Input LayerNorm: 2 b s h bytes
Input X: 2 b s h bytes
Q, K, V after linear projection: 6 b s h bytes
Softmax output and related intermediates (sizes omitted for brevity)
Dropout mask: b s h bytes
The sum gives the total attention activation size (the exact formula follows the same pattern as the MLP case).
2.3 Summary of Activation Sizes
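The MLP and attention sub-layers each carry the same trio of activations around their matrix multiplications: a LayerNorm input, the sub-layer input, and an output dropout mask. The activation memory for a single block is the sum of the two sub-layer totals.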
3. Megatron Sequence Parallelism (SP)
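SP extends TP by also splitting activations along the sequence dimension in the regions TP leaves replicated. This removes the redundant 5 b s h bytes per sub-layer (LayerNorm input 2 b s h, sub-layer input 2 b s h, dropout mask b s h) that TP stores in full on every GPU.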
3.1 Overall SP Design
Before SP, only the weight matrices are partitioned (TP). With SP, the inputs and outputs of the attention and MLP sub-layers are also partitioned along the sequence dimension, one seq_chunk per GPU. (The before/after layout diagrams are omitted here.)
3.2 MLP Layer with TP+SP
Steps:
The LayerNorm input X is split along the sequence dimension; each GPU stores and normalises only its seq_chunk.
An all-gather reconstructs the full-sequence input on every GPU before the TP computation.
TP proceeds as usual; each GPU ends up with a full-size partial sum of the output Z, not yet the final value.
Instead of an AllReduce, a reduce-scatter sums the partial results and hands each GPU only its seq_chunk of Z, avoiding duplication.
Dropout is applied locally to that chunk, so each GPU stores only its own share of the mask.
During the backward pass, an all-gather rebuilds the full-sequence tensors where they are needed and a reduce-scatter returns each GPU's gradient slice; this communication can be overlapped with computation.
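The net effect is that the 5 b s h bytes previously duplicated on every GPU are divided across the GPUs, so per-GPU activation memory falls below the TP baseline. Communication volume matches pure TP: across forward and backward, the sub-layer uses 2 all-gathers and 2 reduce-scatters in place of TP's 2 AllReduces, and a ring AllReduce is itself a reduce-scatter followed by an all-gather.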
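The steps above can be sketched as a single-process simulation. This is a toy under stated assumptions: the collectives are plain list operations named `all_gather` and `reduce_scatter` for readability (not calls into any communication library), LayerNorm and dropout are left out, and the sizes are tiny.

```python
import math

def matmul(x, w):
    """Multiply two matrices stored as nested lists (x: n x k, w: k x m)."""
    return [[sum(xv * wv for xv, wv in zip(row, col)) for col in zip(*w)] for row in x]

def gelu(m):
    """Element-wise exact GELU."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]

def all_gather(chunks):
    """Every rank receives the concatenation of all sequence chunks."""
    full = [row for chunk in chunks for row in chunk]
    return [full for _ in chunks]

def reduce_scatter(partials):
    """Sum the per-rank partial outputs, then give rank r only its sequence chunk."""
    t = len(partials)
    n_rows, n_cols = len(partials[0]), len(partials[0][0])
    summed = [[sum(p[i][j] for p in partials) for j in range(n_cols)] for i in range(n_rows)]
    rows = n_rows // t
    return [summed[r * rows:(r + 1) * rows] for r in range(t)]

def mlp_tp_sp(x_chunks, a, b):
    """MLP forward with TP (A by columns, B by rows) plus SP on inputs and outputs."""
    t = len(x_chunks)
    width = len(a[0]) // t
    gathered = all_gather(x_chunks)              # full sequence on every rank
    partials = []
    for rank in range(t):
        lo, hi = rank * width, (rank + 1) * width
        a_slice = [row[lo:hi] for row in a]
        b_slice = b[lo:hi]
        partials.append(matmul(gelu(matmul(gathered[rank], a_slice)), b_slice))
    return reduce_scatter(partials)              # one seq_chunk of Z per rank

# Toy sizes (assumed): s = 4 tokens, h = 2, t = 2 ranks.
h, t = 2, 2
x = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0], [0.1, 0.3]]
a = [[0.1 * (i + 1) * (j + 1) * (-1) ** j for j in range(4 * h)] for i in range(h)]
b = [[0.05 * (i - j) for j in range(h)] for i in range(4 * h)]

z_chunks = mlp_tp_sp([x[0:2], x[2:4]], a, b)
z_sp = [row for chunk in z_chunks for row in chunk]
z_ref = matmul(gelu(matmul(x, a)), b)
max_err = max(abs(u - v) for r1, r2 in zip(z_sp, z_ref) for u, v in zip(r1, r2))
```

Each rank ends with s/t rows of Z rather than the full tensor, while the concatenated result matches the unpartitioned MLP up to rounding.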
3.3 Attention Layer with TP+SP
The attention sub-layer follows the same pattern: once the sequence is split, its duplicated activations are likewise divided across the GPUs, and the communication volume again matches pure TP.
3.4 Quantitative Summary
Without any parallelism: activation size = ... (baseline)
Pure TP on t GPUs: activation size per GPU = (total - 10)/t + 10, in units of b s h bytes per block. The "10" is the part TP replicates on every GPU: the LayerNorm inputs, sub-layer inputs, and dropout masks of both sub-layers (2 × 5).
TP+SP on t GPUs: activation size per GPU = (total - 10)/t + 10/t = total/t, i.e., the redundant part is also divided.
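A short sketch of the two per-GPU formulas, working in units of b s h bytes per block. The function name and the example total of 40 are hypothetical placeholders, since the article does not state the baseline total.

```python
REPLICATED = 10.0  # part that pure TP keeps in full on every GPU, in units of b*s*h bytes

def per_gpu_activation(total, t, sequence_parallel):
    """Per-GPU activation size for one block, in units of b*s*h bytes.

    total: unpartitioned activation size of the block (same units).
    t: tensor-parallel group size.
    sequence_parallel: True for TP+SP, False for pure TP.
    """
    if total < REPLICATED or t < 1:
        raise ValueError("total must be at least 10 and t at least 1")
    partitioned = (total - REPLICATED) / t
    replicated = REPLICATED / t if sequence_parallel else REPLICATED
    return partitioned + replicated

total = 40.0  # hypothetical baseline, chosen only to make the arithmetic visible
tp_only = per_gpu_activation(total, t=8, sequence_parallel=False)
tp_sp = per_gpu_activation(total, t=8, sequence_parallel=True)
```

With these placeholder numbers, pure TP leaves 13.75 units per GPU while TP+SP leaves 5.0, and the gap widens as t grows because the replicated 10 units no longer set a floor.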
4. Selective Activation Recomputation
Even with TP+SP, some activations may still exceed memory limits. Selective activation recomputation discards only the activations that are large but cheap to recompute (e.g., the attention scores and softmax output) and keeps the rest in memory; the discarded ones are recomputed during the backward pass. This hybrid approach (TP + SP + selective recomputation) yields the best overall performance according to the source's experimental results (figure omitted here).
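To see why the attention-score tensors are the natural target, the sketch below estimates their share of a block's activations. It assumes the per-block accounting from the Megatron-LM activation recomputation paper, 34 b s h bytes plus 5 a s² b bytes for the score-related tensors (softmax output, softmax dropout mask, dropout output), where a is the number of attention heads. Neither that formula nor the GPT-3-like shape used here is derived in this article.

```python
def attention_score_bytes(a, s, b):
    """Score-related activations: softmax output (2) + dropout mask (1) + dropout output (2)
    bytes per element, over a*s*s*b elements. Coefficients are an assumption (see lead-in)."""
    return 5 * a * s * s * b

def fraction_in_scores(a, s, h, b=1, other_coeff=34):
    """Share of one block's activation bytes held by the score-related tensors.

    other_coeff is the assumed coefficient of b*s*h for everything else in the block.
    """
    scores = attention_score_bytes(a, s, b)
    rest = other_coeff * b * s * h
    return scores / (scores + rest)

# GPT-3-like shape, assumed for illustration: 96 heads, s = 2048, h = 12288.
share = fraction_in_scores(a=96, s=2048, h=12288)
```

Under these assumptions roughly 70% of a block's activation bytes sit in tensors that grow with s², while recomputing them needs only the attention score and softmax operations rather than the large weight matrix multiplications.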
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
