
How Sequence Parallelism Slashes Activation Memory in Megatron Training

This article provides a detailed technical walkthrough of sequence parallelism (SP) for Megatron models, covering tensor parallelism basics, precise activation memory calculations for MLP and attention layers, the SP implementation that splits activations across GPUs, and selective activation recomputation strategies that further reduce memory while preserving training speed.

Baobao Algorithm Notes

Oct 30, 2024


1. Tensor Parallelism (TP) Overview

Tensor parallelism splits the weight matrices of the attention and MLP sub‑layers across multiple GPUs. In Megatron, each GPU holds a full copy of the input, computes its assigned slice of the matrix multiplication, and then synchronises outputs with an AllReduce so that every GPU has the complete result for the next block.

1.1 MLP Layer TP Details

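The MLP consists of two linear projections, A (shape h × 4h) and B (shape 4h × h). TP applies a column split to A and a row split to B, so the GELU between them can be applied locally without communication. Megatron brackets the sub-layer with two conjugate operators: f is an identity in the forward pass and an AllReduce in the backward pass, while g is an AllReduce in the forward pass and an identity in the backward pass. In the forward pass, each GPU computes a partial result from its slices of A and B, and g sums the partial results across GPUs to obtain the full output Z. In the backward pass, gradients are computed locally and f aggregates the input gradients with an AllReduce.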
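The column/row split can be checked numerically. The following is a minimal single-process sketch, assuming NumPy and simulating the GPUs as list entries; the function names and the tanh GELU approximation are illustrative choices, not Megatron's code, and a plain sum stands in for the AllReduce.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption; any elementwise activation works here)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_reference(x, A, B):
    # Unsharded MLP: Z = GELU(X A) B
    return gelu(x @ A) @ B

def mlp_tensor_parallel(x, A, B, t):
    # Column-split A and row-split B across t simulated GPUs.
    A_shards = np.split(A, t, axis=1)  # each (h, 4h/t)
    B_shards = np.split(B, t, axis=0)  # each (4h/t, h)
    # Every rank sees the full input and produces a full-size partial sum of Z.
    partials = [gelu(x @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]
    # Summing the partial results stands in for the forward AllReduce.
    return sum(partials)

rng = np.random.default_rng(0)
s, h, t = 8, 16, 4  # illustrative sizes
x = rng.standard_normal((s, h))
A = rng.standard_normal((h, 4 * h))
B = rng.standard_normal((4 * h, h))

z_ref = mlp_reference(x, A, B)
z_tp = mlp_tensor_parallel(x, A, B, t)
print(np.allclose(z_ref, z_tp))  # True
```

Because GELU is elementwise, applying it to a column slice of XA gives the same values as slicing GELU(XA), which is why A is split by columns rather than by rows.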

1.2 Attention Layer TP Details

For the attention sub‑layer, the query, key, and value matrices (Q, K, V) are column‑split so that each GPU handles one or several heads, while the output projection B is row‑split. The forward and backward flows mirror those of the MLP, using AllReduce to share intermediate results.
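The same check works for attention. This is a simplified single-process sketch, assuming NumPy, no batch dimension, no masking or dropout, and a head count divisible by the number of simulated GPUs; the names Wq, Wk, Wv, Wo and the helper functions are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def heads_attention(x, Wq, Wk, Wv, n_heads):
    # Self-attention over n_heads heads; returns the concatenated head outputs.
    s = x.shape[0]
    d = Wq.shape[1] // n_heads  # per-head width
    q = (x @ Wq).reshape(s, n_heads, d).transpose(1, 0, 2)  # (heads, s, d)
    k = (x @ Wk).reshape(s, n_heads, d).transpose(1, 0, 2)
    v = (x @ Wv).reshape(s, n_heads, d).transpose(1, 0, 2)
    probs = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (heads, s, s)
    out = probs @ v                                         # (heads, s, d)
    return out.transpose(1, 0, 2).reshape(s, n_heads * d)

def attention_reference(x, Wq, Wk, Wv, Wo, a):
    return heads_attention(x, Wq, Wk, Wv, a) @ Wo

def attention_tensor_parallel(x, Wq, Wk, Wv, Wo, a, t):
    # Column-split Q, K, V so each simulated GPU owns a/t whole heads.
    q_shards = np.split(Wq, t, axis=1)
    k_shards = np.split(Wk, t, axis=1)
    v_shards = np.split(Wv, t, axis=1)
    # Row-split the output projection to match each rank's head outputs.
    o_shards = np.split(Wo, t, axis=0)
    partials = [
        heads_attention(x, q_i, k_i, v_i, a // t) @ o_i
        for q_i, k_i, v_i, o_i in zip(q_shards, k_shards, v_shards, o_shards)
    ]
    return sum(partials)  # stands in for the forward AllReduce

rng = np.random.default_rng(0)
s, h, a, t = 6, 16, 4, 2  # illustrative sizes: 4 heads on 2 simulated GPUs
x = rng.standard_normal((s, h))
Wq, Wk, Wv, Wo = (rng.standard_normal((h, h)) for _ in range(4))

out_ref = attention_reference(x, Wq, Wk, Wv, Wo, a)
out_tp = attention_tensor_parallel(x, Wq, Wk, Wv, Wo, a, t)
print(np.allclose(out_ref, out_tp))  # True
```

Splitting on head boundaries matters: the softmax is computed per head, so a rank that owns whole heads never needs scores from another rank.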

2. Activation Memory Analysis

Activations dominate GPU memory because they must be stored for the backward pass. Recomputing them saves memory but adds extra forward work, which can hurt throughput. The goal is to fit the activations in GPU memory without costly full recomputation.

2.1 MLP Activation Size

Assuming fp16 (2 bytes per element) and variables b (batch size), s (sequence length), h (hidden size):

Input LayerNorm activation: 2 b s h bytes

A’s input activation: 2 b s h bytes

B’s input activation: 8 b s h bytes

GELU input activation: 8 b s h bytes

Dropout mask (1 byte per element): b s h bytes

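Total MLP activation size = 19 b s h bytes for the four MLP-internal items (2 + 8 + 8 + 1); counting the 2 b s h LayerNorm input as well brings it to 21 b s h bytes.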
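The arithmetic for the items listed above can be sketched as a small helper. The function name and the example values of b, s, and h are illustrative assumptions, not figures from the article.

```python
def mlp_activation_bytes(b, s, h):
    # Per-item stored activations of one MLP sub-layer in fp16, in bytes.
    bsh = b * s * h
    return {
        "layernorm_input": 2 * bsh,  # fp16, 2 bytes per element
        "A_input": 2 * bsh,          # input of the h -> 4h projection
        "B_input": 8 * bsh,          # 4h-wide, 2 bytes per element
        "gelu_input": 8 * bsh,       # 4h-wide, 2 bytes per element
        "dropout_mask": 1 * bsh,     # 1 byte per element
    }

items = mlp_activation_bytes(b=1, s=2048, h=4096)  # illustrative sizes
with_layernorm = sum(items.values())
mlp_only = with_layernorm - items["layernorm_input"]

MiB = 1024 ** 2
print(mlp_only / MiB)        # 152.0  (19 b s h)
print(with_layernorm / MiB)  # 168.0  (21 b s h)
```

With these sizes b s h is exactly 8 MiB, so the coefficients 19 and 21 can be read directly off the output.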

2.2 Attention Activation Size

A similar breakdown for the attention sub-layer yields:

Input LayerNorm: 2 b s h bytes

Input X: 2 b s h bytes

Q, K, V after linear projection: 6 b s h bytes

Softmax output and related intermediates (sizes omitted for brevity)

Dropout mask: b s h bytes

The sum gives the total attention activation size (the exact formula follows the same pattern as the MLP case).

2.3 Summary of Activation Sizes

The MLP and attention sub-layers each store a LayerNorm input of the same size (2 b s h bytes); the activation memory for a single transformer block is the sum of the two sub-layer totals.

3. Megatron Sequence Parallelism (SP)

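SP extends TP by also splitting, along the sequence dimension, the activations that TP leaves replicated: the LayerNorm input, the sub-layer input, and the dropout mask. Together these cost 5 b s h bytes per sub-layer (10 b s h per block) on every GPU under pure TP.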

3.1 Overall SP Design

Before SP, only the weight matrices are partitioned (TP), and the inputs and outputs of the attention and MLP sub-layers are replicated on every GPU. After adding SP, those inputs and outputs are partitioned along the sequence dimension, one seq_chunk per GPU.

3.2 MLP Layer with TP+SP

Steps:

LayerNorm input X is split by sequence; each GPU stores only its seq_chunk and applies LayerNorm to it locally.

An all-gather along the sequence dimension reconstructs the full input before the TP computation.

TP proceeds as usual; each GPU now holds a full-size partial sum of the output Z.

Instead of an AllReduce, a reduce-scatter sums the partial results and leaves each GPU with only its seq_chunk of Z, avoiding duplication.

Dropout is applied locally on that seq_chunk, so the stored mask shrinks by the same factor.

During backward, the collectives are mirrored: an all-gather reconstructs the full output gradient where the forward pass used a reduce-scatter, and a reduce-scatter returns the input-gradient seq_chunks where the forward pass used an all-gather. Communication is overlapped with computation where possible.

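The net effect is that the 5 b s h bytes previously duplicated on every GPU are now divided across the GPUs, while the communication volume stays identical to pure TP. Per sub-layer, TP+SP uses 2 all-gathers and 2 reduce-scatters (forward plus backward) in place of TP's 2 AllReduces, and a ring AllReduce is itself a reduce-scatter followed by an all-gather.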
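The forward steps above can be sketched in a single process. This is a simplified illustration, assuming NumPy, simulated GPUs as list entries, and no LayerNorm or dropout; mlp_tp_sp and the other names are hypothetical and the collectives are emulated with concatenation and sums.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption for this sketch)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_tp_sp(x_chunks, A, B):
    # x_chunks[r] is the seq_chunk of the input held by simulated rank r.
    t = len(x_chunks)
    A_shards = np.split(A, t, axis=1)  # column split
    B_shards = np.split(B, t, axis=0)  # row split
    # All-gather along the sequence dimension: every rank rebuilds the full input.
    x_full = np.concatenate(x_chunks, axis=0)
    # TP compute: rank r produces a full-size partial sum of Z.
    partials = [gelu(x_full @ A_r) @ B_r for A_r, B_r in zip(A_shards, B_shards)]
    # Reduce-scatter: rank r receives the sum of everyone's r-th seq_chunk.
    scattered = [np.split(p, t, axis=0) for p in partials]
    return [sum(scattered[src][r] for src in range(t)) for r in range(t)]

rng = np.random.default_rng(0)
s, h, t = 8, 16, 4  # illustrative sizes
x = rng.standard_normal((s, h))
A = rng.standard_normal((h, 4 * h))
B = rng.standard_normal((4 * h, h))

z_chunks = mlp_tp_sp(np.split(x, t, axis=0), A, B)
z_ref = gelu(x @ A) @ B

print(z_chunks[0].shape)                              # (2, 16)
print(np.allclose(np.concatenate(z_chunks), z_ref))  # True
```

Each rank ends up holding s/t rows of Z rather than all s rows, which is where the memory saving on the sub-layer output and the dropout mask comes from.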

3.3 Attention Layer with TP+SP

The attention sub-layer follows the same pattern: after splitting along the sequence dimension, the per-GPU activation memory drops by the same amount, and the communication volume again matches pure TP.

3.4 Quantitative Summary

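All sizes below are per transformer block, in units of b s h bytes, with t the number of TP GPUs.

Without any parallelism: activation size = ... (baseline, written as total below)

Pure TP on t GPUs: activation size per GPU = (total - 10)/t + 10. The “10” is the part TP leaves replicated: for each of the two sub-layers, the LayerNorm input (2), the sub-layer input (2), and the dropout mask (1).

TP+SP on t GPUs: activation size per GPU = (total - 10)/t + 10/t = total/t, i.e., the previously replicated part is also divided.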
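The two per-GPU formulas can be compared with a short helper. The function name is hypothetical, and the value total = 50 is a purely illustrative placeholder, since the baseline total is not given here.

```python
def per_gpu_activation(total, t, replicated=10.0):
    # Sizes are in units of b*s*h bytes per transformer block.
    # `replicated` is the part that pure TP keeps in full on every GPU.
    tp = (total - replicated) / t + replicated
    tp_sp = (total - replicated) / t + replicated / t  # equals total / t
    return tp, tp_sp

tp, tp_sp = per_gpu_activation(total=50.0, t=8)  # illustrative numbers only
print(tp)     # 15.0
print(tp_sp)  # 6.25
```

As t grows, the pure TP figure approaches the replicated 10 and stops shrinking, while the TP+SP figure keeps falling as total/t.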

4. Selective Activation Recomputation

Even with TP+SP, activations may still exceed memory limits. Selective activation recomputation discards only the activations that are large but cheap to recompute (e.g., the attention scores after softmax), recomputes them during the backward pass, and keeps the rest in memory. This hybrid approach (TP + SP + selective recomputation) yields the best overall performance in the article's experimental results.


