
Parallelism Strategies for Large-Scale Model Training: Data, Tensor, Pipeline, Sequence, and Expert Parallelism

This article explains the memory limits of a single GPU and systematically introduces data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and expert parallelism, describing their communication costs, advantages, drawbacks, and practical implementation details for training large AI models.


1. Introduction

Because a single GPU has limited memory, training large models runs into problems along two dimensions: the data dimension (a single GPU can hold the model but not an entire batch, requiring data parallelism) and the model dimension (the model itself cannot fit on one GPU, requiring model parallelism such as TP/PP/EP/SP).

If the model is small enough to fit both the model and a batch on one GPU, no parallelism is needed. If the model fits but the batch does not, data parallelism (DP) alone is the most efficient. Only when the model cannot fit on a single GPU do we need additional parallel strategies.

DP, TP, PP, EP, and SP can be combined, but some dependencies exist: EP depends on DP, SP depends on TP.

2. Parallelism Schemes

2.1 Data Parallelism (DP)

DP replicates the same model on each GPU and splits the input data. After each GPU computes gradients on its shard, an all-reduce operation averages the gradients before the optimizer updates the model. For a ring all-reduce, the communication cost is roughly 2M × sizeof(dtype) bytes per GPU per step, where M is the number of model parameters.
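As a sanity check, the averaging step can be simulated on one machine: splitting a batch across equal-sized replicas and averaging the per-shard gradients (which is what the all-reduce computes) yields exactly the full-batch gradient. A minimal NumPy sketch with made-up shapes:

```python
import numpy as np

# Hypothetical sketch: 4 "GPUs" each hold a replica of the same linear model
# and an equal shard of the batch. Averaging per-shard gradients (the
# all-reduce) must equal the gradient of the full batch.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 1))            # replicated model weights
X = rng.normal(size=(16, 8))           # full batch of 16 samples
y = rng.normal(size=(16, 1))

def grad(Xs, ys):
    # gradient of mean squared error 0.5 * mean((Xs W - ys)^2) w.r.t. W
    return Xs.T @ (Xs @ W - ys) / len(Xs)

shards = np.split(np.arange(16), 4)    # each GPU gets 4 samples
local_grads = [grad(X[s], y[s]) for s in shards]
allreduced = np.mean(local_grads, axis=0)   # the all-reduce (average)

assert np.allclose(allreduced, grad(X, y))  # matches full-batch gradient
```

The equivalence holds only because every shard has the same size; real DP implementations pad or drop the remainder for the same reason.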

DP alone works best for small models. For larger ones (e.g., LLaMA-7B trained in bf16 with mixed-precision Adam), the combined memory for weights, gradients, fp32 master weights, and the two Adam moments — roughly 16 bytes per parameter, i.e., about 112 GB for 7B parameters — exceeds the 80 GB of a single H100, so TP/PP must be introduced.
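The arithmetic behind that claim is easy to reproduce. Under a common mixed-precision accounting (bf16 weights and gradients, fp32 master weights plus Adam's two moments; exact numbers vary by framework, and activations are not even counted here):

```python
# Back-of-envelope memory check (illustrative accounting; exact byte counts
# vary by implementation and exclude activations and workspace).
params = 7e9
bytes_per_param = 2 + 2 + 4 + 4 + 4    # bf16 weights, bf16 grads,
                                       # fp32 master weights, Adam m, Adam v
total_gb = params * bytes_per_param / 1e9
assert total_gb == 112.0               # well above one 80 GB H100
```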

2.2 Tensor Parallelism (TP)

TP splits the internal weight matrices of the model and performs block‑matrix multiplication. It enables a model to be spread across multiple GPUs but adds significant communication overhead.

Example for the embedding layer: each GPU stores a vocab/N × hidden slice of the embedding table; after the local lookup, an all-reduce over the (b, s, h) activations is required, with ring cost ≈ 2 b s h elements (b = batch, s = sequence length, h = hidden size).

Attention (MHA) and GQA are split along the head dimension, since heads compute independently; the output projection is split row-wise, requiring an additional all-reduce after the multiplication.
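Head independence is what makes this split communication-free inside attention itself: each GPU can compute a disjoint subset of heads and the concatenated results equal the single-GPU computation. A NumPy sketch with assumed toy shapes (heads-first layout, two simulated GPUs):

```python
import numpy as np

# Sketch: MHA heads are independent, so each "GPU" computes a disjoint
# subset of heads; concatenating the per-GPU outputs reproduces the
# single-GPU result exactly. Shapes here are illustrative only.
rng = np.random.default_rng(1)
s, n_heads, d_head = 6, 4, 8
Q = rng.normal(size=(n_heads, s, d_head))
K = rng.normal(size=(n_heads, s, d_head))
V = rng.normal(size=(n_heads, s, d_head))

def attn(q, k, v):
    # scaled dot-product attention, softmax over the key axis
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

full = attn(Q, K, V)                                # all heads on one GPU
tp = np.concatenate([attn(Q[:2], K[:2], V[:2]),     # GPU 0: heads 0-1
                     attn(Q[2:], K[2:], V[2:])])    # GPU 1: heads 2-3
assert np.allclose(full, tp)
```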

MLP: the first matrix (h × 4h) is column-split and the second (4h × h) is row-split, so the elementwise nonlinearity between them needs no communication; an all-reduce of ≈ 2 b s h elements follows the second matmul.
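The column-then-row split can be verified numerically: each GPU's column shard of W1 produces a slice of the hidden activation, the nonlinearity applies per element, and the matching row shard of W2 yields a partial output whose sum (the all-reduce) equals the unsplit computation. A NumPy sketch with ReLU standing in for the activation:

```python
import numpy as np

# Sketch of the Megatron-style MLP split across 2 "GPUs": W1 (h x 4h)
# column-split, W2 (4h x h) row-split. The elementwise nonlinearity acts
# per column, so no communication is needed until the final sum.
rng = np.random.default_rng(2)
h = 8
X = rng.normal(size=(4, h))                    # (tokens, hidden)
W1 = rng.normal(size=(h, 4 * h))
W2 = rng.normal(size=(4 * h, h))

full = np.maximum(X @ W1, 0) @ W2              # single-GPU reference (ReLU)

cols = np.split(W1, 2, axis=1)                 # column shards of W1
rows = np.split(W2, 2, axis=0)                 # matching row shards of W2
partials = [np.maximum(X @ c, 0) @ r for c, r in zip(cols, rows)]
assert np.allclose(sum(partials), full)        # summation == all-reduce
```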

Output layer (h × vocab): a naive column-split would require gathering logits of size b s vocab; the standard optimization reduces over the vocabulary shard locally (computing each GPU's contribution to the softmax normalizer) before the final all-reduce, shrinking traffic to roughly b s tp.
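The local-reduction trick can be checked numerically: with logits column-split over the vocabulary, each shard only needs to share its per-token max and sum of exponentials to reconstruct the exact softmax normalizer, so the exchanged messages scale with the token count rather than the vocabulary size. A NumPy sketch with toy sizes:

```python
import numpy as np

# Sketch: logits column-split over the vocab on 2 "GPUs". Instead of
# gathering the full (tokens x vocab) logits, each GPU shares only its
# per-token local max and local sum of exponentials, from which the exact
# softmax normalizer is reconstructed.
rng = np.random.default_rng(3)
tokens, vocab = 5, 12
logits = rng.normal(size=(tokens, vocab))
shards = np.split(logits, 2, axis=1)           # vocab shards per GPU

local_max = np.stack([s.max(axis=1) for s in shards])   # tiny messages
global_max = local_max.max(axis=0)                      # per-token max
local_sum = np.stack([np.exp(s - global_max[:, None]).sum(axis=1)
                      for s in shards])
normalizer = local_sum.sum(axis=0)                      # final all-reduce

reference = np.exp(logits - logits.max(axis=1, keepdims=True)).sum(axis=1)
assert np.allclose(normalizer, reference)
```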

Overall TP communication is O(b s h l) (l = number of layers) and is usually limited to a single node with NVLink because inter‑node bandwidth is insufficient.

2.3 Pipeline Parallelism (PP)

PP splits the model by layers (e.g., 80 layers across 8 GPUs, 10 layers per GPU). Communication is only the activations crossing each stage boundary — about b s h per boundary per micro-batch, far lower than TP — but PP introduces pipeline bubbles (idle time while stages wait on one another).

Micro-batching combined with the 1F1B schedule reduces peak activation memory while leaving the bubble fraction unchanged. Virtual pipeline (interleaved 1F1B) further reduces bubbles by assigning each GPU several smaller, non-contiguous stage chunks.
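The bubble cost can be quantified. For p pipeline stages and m micro-batches, assuming uniform stage times, the idle fraction of a GPipe-style or 1F1B schedule is (p − 1)/(m + p − 1), which is why raising the micro-batch count matters:

```python
# Bubble fraction of a GPipe/1F1B schedule with p stages and m micro-batches,
# assuming uniform per-stage compute times (an idealization).
def bubble_fraction(p, m):
    return (p - 1) / (m + p - 1)

assert bubble_fraction(8, 8) == 7 / 15             # m == p: ~47% idle
assert round(bubble_fraction(8, 64), 3) == 0.099   # more micro-batches help
```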

Zero-bubble pipeline (2024) splits the backward pass at finer granularity — computing input (activation) gradients separately from weight gradients — and removes the synchronization barrier before the optimizer step, achieving near-zero bubbles at the cost of higher memory.

2.4 Sequence Parallelism (SP)

SP, as implemented in Megatron‑LM, partitions the input sequence across GPUs while keeping the same parameter partitions as TP. It reduces activation memory without adding extra communication compared to TP because the combined communication pattern (reduce‑scatter + all‑gather) matches TP’s all‑reduce.
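This cost identity is easy to verify numerically: a reduce-scatter (each GPU keeps the fully reduced version of one shard) followed by an all-gather produces, on every GPU, exactly the tensor a single all-reduce would. A NumPy simulation with toy shards:

```python
import numpy as np

# Sketch of the SP/TP communication identity: reduce-scatter + all-gather
# over n_gpus shards reproduces a single all-reduce, at the same total cost.
rng = np.random.default_rng(4)
n_gpus = 4
acts = [rng.normal(size=12) for _ in range(n_gpus)]   # per-GPU partial sums

# reduce-scatter: GPU i ends up with the reduced i-th shard only
scattered = [np.sum([np.split(a, n_gpus)[i] for a in acts], axis=0)
             for i in range(n_gpus)]
# all-gather: every GPU reassembles the full reduced tensor
gathered = np.concatenate(scattered)

assert np.allclose(gathered, np.sum(acts, axis=0))    # == one all-reduce
```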

2.5 Expert Parallelism (EP)

EP addresses the inefficiency of Mixture‑of‑Experts (MoE) models where each token is routed to top‑k experts. Experts are distributed across DP groups, requiring all‑to‑all communication for token routing and additional synchronization during forward and backward passes.
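A toy simulation of top-k routing shows where the all-to-all volume comes from: with k experts chosen per token, exactly k × tokens items cross the network, and the per-expert counts are generally unbalanced (which is why real MoE systems add load-balancing losses or capacity limits). The router weights here are random placeholders:

```python
import numpy as np

# Sketch of top-k MoE routing: a router scores each token against every
# expert; the top-k experts per token receive it via all-to-all. Router
# logits are random stand-ins for a learned gating network.
rng = np.random.default_rng(5)
tokens, n_experts, k = 10, 4, 2
router_logits = rng.normal(size=(tokens, n_experts))

topk = np.argsort(router_logits, axis=1)[:, -k:]      # chosen experts/token
counts = np.bincount(topk.ravel(), minlength=n_experts)

assert counts.sum() == tokens * k                     # all-to-all volume
assert counts.shape == (n_experts,)                   # per-expert load
```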

References:

[1] Li Mu's video: https://www.bilibili.com/video/BV1nB4y1R7Yz/?spm_id_from=333.1387.search.video_card.click
[2] Zero-bubble pipeline: https://arxiv.org/pdf/2401.10241

Images illustrating TP, 1F1B, zero‑bubble pipeline, SP, and EP are included in the original article.

Author: 哈密瓜 (source: zhihu.com)

Tags: Large Language Models, tensor parallelism, pipeline parallelism, sequence parallelism, AI training, model parallelism, data parallelism, expert parallelism
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
