Token‑Level Pipeline Parallelism for Transformer‑based Language Models (TeraPipe)
The article introduces a token‑level pipeline parallelism strategy that splits the sequence‑length dimension of Transformer‑based language models, explains why this approach is feasible, presents a dynamic‑programming formulation for optimal slicing, discusses engineering challenges, and evaluates its performance on large GPT models.
Preface
Large models are constantly trending, and parallel training strategies beyond basic data parallelism—such as operator partitioning (Megatron‑LM, Mesh TensorFlow) and pipeline parallelism (GPipe, PipeDream, DAPPLE)—are gaining attention. These strategies are largely orthogonal and can be composed, so their parallelism degrees multiply, and they offer generality across model families.
However, over‑emphasizing generality can miss opportunities. Most large models are Transformer‑based language models, which share a fixed architecture of stacked identical layers. By sacrificing a bit of universality and exploiting the specific structure of Transformer‑LMs, specialized parallelism can deliver real practical value.
This article proposes a new pipeline parallelism mode for Transformer‑based LMs: a token‑level pipeline that partitions the sequence length dimension instead of the batch dimension.
Drawbacks of Batch‑Splitting Pipeline Parallelism
Synchronous pipeline parallelism introduces bubbles, and the bubble ratio determines the theoretical efficiency ceiling. Reducing bubbles usually requires more micro‑batches, which either inflates the global batch size (potentially hurting convergence) or, when memory limits force tiny micro‑batch sizes, leads to low GPU utilization.
Because larger models are typically trained with longer sequences, the article suggests cutting along the sequence‑length dimension instead, so that even a single long‑sequence sample can be pipelined, reducing the bubble proportion and improving efficiency.
Why Token‑Level Splitting Is Feasible
Transformer‑based LMs consist of Self‑Attention and Feed‑Forward Network layers. At any time step t , the Self‑Attention computation depends only on tokens up to t , and the Feed‑Forward computation depends only on the current token. This property allows different layers to process different time steps concurrently, enabling token‑level pipeline parallelism (illustrated in the article’s figures).
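This dependency structure can be verified with a minimal NumPy sketch (illustrative only, not the paper's implementation): computing attention slice by slice, where each slice's queries attend to all keys and values up to their own global positions, reproduces full causal self‑attention exactly—which is what makes it legal to pipeline different slices through different layers concurrently.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal self-attention over a full sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def slice_attention(q_slice, k_prefix, v_prefix, offset):
    """Attention for one sequence slice: queries start at global
    position `offset` and attend to the whole preceding context."""
    scores = q_slice @ k_prefix.T / np.sqrt(q_slice.shape[-1])
    pos = np.arange(k_prefix.shape[0])
    for i in range(q_slice.shape[0]):
        scores[i, pos > offset + i] = -np.inf  # no future tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v_prefix

rng = np.random.default_rng(0)
L, d = 8, 4
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))

full = causal_attention(q, k, v)
# Slice the sequence at position 5 and compute each piece separately.
out0 = slice_attention(q[:5], k[:5], v[:5], offset=0)
out1 = slice_attention(q[5:], k, v, offset=5)
sliced = np.vstack([out0, out1])
assert np.allclose(full, sliced)  # slice-wise result matches the full pass
```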
Challenges of Token‑Level Splitting
Uniformly splitting the sequence length does not yield uniform computational load because later time steps involve more work. Therefore, the goal is to achieve uniform compute per slice, not uniform length. Finding the optimal cut points requires a search, which the author solves using dynamic programming (DP), similar to the approach in the Alpa paper.
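A toy calculation shows how skewed the load gets (modelling a token's attention work as simply proportional to its context length, which is not the paper's exact cost model): cutting a 1024‑token sequence into four equal‑length slices gives the last slice roughly seven times the work of the first.

```python
# Attention work for token t scales with its context length (t + 1),
# so a slice covering tokens [s, e) costs roughly sum_{t=s}^{e-1} (t + 1).
def slice_cost(s, e):
    return sum(t + 1 for t in range(s, e))

L, n = 1024, 4
bounds = [i * L // n for i in range(n + 1)]
costs = [slice_cost(bounds[i], bounds[i + 1]) for i in range(n)]
print(costs)  # [32896, 98432, 163968, 229504] -- last slice ~7x the first
```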
Dynamic‑Programming Formulation
The model is first partitioned into Transformer cells (layers). The input sequence is divided into sub‑sequence slices, each with an associated forward time cost. The pipeline latency is expressed as a sum of compute times across slices and cells. The DP enumerates possible cut points to minimize total latency, ignoring communication costs (which empirically does not hurt the solution).
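The following sketch roughly mirrors that formulation under deliberately toy assumptions (hypothetical per‑slice cost = fixed overhead + attention work; pipeline latency approximated as the sum of slice costs plus K−1 times the slowest slice for K stages): enumerate candidate bounds on the slowest slice, and for each bound run a DP over cut points to find the cheapest feasible slicing. It is a small‑scale illustration, not the paper's algorithm verbatim.

```python
from functools import lru_cache

OVERHEAD = 8   # toy fixed per-slice launch/communication overhead
K = 4          # number of pipeline stages
L = 16         # sequence length (tokens)

def slice_cost(s, e):
    # toy cost: overhead + attention work, which grows with context length
    return OVERHEAD + sum(t + 1 for t in range(s, e))

def min_latency(L, K):
    """Enumerate candidate values for the slowest slice (tmax); for each,
    a DP over cut points finds the cheapest slicing whose slices all stay
    under tmax. Latency model: sum of slice costs + (K-1)*tmax."""
    candidates = sorted({slice_cost(s, e)
                         for s in range(L) for e in range(s + 1, L + 1)})
    best = float("inf")
    for tmax in candidates:
        @lru_cache(maxsize=None)
        def min_sum(s):
            # cheapest way to cover tokens [s, L) with slices <= tmax
            if s == L:
                return 0
            options = [slice_cost(s, e) + min_sum(e)
                       for e in range(s + 1, L + 1)
                       if slice_cost(s, e) <= tmax]
            return min(options) if options else float("inf")
        total = min_sum(0)
        if total != float("inf"):
            best = min(best, total + (K - 1) * tmax)
    return best

def uniform_latency(L, K, n):
    costs = [slice_cost(i * L // n, (i + 1) * L // n) for i in range(n)]
    return sum(costs) + (K - 1) * max(costs)

opt = min_latency(L, K)
uni = uniform_latency(L, K, 4)
assert opt < uni  # balanced, non-uniform slices beat uniform-length ones
```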
The original article illustrates the DP equations and complexity analysis with figures, which are not reproduced here.
Important Engineering Issues
Pruning: Similar to Alpa, enumerate cut points and stop when the current cost exceeds the best found.
Cost Estimation: The forward time of each slice in isolation is easy to measure; the additional cost contributed by the preceding context requires fitting a simple model on a subset of measurements.
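The article does not spell out the cost model, but as a hypothetical sketch, one could model the forward time of a slice of length l with preceding context c as t ≈ a·l + b·l·c + d and fit the coefficients by ordinary least squares on a handful of timed runs (synthetic timings stand in for real measurements here):

```python
import numpy as np

# Hypothetical cost model: t(l, c) ~ a*l + b*l*c + d, where l is the slice
# length and c the length of the preceding context. Coefficients are fitted
# by least squares on a few (synthetic) timing samples.
rng = np.random.default_rng(1)
samples = [(l, c) for l in (64, 128, 256) for c in (0, 256, 512, 1024)]
true_a, true_b, true_d = 2e-3, 1e-6, 0.5
times = np.array([true_a * l + true_b * l * c + true_d + rng.normal(0, 1e-3)
                  for l, c in samples])

X = np.array([[l, l * c, 1.0] for l, c in samples])  # design matrix
coef, *_ = np.linalg.lstsq(X, times, rcond=None)
a, b, d = coef

# Predict the cost of an unseen slice configuration:
pred = a * 512 + b * 512 * 2048 + d
```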
Fusion with Other Strategies
Operator partitioning and data parallelism are orthogonal and can be combined with token‑level pipeline parallelism. Micro‑batch‑based pipeline parallelism can also coexist because splitting the batch does not affect sequence‑length splitting; the article shows a diagram where different colors represent different micro‑batches.
Evaluation
E2E Performance: Latency per iteration is measured on GPT‑style models of 1B, 13B, 44B, and 175B parameters with a fixed sequence length of 2048. The DP algorithm searches for the best combination of batch and token dimension splits. Results show that TeraPipe’s advantage grows with model size because larger models have smaller batch sizes, making batch‑dimension pipelines less efficient while token‑dimension pipelines retain high utilization.
Non‑uniform vs Uniform Slicing: Uniform length slicing leads to non‑uniform compute, increasing bubble time. DP finds near‑optimal latency by balancing compute across slices.
Longer Sequence Length: Experiments confirm that as sequence length grows, TeraPipe’s speedup becomes more pronounced.
Some Reflections
1. Does token‑level splitting require special user code? Yes, because expressing the “no‑future” dependency in self‑attention typically needs masking; the runtime cannot infer it automatically.
2. How do batch and sequence‑length splits differ? Batch splits produce independent micro‑batches, while sequence splits create dependent sub‑graphs because each token depends on previous tokens.
3. Is a model‑specific strategy like TeraPipe valuable? Given the prevalence of Transformer‑based LMs, a specialized parallelism technique offers significant practical value.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.