Boost LLM Training on Massive Clusters with DP/TP Overlap and Context Parallelism
This article details a comprehensive set of techniques—including data‑ and tensor‑parallel overlap, context‑parallelism, activation rematerialization, and a performance‑driven cost model—that dramatically improve large‑language‑model training efficiency on ultra‑large GPU clusters while preserving model quality.
Background
Kuaishou’s AIP team presents a large‑scale LLM training solution that achieves significant throughput gains over state‑of‑the‑art open‑source methods without sacrificing model quality. The work was published at USENIX ATC ’24 and the code is open‑sourced on GitHub.
Distributed Training Challenges
Training massive models on huge clusters faces three main difficulties: (1) communication overhead dominates as model and cluster size grow, (2) resource contention between communication and computation reduces efficiency, and (3) memory pressure from activations limits feasible batch and sequence sizes.
Key Solutions
DP Overlap: Inspired by ZeRO‑3, parameter all‑gather and gradient reduce‑scatter operations are overlapped with the forward and backward passes, hiding data‑parallel communication behind computation instead of serializing with it.
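A minimal scheduling sketch of the idea (pure Python, not the paper's implementation): in a ZeRO‑3‑style forward pass, the all‑gather for layer i+1 is issued while layer i is still computing, so on real hardware the two run concurrently on separate streams.

```python
def schedule_forward(num_layers):
    """Event order for a prefetching forward pass: the all-gather for
    layer i+1 is in flight while layer i computes."""
    events = [("all_gather", 0)]          # layer 0's shards must arrive first
    for layer in range(num_layers):
        if layer + 1 < num_layers:
            events.append(("all_gather", layer + 1))  # prefetch next layer
        events.append(("compute", layer))             # overlaps the prefetch
    return events

# Every all-gather except the first is issued before the preceding compute,
# leaving room for the runtime to overlap them.
assert schedule_forward(3) == [
    ("all_gather", 0), ("all_gather", 1), ("compute", 0),
    ("all_gather", 2), ("compute", 1), ("compute", 2),
]
```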
TP Overlap: Tensor‑parallel communication (all‑gather, reduce‑scatter) is divided into parts that are independent of the adjacent computation and parts that depend on it. Independent parts are overlapped on separate CUDA streams, while dependent parts use a split‑pipeline overlap that interleaves chunked communication with chunked computation.
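A toy numeric sketch of the split‑pipeline idea (helper names are illustrative, not the paper's API): the all‑gather is broken into per‑rank chunks and the dependent computation is applied chunk by chunk, so on real hardware chunk k's math can run on one stream while chunk k+1's gather runs on another. The chunked version must produce the same result as gather‑then‑compute.

```python
def chunked_allgather_compute(shards, fn):
    """Gather shards chunk-by-chunk, applying fn to each as it 'arrives'."""
    out = []
    for chunk in shards:                     # one chunk per TP rank
        out.extend(fn(x) for x in chunk)     # overlappable per-chunk compute
    return out

def monolithic_allgather_compute(shards, fn):
    """Baseline: complete the full all-gather first, then compute."""
    gathered = [x for chunk in shards for x in chunk]
    return [fn(x) for x in gathered]

# Activations sharded over 3 TP ranks; splitting must not change the result.
shards = [[1, 2], [3, 4], [5, 6]]
double = lambda x: 2 * x
assert chunked_allgather_compute(shards, double) == \
       monolithic_allgather_compute(shards, double)
```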
Context Parallelism (CP): Activations are partitioned along the sequence dimension, which lowers communication volume substantially compared with tensor parallelism and enables efficient scaling to very long contexts.
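The sharding step can be sketched as follows (a simple even split; the helper name is illustrative): each CP rank holds seq_len / cp_size tokens of every activation, so per‑rank activation memory shrinks linearly with the CP degree.

```python
def shard_sequence(tokens, cp_size):
    """Partition a token sequence evenly across cp_size context-parallel ranks."""
    assert len(tokens) % cp_size == 0, "sequence length must divide cp_size"
    per_rank = len(tokens) // cp_size
    return [tokens[r * per_rank:(r + 1) * per_rank] for r in range(cp_size)]

# 32K context (as in the article's experiments) over 8 CP ranks:
# each rank holds 4096 tokens' worth of activations.
shards = shard_sequence(list(range(32_768)), 8)
assert len(shards) == 8 and all(len(s) == 4096 for s in shards)
```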
Activation Rematerialization: Two strategies are introduced: (a) pipeline‑aware offloading, which moves pipeline‑parallel activations to host memory with negligible latency, and (b) compute‑memory balanced checkpointing, which finds a Pareto‑optimal trade‑off between recomputation cost and memory savings, including a GEMM‑last recompute technique that cuts activation memory by roughly 40% at under 2% extra compute.
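A toy sketch of the balanced‑checkpointing selection (a greedy heuristic for illustration, not the paper's exact algorithm): given each tensor's memory footprint and recompute cost, drop the tensors that are cheapest to recompute per byte saved until the activation budget is met.

```python
def choose_checkpoints(tensors, memory_budget):
    """tensors: list of (name, mem_bytes, recompute_cost).
    Returns (names_to_recompute, resident_memory, extra_compute)."""
    total = sum(mem for _, mem, _ in tensors)
    dropped, extra_compute = [], 0.0
    # Cheapest recompute-per-byte-saved goes first (greedy Pareto heuristic).
    for name, mem, cost in sorted(tensors, key=lambda t: t[2] / t[1]):
        if total <= memory_budget:
            break
        dropped.append(name)        # discard in forward, recompute in backward
        total -= mem
        extra_compute += cost
    return dropped, total, extra_compute

# Two equal-size tensors; the one with 10x cheaper recompute is dropped first.
assert choose_checkpoints(
    [("attn_out", 100, 1.0), ("mlp_out", 100, 10.0)], memory_budget=150
) == (["attn_out"], 100, 1.0)
```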
Cost Model & Optimal Parallelism Search: A performance model that captures model‑level and cluster‑level characteristics (e.g., per‑layer forward/backward times, network bandwidth) is calibrated from a few measurements; it predicts MFU within 2‑5% error and narrows the search to the top‑5 parallel configurations for final benchmarking.
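An illustrative sketch of such a search (the timing formula and constants are assumptions, not the paper's model): predict a step time for each candidate (tp, pp, dp) triple from measured per‑layer costs, then keep only the best few configurations for real benchmarking.

```python
def predict_step_time(cfg, fwd_bwd_per_layer, layers, comm_per_layer,
                      overlap=0.8):
    """Assumed model: step time = per-stage compute + exposed TP communication."""
    tp, pp, dp = cfg
    compute = fwd_bwd_per_layer * layers / pp        # layers split over stages
    comm = comm_per_layer * layers * (tp - 1) / tp   # TP traffic grows with tp
    exposed = comm * (1.0 - overlap)                 # fraction we fail to hide
    return compute + exposed

def top_k_configs(configs, k=5, **model_args):
    """Rank candidate (tp, pp, dp) triples by predicted step time, best first."""
    return sorted(configs, key=lambda c: predict_step_time(c, **model_args))[:k]

# Enumerate triples that fit a 256-GPU cluster (illustrative cost numbers).
candidates = [(tp, pp, 256 // (tp * pp))
              for tp in (1, 2, 4, 8) for pp in (1, 2, 4, 8)
              if 256 % (tp * pp) == 0]
best = top_k_configs(candidates, k=5,
                     fwd_bwd_per_layer=10.0, layers=96, comm_per_layer=2.0)
assert len(best) == 5
```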
Results
On a 256‑GPU H800 cluster training a 175B model with a 32K context window, the proposed methods raise MFU from 32.3% to 42.7%, a >30% throughput improvement over the best open‑source baseline. The approach also supports arbitrary context lengths and scales efficiently with more GPUs.
Future Directions
Training trillion‑parameter Mixture‑of‑Experts models.
Extending context windows to the million‑token scale.
Developing efficient RLHF frameworks.
Exploring low‑precision training (FP8/FP6) on Hopper‑class hardware.
Integrating heterogeneous accelerators for more flexible training pipelines.
References
Paper: Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism (USENIX ATC ’24). Code: https://github.com/kwai/Megatron-Kwai