Boost LLM Training on Massive Clusters with DP/TP Overlap and Context Parallelism
This article details a comprehensive set of techniques—including data‑ and tensor‑parallel overlap, context‑parallelism, activation rematerialization, and a performance‑driven cost model—that dramatically improve large‑language‑model training efficiency on ultra‑large GPU clusters while preserving model quality.
Background
Kuaishou’s AIP team presents a large‑scale LLM training solution that achieves significant throughput gains over state‑of‑the‑art open‑source methods without sacrificing model quality. The work was published at USENIX ATC ’24 and the code is open‑sourced on GitHub.
Distributed Training Challenges
Training massive models on huge clusters faces three main difficulties: (1) communication overhead dominates as model and cluster size grow, (2) resource contention between communication and computation reduces efficiency, and (3) memory pressure from activations limits feasible batch and sequence sizes.
Key Solutions
DP Overlap: Inspired by ZeRO‑3, parameter all‑gather and gradient reduce‑scatter operations are overlapped with the forward and backward passes, hiding data‑parallel communication behind computation instead of serializing with it.
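A minimal scheduling sketch of the idea (pure Python, not the paper's implementation): in a ZeRO‑3‑style forward pass, the all‑gather for layer i+1 is issued while layer i is still computing, so on real hardware the two run concurrently on separate streams.

```python
def schedule_forward(num_layers):
    """Event order for a prefetching forward pass: the all-gather for
    layer i+1 is in flight while layer i computes."""
    events = [("all_gather", 0)]          # layer 0's shards must arrive first
    for layer in range(num_layers):
        if layer + 1 < num_layers:
            events.append(("all_gather", layer + 1))  # prefetch next layer
        events.append(("compute", layer))             # overlaps the prefetch
    return events

# Every all-gather except the first is issued before the preceding compute,
# leaving room for the runtime to overlap them.
assert schedule_forward(3) == [
    ("all_gather", 0), ("all_gather", 1), ("compute", 0),
    ("all_gather", 2), ("compute", 1), ("compute", 2),
]
```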
TP Overlap: Tensor‑parallel communication (all‑gather, reduce‑scatter) is divided into parts that are independent of the adjacent computation and parts that depend on it. Independent parts are overlapped on separate CUDA streams, while dependent parts use a split‑pipeline overlap that interleaves chunked communication with chunked computation.
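A toy numeric sketch of the split‑pipeline idea (helper names are illustrative, not the paper's API): the all‑gather is broken into per‑rank chunks and the dependent computation is applied chunk by chunk, so on real hardware chunk k's math can run on one stream while chunk k+1's gather runs on another. The chunked version must produce the same result as gather‑then‑compute.

```python
def chunked_allgather_compute(shards, fn):
    """Gather shards chunk-by-chunk, applying fn to each as it 'arrives'."""
    out = []
    for chunk in shards:                     # one chunk per TP rank
        out.extend(fn(x) for x in chunk)     # overlappable per-chunk compute
    return out

def monolithic_allgather_compute(shards, fn):
    """Baseline: complete the full all-gather first, then compute."""
    gathered = [x for chunk in shards for x in chunk]
    return [fn(x) for x in gathered]

# Activations sharded over 3 TP ranks; splitting must not change the result.
shards = [[1, 2], [3, 4], [5, 6]]
double = lambda x: 2 * x
assert chunked_allgather_compute(shards, double) == \
       monolithic_allgather_compute(shards, double)
```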
Context Parallelism (CP): Activations are partitioned along the sequence dimension, which lowers communication volume substantially compared with tensor parallelism and enables efficient scaling to very long contexts.
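The sharding step can be sketched as follows (a simple even split; the helper name is illustrative): each CP rank holds seq_len / cp_size tokens of every activation, so per‑rank activation memory shrinks linearly with the CP degree.

```python
def shard_sequence(tokens, cp_size):
    """Partition a token sequence evenly across cp_size context-parallel ranks."""
    assert len(tokens) % cp_size == 0, "sequence length must divide cp_size"
    per_rank = len(tokens) // cp_size
    return [tokens[r * per_rank:(r + 1) * per_rank] for r in range(cp_size)]

# 32K context (as in the article's experiments) over 8 CP ranks:
# each rank holds 4096 tokens' worth of activations.
shards = shard_sequence(list(range(32_768)), 8)
assert len(shards) == 8 and all(len(s) == 4096 for s in shards)
```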
Activation Rematerialization: Two strategies are introduced: (a) pipeline‑aware offloading, which moves pipeline‑parallel activations to host memory with negligible latency, and (b) compute‑memory balanced checkpointing, which finds a Pareto‑optimal trade‑off between recomputation cost and memory savings, including a GEMM‑last recompute technique that cuts activation memory by roughly 40% at under 2% extra compute.
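A toy sketch of the balanced‑checkpointing selection (a greedy heuristic for illustration, not the paper's exact algorithm): given each tensor's memory footprint and recompute cost, drop the tensors that are cheapest to recompute per byte saved until the activation budget is met.

```python
def choose_checkpoints(tensors, memory_budget):
    """tensors: list of (name, mem_bytes, recompute_cost).
    Returns (names_to_recompute, resident_memory, extra_compute)."""
    total = sum(mem for _, mem, _ in tensors)
    dropped, extra_compute = [], 0.0
    # Cheapest recompute-per-byte-saved goes first (greedy Pareto heuristic).
    for name, mem, cost in sorted(tensors, key=lambda t: t[2] / t[1]):
        if total <= memory_budget:
            break
        dropped.append(name)        # discard in forward, recompute in backward
        total -= mem
        extra_compute += cost
    return dropped, total, extra_compute

# Two equal-size tensors; the one with 10x cheaper recompute is dropped first.
assert choose_checkpoints(
    [("attn_out", 100, 1.0), ("mlp_out", 100, 10.0)], memory_budget=150
) == (["attn_out"], 100, 1.0)
```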
Cost Model & Optimal Parallelism Search: A performance model that captures model‑level and cluster‑level characteristics (e.g., per‑layer forward/backward times, network bandwidth) is calibrated from a few measurements; it predicts MFU within 2‑5% error and narrows the search to the top‑5 parallel configurations for final benchmarking.
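An illustrative sketch of such a search (the timing formula and constants are assumptions, not the paper's model): predict a step time for each candidate (tp, pp, dp) triple from measured per‑layer costs, then keep only the best few configurations for real benchmarking.

```python
def predict_step_time(cfg, fwd_bwd_per_layer, layers, comm_per_layer,
                      overlap=0.8):
    """Assumed model: step time = per-stage compute + exposed TP communication."""
    tp, pp, dp = cfg
    compute = fwd_bwd_per_layer * layers / pp        # layers split over stages
    comm = comm_per_layer * layers * (tp - 1) / tp   # TP traffic grows with tp
    exposed = comm * (1.0 - overlap)                 # fraction we fail to hide
    return compute + exposed

def top_k_configs(configs, k=5, **model_args):
    """Rank candidate (tp, pp, dp) triples by predicted step time, best first."""
    return sorted(configs, key=lambda c: predict_step_time(c, **model_args))[:k]

# Enumerate triples that fit a 256-GPU cluster (illustrative cost numbers).
candidates = [(tp, pp, 256 // (tp * pp))
              for tp in (1, 2, 4, 8) for pp in (1, 2, 4, 8)
              if 256 % (tp * pp) == 0]
best = top_k_configs(candidates, k=5,
                     fwd_bwd_per_layer=10.0, layers=96, comm_per_layer=2.0)
assert len(best) == 5
```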
Results
On a 256‑GPU H800 cluster training a 175B model with a 32K context window, the proposed methods raise MFU from 32.3% to 42.7%, a >30% throughput improvement over the best open‑source baseline. The approach also supports arbitrary context lengths and scales efficiently with more GPUs.
Future Directions
Training trillion‑parameter Mixture‑of‑Experts models.
Extending context windows to the million‑token scale.
Developing efficient RLHF frameworks.
Exploring low‑precision training (FP8/FP6) on Hopper‑class hardware.
Integrating heterogeneous accelerators for more flexible training pipelines.
References
Paper: Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism (USENIX ATC ’24). Code: https://github.com/kwai/Megatron-Kwai