Pipeline-Aware Offloading & Balanced Checkpointing Accelerate LLM Training
Researchers from Kwai’s large-model team present a training system that combines pipeline-parallel-aware activation offloading with a compute-memory balanced checkpointing strategy. The system accelerates large language model training without altering the loss curve, reaching 42.7% MFU on 256 NVIDIA H800 GPUs while reducing GPU memory usage.
Paper Overview
Training large language models (LLMs) requires massive compute and memory. Kwai’s large‑model team proposes pipeline‑parallel‑aware activation offloading and a compute‑memory balanced checkpointing strategy to accelerate training without loss.
Paper title: Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism
Paper URL: https://www.usenix.org/conference/atc24/presentation/yuan
Code URL: https://github.com/kwai/Megatron-Kwai
Core Contributions
Pipeline‑Parallel‑Aware Offloading: schedules activation offload/reload to use host memory with negligible overhead.
Compute‑Memory Balanced Checkpointing: finds a Pareto‑optimal trade‑off between activation size and recomputation cost.
Performance Modeling & Parallel Configuration Optimization: builds a cost model from a few basic measurements to select the optimal hybrid parallel configuration (tensor, context, pipeline, data).
Background
LLM training faces two main challenges: the activation-memory bottleneck and the difficulty of tuning the large space of hybrid-parallel configurations.
Method Overview
Activation Offloading in Pipeline Parallelism
Pipeline parallelism consists of warm‑up, steady, and cooldown stages. Activations generated in the warm‑up stage are stored in host memory until needed in the steady stage, reducing GPU memory pressure.
Offloading starts immediately after each micro-batch's forward pass, and reloading is scheduled so the activations are back in GPU memory by the time the corresponding backward pass begins. The scheme operates at pipeline-stage granularity, allowing compute and host-device transfers to overlap.
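To make the schedule concrete, here is a minimal sketch (not the paper's implementation) of where offload and reload events slot into a standard 1F1B pipeline schedule. The warm-up micro-batches are the ones whose activations are held longest, so they are the ones sent to host memory; all function names and the event representation are assumptions for illustration.

```python
def one_f_one_b_ops(num_microbatches, stage, num_stages):
    """Op order for one stage of a 1F1B pipeline schedule:
    warm-up forwards, alternating fwd/bwd, then cooldown backwards."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    ops = [("fwd", m) for m in range(warmup)]
    f, b = warmup, 0
    while b < num_microbatches:
        if f < num_microbatches:
            ops.append(("fwd", f))
            f += 1
        ops.append(("bwd", b))
        b += 1
    return ops

def build_events(num_microbatches, stage, num_stages):
    """Insert host offload/reload events around the warm-up micro-batches:
    offload right after the forward, reload just before the matching
    backward, so copies overlap with neighboring compute ops."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    events = []
    for kind, m in one_f_one_b_ops(num_microbatches, stage, num_stages):
        if kind == "bwd" and m < warmup:
            events.append(("reload", m))
        events.append((kind, m))
        if kind == "fwd" and m < warmup:
            events.append(("offload", m))
    return events
```

Note that the last pipeline stage has no warm-up micro-batches, so it issues no transfers at all; earlier stages, which hold activations longest, benefit most.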
Compute‑Memory Balanced Checkpointing
Traditional full checkpointing saves only each layer's input and recomputes everything else during the backward pass, which sharply reduces activation memory but adds roughly an extra forward pass of compute. The proposed method instead enumerates the recomputation cost of each activation, builds a Pareto frontier of memory versus compute, and selects checkpointing points that reduce activation size from 37.3 GB to 22.7 GB (a 39% saving) with only 1.5% extra compute.
Performance Modeling & Parallel Configuration Search
A few basic performance measurements (per-layer forward/backward/recompute times and cluster bandwidths) feed a cost model that predicts iteration time for any hybrid-parallel configuration. Enumerating all valid configurations and selecting the one with the minimal predicted iteration time yields the optimal setup in under 0.001 s.
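The search itself is cheap because the configuration space is small once divisibility constraints are applied. The sketch below assumes a deliberately toy cost model (made-up coefficients; the paper fits its model from measurements) just to show why exhaustive enumeration finishes in well under a millisecond.

```python
from itertools import product

WORLD = 256        # total GPUs
LAYERS = 80        # illustrative layer count for a 70B-class model
GLOBAL_BATCH = 256

def predict_iter_ms(t, c, p, d):
    """Toy cost model: per-microbatch compute shrinks with t*c,
    tensor-parallel collectives grow with t, and the pipeline bubble
    grows with p relative to the microbatch count. Coefficients are
    placeholders for fitted measurements."""
    micro = GLOBAL_BATCH // d          # microbatches per pipeline
    compute = 100.0 / (t * c)          # ms of compute per microbatch
    comm = 2.0 * (t - 1)               # ms of tensor-parallel comm
    bubble = (p - 1) / micro           # 1F1B bubble fraction
    return micro * (compute + comm) * (1 + bubble)

def best_config():
    """Enumerate (tensor, context, pipeline, data) degrees whose product
    is WORLD and that satisfy divisibility, keeping the cheapest."""
    best = None
    for t, c, p in product([1, 2, 4, 8], [1, 2, 4], [1, 2, 4, 8, 16]):
        if WORLD % (t * c * p):
            continue
        d = WORLD // (t * c * p)
        if GLOBAL_BATCH % d or LAYERS % p:
            continue
        cost = predict_iter_ms(t, c, p, d)
        if best is None or cost < best[0]:
            best = (cost, (t, c, p, d))
    return best
```

Even with every candidate evaluated, only a few hundred configurations exist for 256 GPUs, which is consistent with the sub-millisecond search time reported above.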
Experimental Setup
Hardware: 32 nodes, each with 8 NVIDIA H800 GPUs and 1 TB host memory; NVLink intra‑node, 100 Gbps inter‑node.
Software: the baseline is Megatron‑LM (2024‑01‑01 snapshot) with improvements to context parallelism and RoPE; the proposed system adds the offloading and checkpointing techniques on top.
Models: Llama‑65B, Llama‑2‑70B (GQA), Llama‑175B with context lengths 4 k–128 k, global batch size 256.
Results
On 256 H800 GPUs with a 32 k context window, MFU increased from 32.3 % to 42.7 %.
Performance modeling accuracy stays within 2 % error across various parallel parameters and checkpointing methods.
End‑to‑end comparisons show the proposed system outperforms the latest Megatron‑LM while preserving loss curves, confirming compatibility with GQA and with 4‑D parallelism (tensor, context, pipeline, data).
Scaling experiments demonstrate the model‑based optimizer adapts to cluster size changes, achieving higher throughput than manual DP scaling.
Conclusion
The paper introduces two activation‑reconstruction techniques—pipeline‑parallel‑aware offloading and compute‑memory balanced checkpointing—and an optimal parallel configuration solver based on a lightweight performance model, enabling efficient, scalable LLM training with open‑source code.
Code and Docker images are publicly available on GitHub to facilitate reproducibility.
Author: Kuaishou Large Model (official Kuaishou account)