Pipeline-Aware Offloading & Balanced Checkpointing Accelerate LLM Training
Researchers from Kwai’s large-model team present a training system that combines pipeline-parallel-aware activation offloading with a compute-memory balanced checkpointing strategy. The system accelerates large language model training without altering the loss curve, reaching 42.7% MFU on 256 NVIDIA H800 GPUs while reducing GPU memory usage.
Paper Overview
Training large language models (LLMs) requires massive compute and memory. Kwai’s large‑model team proposes pipeline‑parallel‑aware activation offloading and a compute‑memory balanced checkpointing strategy to accelerate training without loss.
Paper title: Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism
Paper URL: https://www.usenix.org/conference/atc24/presentation/yuan
Code URL: https://github.com/kwai/Megatron-Kwai
Core Contributions
Pipeline‑Parallel‑Aware Offloading: schedules activation offload/reload to use host memory with negligible overhead.
Compute‑Memory Balanced Checkpointing: finds a Pareto‑optimal trade‑off between activation size and recomputation cost.
Performance Modeling & Parallel Configuration Optimization: builds a cost model from a few basic measurements to select the optimal hybrid parallel configuration (tensor, context, pipeline, data).
Background
LLM training faces two main challenges: the activation-memory bottleneck and the difficulty of tuning the large space of hybrid-parallel configurations.
Method Overview
Activation Offloading in Pipeline Parallelism
Pipeline parallelism consists of warm‑up, steady, and cooldown stages. Activations generated in the warm‑up stage are stored in host memory until needed in the steady stage, reducing GPU memory pressure.
Offloading starts immediately after each micro-batch's forward pass, and reloading is scheduled so the activations are back in GPU memory by the time the corresponding backward pass begins. The scheme operates at pipeline-stage granularity, allowing compute and host-device transfers to overlap.
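To make the schedule concrete, here is a minimal sketch (not the paper's implementation) of where offload and reload events slot into a standard 1F1B pipeline schedule. The warm-up micro-batches are the ones whose activations are held longest, so they are the ones sent to host memory; all function names and the event representation are assumptions for illustration.

```python
def one_f_one_b_ops(num_microbatches, stage, num_stages):
    """Op order for one stage of a 1F1B pipeline schedule:
    warm-up forwards, alternating fwd/bwd, then cooldown backwards."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    ops = [("fwd", m) for m in range(warmup)]
    f, b = warmup, 0
    while b < num_microbatches:
        if f < num_microbatches:
            ops.append(("fwd", f))
            f += 1
        ops.append(("bwd", b))
        b += 1
    return ops

def build_events(num_microbatches, stage, num_stages):
    """Insert host offload/reload events around the warm-up micro-batches:
    offload right after the forward, reload just before the matching
    backward, so copies overlap with neighboring compute ops."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    events = []
    for kind, m in one_f_one_b_ops(num_microbatches, stage, num_stages):
        if kind == "bwd" and m < warmup:
            events.append(("reload", m))
        events.append((kind, m))
        if kind == "fwd" and m < warmup:
            events.append(("offload", m))
    return events
```

Note that the last pipeline stage has no warm-up micro-batches, so it issues no transfers at all; earlier stages, which hold activations longest, benefit most.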
Compute‑Memory Balanced Checkpointing
Traditional full checkpointing saves only each layer's input and recomputes everything else during the backward pass, which sharply reduces activation memory but adds roughly an extra forward pass of compute. The proposed method instead enumerates the recomputation cost of each activation, builds a Pareto frontier of memory versus compute, and selects checkpointing points that reduce activation size from 37.3 GB to 22.7 GB (a 39% saving) with only 1.5% extra compute.
Performance Modeling & Parallel Configuration Search
A few basic performance measurements (per-layer forward/backward/recompute times and cluster bandwidths) feed a cost model that predicts iteration time for any hybrid-parallel configuration. Enumerating all valid configurations and selecting the one with the minimal predicted iteration time yields the optimal setup in under 0.001 s.
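The search itself is cheap because the configuration space is small once divisibility constraints are applied. The sketch below assumes a deliberately toy cost model (made-up coefficients; the paper fits its model from measurements) just to show why exhaustive enumeration finishes in well under a millisecond.

```python
from itertools import product

WORLD = 256        # total GPUs
LAYERS = 80        # illustrative layer count for a 70B-class model
GLOBAL_BATCH = 256

def predict_iter_ms(t, c, p, d):
    """Toy cost model: per-microbatch compute shrinks with t*c,
    tensor-parallel collectives grow with t, and the pipeline bubble
    grows with p relative to the microbatch count. Coefficients are
    placeholders for fitted measurements."""
    micro = GLOBAL_BATCH // d          # microbatches per pipeline
    compute = 100.0 / (t * c)          # ms of compute per microbatch
    comm = 2.0 * (t - 1)               # ms of tensor-parallel comm
    bubble = (p - 1) / micro           # 1F1B bubble fraction
    return micro * (compute + comm) * (1 + bubble)

def best_config():
    """Enumerate (tensor, context, pipeline, data) degrees whose product
    is WORLD and that satisfy divisibility, keeping the cheapest."""
    best = None
    for t, c, p in product([1, 2, 4, 8], [1, 2, 4], [1, 2, 4, 8, 16]):
        if WORLD % (t * c * p):
            continue
        d = WORLD // (t * c * p)
        if GLOBAL_BATCH % d or LAYERS % p:
            continue
        cost = predict_iter_ms(t, c, p, d)
        if best is None or cost < best[0]:
            best = (cost, (t, c, p, d))
    return best
```

Even with every candidate evaluated, only a few hundred configurations exist for 256 GPUs, which is consistent with the sub-millisecond search time reported above.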
Experimental Setup
Hardware: 32 nodes, each with 8 NVIDIA H800 GPUs and 1 TB host memory; NVLink intra‑node, 100 Gbps inter‑node.
Software: the baseline is Megatron‑LM (2024‑01‑01 snapshot) with improvements to context parallelism and RoPE; the proposed system adds the offloading and checkpointing techniques on top.
Models: Llama‑65B, Llama‑2‑70B (GQA), Llama‑175B with context lengths 4 k–128 k, global batch size 256.
Results
On 256 H800 GPUs with a 32 k context window, MFU increased from 32.3 % to 42.7 %.
Performance modeling accuracy stays within 2 % error across various parallel parameters and checkpointing methods.
End‑to‑end comparisons show the proposed system outperforms the latest Megatron‑LM while preserving loss curves, confirming compatibility with GQA and with 4‑D parallelism (tensor, context, pipeline, data).
Scaling experiments demonstrate the model‑based optimizer adapts to cluster size changes, achieving higher throughput than manual DP scaling.
Conclusion
The paper introduces two activation‑reconstruction techniques—pipeline‑parallel‑aware offloading and compute‑memory balanced checkpointing—and an optimal parallel configuration solver based on a lightweight performance model, enabling efficient, scalable LLM training with open‑source code.
Code and Docker images are publicly available on GitHub to facilitate reproducibility.
Author: Kuaishou Large Model (official Kuaishou account)