Best Practices for Training Large Language Models on Ultra‑Large Scale Clusters
This article summarizes the challenges of distributed training for massive language models and presents a suite of solutions—including DP/TP/PP overlap, context parallelism, efficient recomputation, and a performance‑aware cost model—that together boost training throughput by over 30% on large GPU clusters.
The Kuaishou AIP team distilled a set of best‑practice techniques for training large language models (LLMs) on ultra‑large GPU clusters, originally presented at QCon and published in USENIX ATC ’24. The work focuses on maintaining model performance while significantly improving training efficiency.
Key challenges include the prohibitive memory footprint of activations, the communication overhead of data, tensor, and pipeline parallelism, and the difficulty of tuning hybrid parallel configurations.
Solutions are organized into three pillars:
- Communication-compute overlap for data parallelism (DP) and tensor parallelism (TP), using ZeRO-3-inspired all-gather/reduce-scatter scheduling to hide collective latency behind computation.
- Context parallelism (CP), which partitions the sequence dimension to reduce activation memory and communication volume, combined with grouped-query attention to further shrink KV activations.
- Memory-efficient activation strategies such as GEMM-last recomputation, selective checkpointing, and pipeline-aware offloading, which trade a small amount of extra compute for large memory savings.

Additional engineering optimizations address SM resource contention, channel-level communication tuning, and bucketed all-gathers that mitigate network congestion.
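The core idea of context parallelism can be sketched in a few lines: the sequence axis of each activation tensor is split evenly across CP ranks, so per-rank activation memory shrinks by the CP degree. The function names and shapes below are illustrative assumptions, not Kuaishou's actual implementation.

```python
# Hypothetical sketch of context parallelism (CP): the sequence dimension
# is split evenly across CP ranks, so each rank holds activations for only
# seq_len / cp_size tokens. Names and shapes are illustrative only.

def shard_sequence(tokens, cp_size):
    """Split a length-seq_len token sequence into cp_size contiguous shards."""
    seq_len = len(tokens)
    assert seq_len % cp_size == 0, "sequence must divide evenly across CP ranks"
    chunk = seq_len // cp_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(cp_size)]

def per_rank_activation_bytes(seq_len, hidden, bytes_per_elem, cp_size):
    """Activation memory one rank holds for a [seq_len, hidden] tensor under CP."""
    return seq_len * hidden * bytes_per_elem // cp_size

# Example: a 32K-token sequence sharded over 8 CP ranks.
shards = shard_sequence(list(range(32768)), cp_size=8)
assert len(shards) == 8 and len(shards[0]) == 4096

# Per-rank activation memory drops 8x versus the unsharded case.
full = per_rank_activation_bytes(32768, 12288, 2, cp_size=1)
sharded = per_rank_activation_bytes(32768, 12288, 2, cp_size=8)
assert full == 8 * sharded
```

In a real system the attention computation must also exchange keys and values (or partial attention results) across CP ranks; the sketch only shows the memory-partitioning side of the technique.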
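Bucketing is a standard way to mitigate the congestion mentioned above: rather than launching one small collective per parameter, parameters are packed into fixed-size buckets and one all-gather is issued per bucket. The following is a minimal sketch of such a bucketing policy; the size threshold and greedy packing rule are assumptions for illustration.

```python
# Hypothetical sketch of bucketed all-gather scheduling: greedily pack
# parameter tensors (by byte size) into buckets no larger than
# bucket_bytes, then issue one collective per bucket instead of one per
# parameter. The packing rule here is an illustrative assumption.

def bucket_params(param_sizes, bucket_bytes):
    """Group parameter indices into buckets of at most bucket_bytes each.

    A single parameter larger than bucket_bytes still gets its own bucket.
    """
    buckets, current, current_bytes = [], [], 0
    for idx, size in enumerate(param_sizes):
        if current and current_bytes + size > bucket_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets

# Example: five parameters packed into 400-byte buckets.
buckets = bucket_params([100, 200, 300, 50, 400], bucket_bytes=400)
assert buckets == [[0, 1], [2, 3], [4]]
```

Fewer, larger collectives amortize per-call launch overhead and keep message sizes in the range where the interconnect reaches peak bandwidth, at the cost of slightly delaying the first parameters in each bucket.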
A lightweight performance model captures both model-level characteristics (per-layer forward/backward timings) and cluster-level characteristics (bandwidth, channel count), enabling rapid exploration of the massive hybrid-parallel configuration space with MFU predictions accurate to within 2-5%.
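A cost model in this spirit can be sketched very simply: estimate iteration time from measured compute and communication terms (with some fraction of communication hidden by overlap), then convert that into a predicted MFU. The formula, the overlap fraction, and all numbers below are illustrative assumptions, not the paper's actual model.

```python
# Toy performance model: predict iteration time from per-iteration
# forward/backward compute and communication volume, assuming a fixed
# fraction of communication overlaps with compute; then derive MFU.
# All constants here are illustrative assumptions.

def predicted_iter_time(fwd_s, bwd_s, comm_bytes, bw_bytes_per_s, overlap=0.7):
    """Estimated wall-clock seconds per training iteration."""
    compute = fwd_s + bwd_s
    comm = comm_bytes / bw_bytes_per_s
    # A fraction `overlap` of communication hides behind compute.
    return compute + (1.0 - overlap) * comm

def predicted_mfu(model_flops_per_iter, iter_time_s, num_gpus, peak_flops_per_gpu):
    """Model FLOPs utilization: achieved FLOP/s over aggregate peak FLOP/s."""
    return model_flops_per_iter / (iter_time_s * num_gpus * peak_flops_per_gpu)

# Example with made-up numbers: 0.6 s of compute, 100 GB of traffic over a
# 200 GB/s link with 70% overlap gives a 0.75 s iteration.
t = predicted_iter_time(0.2, 0.4, 100e9, 200e9, overlap=0.7)
mfu = predicted_mfu(7.68e16, t, num_gpus=256, peak_flops_per_gpu=1e15)
assert 0.0 < mfu < 1.0
```

Sweeping such a model over candidate (DP, TP, PP, CP) configurations is cheap, which is what makes exhaustive exploration of the parallel-parameter space practical before committing cluster time.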
Experimental results on a 256‑GPU H800 cluster show MFU improvements from 32.3% to 42.7% for a 175B model with a 32K context window, and over 30% throughput gains across arbitrary context lengths.
The paper also outlines future directions: trillion‑parameter MoE models, million‑token context windows, efficient RLHF pipelines, low‑precision training (FP8/FP6), and heterogeneous accelerator integration.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.