Best Practices for Training Large Language Models on Ultra‑Large Scale Clusters
This article summarizes the challenges of distributed training for massive language models and presents a suite of solutions—including DP/TP/PP overlap, context parallelism, efficient recomputation, and a performance‑aware cost model—that together boost training throughput by over 30% on large GPU clusters.
The Kuaishou AIP team distilled a set of best‑practice techniques for training large language models (LLMs) on ultra‑large GPU clusters, originally presented at QCon and published in USENIX ATC ’24. The work focuses on maintaining model performance while significantly improving training efficiency.
Key challenges include the prohibitive memory footprint of activations, the communication overhead of data, tensor, and pipeline parallelism, and the difficulty of tuning hybrid parallel configurations.
Solutions are organized into three pillars:
- Communication-compute overlap for data parallelism (DP) and tensor parallelism (TP), using ZeRO-3-inspired all-gather/reduce-scatter scheduling to hide collective latency behind computation.
- Context parallelism (CP), which partitions the sequence dimension to reduce activation memory and communication volume, combined with grouped-query attention to further shrink KV activations.
- Memory-efficient activation strategies such as GEMM-last recomputation, selective checkpointing, and pipeline-aware offloading, which trade a small amount of extra compute for large memory savings.

Additional engineering optimizations address SM resource contention, channel-level communication tuning, and bucketed all-gathers that mitigate network congestion.
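The core idea of context parallelism can be sketched in a few lines: the sequence axis of each activation tensor is split evenly across CP ranks, so per-rank activation memory shrinks by the CP degree. The function names and shapes below are illustrative assumptions, not Kuaishou's actual implementation.

```python
# Hypothetical sketch of context parallelism (CP): the sequence dimension
# is split evenly across CP ranks, so each rank holds activations for only
# seq_len / cp_size tokens. Names and shapes are illustrative only.

def shard_sequence(tokens, cp_size):
    """Split a length-seq_len token sequence into cp_size contiguous shards."""
    seq_len = len(tokens)
    assert seq_len % cp_size == 0, "sequence must divide evenly across CP ranks"
    chunk = seq_len // cp_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(cp_size)]

def per_rank_activation_bytes(seq_len, hidden, bytes_per_elem, cp_size):
    """Activation memory one rank holds for a [seq_len, hidden] tensor under CP."""
    return seq_len * hidden * bytes_per_elem // cp_size

# Example: a 32K-token sequence sharded over 8 CP ranks.
shards = shard_sequence(list(range(32768)), cp_size=8)
assert len(shards) == 8 and len(shards[0]) == 4096

# Per-rank activation memory drops 8x versus the unsharded case.
full = per_rank_activation_bytes(32768, 12288, 2, cp_size=1)
sharded = per_rank_activation_bytes(32768, 12288, 2, cp_size=8)
assert full == 8 * sharded
```

In a real system the attention computation must also exchange keys and values (or partial attention results) across CP ranks; the sketch only shows the memory-partitioning side of the technique.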
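Bucketing is a standard way to mitigate the congestion mentioned above: rather than launching one small collective per parameter, parameters are packed into fixed-size buckets and one all-gather is issued per bucket. The following is a minimal sketch of such a bucketing policy; the size threshold and greedy packing rule are assumptions for illustration.

```python
# Hypothetical sketch of bucketed all-gather scheduling: greedily pack
# parameter tensors (by byte size) into buckets no larger than
# bucket_bytes, then issue one collective per bucket instead of one per
# parameter. The packing rule here is an illustrative assumption.

def bucket_params(param_sizes, bucket_bytes):
    """Group parameter indices into buckets of at most bucket_bytes each.

    A single parameter larger than bucket_bytes still gets its own bucket.
    """
    buckets, current, current_bytes = [], [], 0
    for idx, size in enumerate(param_sizes):
        if current and current_bytes + size > bucket_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets

# Example: five parameters packed into 400-byte buckets.
buckets = bucket_params([100, 200, 300, 50, 400], bucket_bytes=400)
assert buckets == [[0, 1], [2, 3], [4]]
```

Fewer, larger collectives amortize per-call launch overhead and keep message sizes in the range where the interconnect reaches peak bandwidth, at the cost of slightly delaying the first parameters in each bucket.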
A lightweight performance model captures both model-level characteristics (per-layer forward/backward timings) and cluster-level characteristics (bandwidth, channel count), enabling rapid exploration of the massive hybrid-parallel configuration space with MFU predictions accurate to within 2-5%.
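A cost model in this spirit can be sketched very simply: estimate iteration time from measured compute and communication terms (with some fraction of communication hidden by overlap), then convert that into a predicted MFU. The formula, the overlap fraction, and all numbers below are illustrative assumptions, not the paper's actual model.

```python
# Toy performance model: predict iteration time from per-iteration
# forward/backward compute and communication volume, assuming a fixed
# fraction of communication overlaps with compute; then derive MFU.
# All constants here are illustrative assumptions.

def predicted_iter_time(fwd_s, bwd_s, comm_bytes, bw_bytes_per_s, overlap=0.7):
    """Estimated wall-clock seconds per training iteration."""
    compute = fwd_s + bwd_s
    comm = comm_bytes / bw_bytes_per_s
    # A fraction `overlap` of communication hides behind compute.
    return compute + (1.0 - overlap) * comm

def predicted_mfu(model_flops_per_iter, iter_time_s, num_gpus, peak_flops_per_gpu):
    """Model FLOPs utilization: achieved FLOP/s over aggregate peak FLOP/s."""
    return model_flops_per_iter / (iter_time_s * num_gpus * peak_flops_per_gpu)

# Example with made-up numbers: 0.6 s of compute, 100 GB of traffic over a
# 200 GB/s link with 70% overlap gives a 0.75 s iteration.
t = predicted_iter_time(0.2, 0.4, 100e9, 200e9, overlap=0.7)
mfu = predicted_mfu(7.68e16, t, num_gpus=256, peak_flops_per_gpu=1e15)
assert 0.0 < mfu < 1.0
```

Sweeping such a model over candidate (DP, TP, PP, CP) configurations is cheap, which is what makes exhaustive exploration of the parallel-parameter space practical before committing cluster time.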
Experimental results on a 256‑GPU H800 cluster show MFU improvements from 32.3% to 42.7% for a 175B model with a 32K context window, and over 30% throughput gains across arbitrary context lengths.
The paper also outlines future directions: trillion‑parameter MoE models, million‑token context windows, efficient RLHF pipelines, low‑precision training (FP8/FP6), and heterogeneous accelerator integration.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.