Artificial Intelligence 15 min read

How ByteDance Scaled LLM Training to Over 10,000 GPUs: Inside the MegaScale System

The article analyzes ByteDance and Peking University's MegaScale system that enables efficient, stable training of large language models on clusters exceeding ten thousand GPUs, detailing algorithmic tweaks, 3D parallel communication overlap, operator optimizations, data‑pipeline improvements, network tuning, and fault‑tolerance mechanisms that together achieve a 55.2% MFU on a 175B model.

Architects' Tech Alliance

Apr 6, 2024

How ByteDance Scaled LLM Training to Over 10,000 GPUs: Inside the MegaScale System

Background

ByteDance, together with researchers from Peking University, published a paper describing a production‑grade system—MegaScale—designed to train large language models (LLMs) on clusters larger than ten thousand GPUs. The work addresses the efficiency and stability challenges that arise at such massive scale.

Challenges of 10k‑GPU Clusters

Two primary challenges are identified: (1) achieving high training efficiency, measured by model‑floating‑point‑utilization (MFU), which depends on operator performance, data preprocessing, and communication overhead; and (2) maintaining training stability, as failures and delays become costly when thousands of GPUs are involved.

MegaScale System Design

Algorithmic Optimizations

The system incorporates several algorithmic improvements that preserve model accuracy while boosting speed:

Parallel Transformer blocks : a parallel version of the transformer layer allows attention and MLP blocks to run concurrently, reducing compute time without degrading quality.

Sliding‑Window Attention (SWA) : a sparse attention mechanism that limits each token’s context to a fixed‑size window, offering comparable context coverage with lower compute cost.

LAMB optimizer : enables batch sizes up to 64K for BERT‑style models without loss of convergence, alleviating batch‑size bottlenecks.

3D Parallel Communication Overlap

MegaScale employs a 3D parallelism strategy (tensor, pipeline, and data parallelism). It overlaps all‑gather and reduce‑scatter communications with computation by pre‑fetching all‑gather at the start of each iteration and decoupling send/receive operations, reducing visible communication time.

Efficient Operators

Key operators are further optimized: FlashAttention‑2 replaces the standard attention kernel, while LayerNorm and GeLU are re‑implemented with fine‑grained kernels. Kernel fusion reduces launch overhead and improves memory access patterns.

Data‑Pipeline Optimizations

Data loading is made asynchronous so that preprocessing runs off the critical path. Redundant per‑GPU data loaders are eliminated by using a single loader per machine that reads data into shared memory, after which each GPU copies only the needed portion, cutting I/O contention.

Collective Communication Group Initialization

Standard torch.distributed initialization becomes a bottleneck beyond 2,048 GPUs. MegaScale replaces the TCPStore barrier with a non‑blocking Redis store and redesigns the group‑initialization order to minimize global barriers, dropping initialization time from 1,047 s to under 5 s for 2,048 GPUs and to under 30 s for >10k GPUs.

Network Performance Tuning

The data‑center network uses Broadcom Tomahawk 4 switches (25.6 Tbps total bandwidth, 64 × 400 Gbps ports) in a three‑layer CLOS topology. Techniques applied include:

Separating uplink and downlink on ToR switches and using AOC cables to halve hash collisions.

Implementing a hybrid congestion‑control algorithm that combines Swift and DCQCN, using precise RTT measurements and ECN for rapid response.

Adjusting NCCL retransmission timers and enabling NIC‑level adaptive retransmission to reduce recovery time during link jitter.

Fault‑Tolerance Mechanisms

To handle inevitable hardware and software failures at scale, MegaScale integrates a robust training framework that automatically detects node failures via heartbeat messages from each pod. Upon detection, the driver submits the faulty node’s IP to a custom Kubernetes controller, which evicts the node and replaces it with a healthy one. Checkpointing is optimized for rapid restoration, and a millisecond‑level monitoring system tracks key metrics to aid root‑cause analysis.

Results and Conclusion

When training a 175‑billion‑parameter LLM on 12,288 GPUs, MegaScale achieved a 55.2 % MFU, a 1.34× improvement over Megatron‑LM. The paper demonstrates that co‑design of algorithms and system infrastructure can substantially increase training efficiency and stability, providing practical insights for future large‑scale LLM research.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Distributed Systems performance optimization fault tolerance LLM training GPU clusters MegaScale

Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.