Accelerating Large Model Training and Inference with Baidu Baige AIAK‑LLM
Baidu Baige's AIAK‑LLM suite accelerates large‑model training and inference by raising Model FLOPS Utilization (MFU). Its techniques include TP communication overlap, hybrid recompute, zero‑offload, automatic parallel‑strategy search, multi‑chip support, and inference‑specific optimizations, delivering over 60% speedups along with near‑seamless Hugging Face integration.
This article summarizes the public talk “Baige AIAK‑LLM: Large Model Training and Inference Acceleration Practice” from the 2024 Baidu Create Conference (April 16). The speaker focuses on AI infrastructure (AI Infra) challenges and introduces Baidu’s AIAK‑LLM acceleration suite.
The presentation is divided into four parts:
Discussion of the challenges that large models bring to underlying infrastructure.
Introduction of the key performance metric Model FLOPS Utilization (MFU) and industry techniques to improve it.
Case studies from Baidu Baige AIAK‑LLM that raise MFU to a high level.
Product‑level overview of capabilities and design philosophy.
Background and demand
Large models are growing rapidly—model size typically increases by an order of magnitude every 1–2 years, and data volume grows proportionally. Training such models requires massive compute resources (e.g., 2,048 A100 GPUs for Llama‑1 65B) and leads to high training cost and scalability issues (e.g., GPT‑4 reportedly used 10,000–25,000 GPUs, and GPT‑5 may need 50,000 H100 GPUs). Consequently, AI infrastructure must be ultra‑large, stable, and efficient.
MFU (Model FLOPS Utilization)
MFU is the ratio of the FLOPS actually achieved during training/inference to the theoretical peak FLOPS of the chip. For example, if an A800 GPU sustains 100 TFLOPS while its peak is 315 TFLOPS, the MFU is roughly 32%.
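The definition above reduces to a one‑line ratio; the following minimal sketch reproduces the A800 example from the text (the function name is ours, not part of any AIAK‑LLM API):

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPS Utilization: achieved throughput over theoretical peak."""
    return achieved_tflops / peak_tflops

# The A800 example from the text: 100 TFLOPS sustained vs. a 315 TFLOPS peak.
print(f"MFU: {mfu(100, 315):.0%}")  # prints "MFU: 32%"
```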
Ideal MFU (measured on GEMM‑only workloads) is around 80–85% on an A800. Real workloads add non‑GEMM operators, communication, and scheduling overhead, whose efficiency losses multiply and pull MFU down.
Rough targets: ~75%+ MFU for training and ~30%+ MFU for inference on medium‑scale clusters.
Baidu Baige three‑layer architecture
Resource layer: large‑scale, stable, cost‑effective compute and storage resources suitable for heterogeneous computing.
Component layer: scheduling, fault‑tolerance, and other components that support massive training and inference tasks.
Acceleration layer: rich compute/I/O acceleration capabilities that maximize hardware efficiency.
The talk then dives into the core techniques of the AIAK‑LLM suite.
Training optimizations
TP communication overlap: Split backward‑gradient computation and communication, overlap GEMM with All‑Reduce, and further split GEMM into smaller chunks to hide communication latency.
Hybrid recompute: Combine full‑block and selective recompute strategies to balance memory savings against extra compute.
Zero‑offload: Offload optimizer state, parameters, and gradients to CPU memory and bring them back on demand, reducing the need for recompute and freeing GPU memory for larger TP/PP configurations.
Automatic parallel‑strategy search: Build performance models for compute, memory, and communication; enumerate all possible parallel configurations; predict execution time; and select the optimal configuration within minutes.
Multi‑chip enablement: Abstract the accelerator interface to support heterogeneous chips (GPU, Kunlun, Ascend), enable mixed‑chip training, and handle cross‑chip communication challenges.
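The automatic parallel‑strategy search described above can be sketched as "predict, enumerate, pick the minimum." Everything below is an illustrative toy: the cost constants, memory model, and formulas are our assumptions, not AIAK‑LLM's actual performance models.

```python
from itertools import product
from typing import Optional

# Toy constants (hypothetical; AIAK-LLM builds real profiled models).
NUM_GPUS = 64
COMPUTE_MS = 1000.0   # one step on a single GPU, in ms
TP_COMM_MS = 8.0      # added latency per extra tensor-parallel rank
PP_BUBBLE = 0.05      # pipeline-bubble fraction per extra stage
MEM_PER_GPU_GB = 80.0
MODEL_STATE_GB = 600.0  # params + grads + optimizer state (hypothetical)

def predict_step_time(tp: int, pp: int, dp: int) -> Optional[float]:
    """Predicted step time in ms, or None if the config is infeasible."""
    if tp * pp * dp != NUM_GPUS:
        return None
    # Memory check: model state is sharded across the tp * pp ranks.
    if MODEL_STATE_GB / (tp * pp) > MEM_PER_GPU_GB:
        return None
    compute = COMPUTE_MS / NUM_GPUS
    tp_comm = TP_COMM_MS * (tp - 1)          # grows with TP degree
    bubble = compute * PP_BUBBLE * (pp - 1)  # pipeline bubbles
    return compute + tp_comm + bubble

# Enumerate every (tp, pp, dp) over small power-of-two degrees.
candidates = [c for c in product([1, 2, 4, 8], repeat=3)
              if predict_step_time(*c) is not None]
best = min(candidates, key=lambda c: predict_step_time(*c))
print("best (tp, pp, dp):", best)
```

Under these toy constants the search favors low TP (its communication term dominates) and just enough TP×PP sharding to fit memory; a real cost model would also account for interconnect topology and microbatching.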
These optimizations raised training MFU from a ~30% baseline to over 60% on clusters of 32–256 cards, with up to 30%+ end‑to‑end speedup across model sizes and scales.
Inference optimizations
Reduce token‑gap latency by moving sampling, stop‑check, and other post‑processing steps onto the GPU and parallelizing them.
Parallelize “to‑text”, “to‑client”, and scheduler operations using multi‑process pipelines.
Rewrite the scheduler in C++ to allow slot‑level parallelism, eliminating per‑sequence bottlenecks.
Address low GEMM MFU in the decoder by using small‑model pre‑generation (token‑splitting) to increase batch size per GEMM.
Adopt KV‑Cache and PagedAttention to avoid redundant attention computation and to manage variable‑length sequences efficiently.
Switch from a 2‑D padded sequence layout to a 1‑D concatenated layout, reducing padding overhead by 10–20%.
Provide extensible hooks for tokenization, preprocessing, and postprocessing without modifying the core engine.
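The 2‑D‑padded versus 1‑D‑concatenated layouts mentioned above can be illustrated with a small sketch. The sequence lengths are made up, and the `cu_seqlens` name follows the common varlen‑attention convention rather than any confirmed AIAK‑LLM API:

```python
import numpy as np

seq_lens = [17, 5, 30, 9]          # token counts of four requests (made up)
max_len = max(seq_lens)

# 2-D layout: every row is padded out to max_len.
padded_tokens = len(seq_lens) * max_len
real_tokens = sum(seq_lens)
print(f"padding overhead: {1 - real_tokens / padded_tokens:.0%}")

# 1-D layout: concatenate all tokens; cumulative offsets mark boundaries.
cu_seqlens = np.cumsum([0] + seq_lens)       # [0, 17, 22, 52, 61]
tokens = np.arange(real_tokens)              # stand-in for token embeddings
seq1 = tokens[cu_seqlens[1]:cu_seqlens[2]]   # slice out the second request
assert len(seq1) == seq_lens[1]
```

With this skew in sequence lengths the padded batch wastes nearly half its slots, while the 1‑D layout stores exactly `real_tokens` entries and recovers each sequence from the offset array.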
These techniques collectively improve inference MFU and achieve >60% latency reduction, with up to 60%+ throughput gains in low‑latency scenarios.
Product integration
AIAK‑LLM is integrated with the Hugging Face ecosystem, allowing users to switch with minimal code changes. Baidu also provides three auxiliary tools: checkpoint conversion, precision‑alignment (comparing AIAK‑LLM results with Hugging Face), and performance analysis dashboards. Over 20 models are pre‑adapted, enabling “out‑of‑box” usage on Baidu Smart Cloud.
In summary, Baidu Baige’s AIAK‑LLM suite tackles the three major AI‑Infra challenges—scale, stability, and cost—by improving MFU for both training and inference, supporting heterogeneous hardware, and delivering a product that can be readily adopted by developers.
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.