Unlocking MoE Model Power: Baidu’s Baige 5.0 AI Platform’s FP8 and Distributed Innovations
Baidu's Baige 5.0 AI Computing Platform introduces FP8 mixed-precision training, MoE-aware distributed strategies, adaptive parallelism, and a three-tier KV-Cache, delivering over 30% faster training and more than 50% higher inference throughput while keeping first-token latency under half a second for large-scale models.
Overview
On August 29, 2025, the Baidu Cloud Intelligence Conference featured a keynote by Wang Yanpeng, Chief Scientist of AI Computing, who announced the release of the Baidu Baige AI Computing Platform 5.0 and presented several notable performance figures.
From Baige 4.0 to Baige 5.0
Baige 4.0's core feature was stable training on clusters of up to 100k cards, targeting large-scale training scenarios. Over the past year, AI workloads have shifted from being dominated by pre-training to a mixed-workload stage that also includes online inference, supervised fine-tuning (SFT), and reinforcement learning.
Consequently, Baige 5.0 delivers a comprehensive upgrade:
Infrastructure layer: high-performance AI compute, AI network, and AI storage products that enable efficient training-inference mixing within a single cluster.
Engineering layer: end-to-end capabilities covering data preparation, model development, training, and inference deployment.
Model acceleration layer: deep performance acceleration for dense, MoE, and other popular open-source models, provided to customers as easy-to-use product features.
MoE Model Emergence
The rise of MoE models is a key innovation: by activating only a subset of experts for each token, they continue the Scaling Law while keeping compute growth modest. However, MoE introduces three major infrastructure challenges: massive parameter growth, a surge in communication volume, and increased system complexity.
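To make concrete why MoE keeps compute growth modest, the sketch below shows generic top-k expert routing in PyTorch: only `top_k` of `num_experts` experts run for each token, so FLOPs track the number of activated experts while the parameter count grows with the total number of experts. The layer sizes and routing scheme are illustrative assumptions, not the internals of Baige or any particular model.

```python
# Illustrative top-k MoE routing; shapes and expert count are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, num_experts=64, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                               # x: [tokens, d_model]
        scores = self.router(x)                         # [tokens, num_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)  # pick top_k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only top_k of num_experts experts execute per token: compute scales with
        # top_k, while parameters scale with num_experts.
        for e, expert in enumerate(self.experts):
            hit = (idx == e)                            # [tokens, top_k] bool
            if hit.any():
                rows = hit.any(dim=-1).nonzero(as_tuple=True)[0]
                w = (weights * hit)[rows].sum(dim=-1, keepdim=True)
                out[rows] += w * expert(x[rows])
        return out

moe = TopKMoE()
tokens = torch.randn(16, 1024)
print(moe(tokens).shape)   # torch.Size([16, 1024])
```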
System‑Level Upgrades in Baige 5.0
To address these challenges, Baige 5.0 implements full‑stack optimizations from chips to frameworks:
FP8 mixed-precision training: Compared with BF16, FP8 halves the bits per value and roughly doubles peak matrix throughput on supporting hardware, offering large theoretical gains. A production-ready FP8 training pipeline integrates FP8 GEMM kernels and a quantization-scale strategy across the entire training flow, ensuring stable loss convergence (a sketch of the scale bookkeeping follows this list).
Operator fusion: Fine-grained fusion of the additional FP8 scale operators into the forward and backward passes dramatically reduces operator overhead.
Communication optimization: Parallelism strategies are adjusted so that communication also uses FP8, preserving the performance benefits of FP8 throughout the pipeline.
These techniques together achieve more than a 30% end‑to‑end training speedup.
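As a rough illustration of the quantization-scale bookkeeping mentioned above, here is a minimal per-tensor FP8 (E4M3) scaling sketch in PyTorch. The helper names and the reference matmul are assumptions for clarity; a production pipeline would keep operands in FP8 inside fused GEMM kernels rather than dequantizing as done here.

```python
# Minimal per-tensor FP8 (E4M3) scaling sketch; not Baige's implementation.
# Requires PyTorch >= 2.1 for the float8_e4m3fn dtype.
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def quantize_fp8(x: torch.Tensor):
    """Scale a BF16/FP32 tensor into the FP8 dynamic range, then cast."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax               # stretch small tensors, shrink large ones
    return (x * scale).to(torch.float8_e4m3fn), scale

def fp8_matmul_reference(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Reference GEMM: quantize both operands to FP8, multiply, fold scales back out."""
    a_fp8, a_scale = quantize_fp8(a)
    b_fp8, b_scale = quantize_fp8(b)
    out = a_fp8.to(torch.bfloat16) @ b_fp8.to(torch.bfloat16)
    return out / (a_scale * b_scale)           # undo both quantization scales

x, w = torch.randn(128, 256), torch.randn(256, 512)
err = (fp8_matmul_reference(x, w) - x @ w).abs().mean().item()
print(f"mean abs error vs FP32 matmul: {err:.4f}")
```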
Distributed Parallelism for MoE
MoE’s sparsity and compute imbalance require re‑thinking parallel strategies:
Heterogeneous Tensor-Parallel (TP) slicing: Different MLP sizes and attention dimensions use distinct TP configurations (e.g., TP4, TP8) for efficient compute partitioning.
Dynamic Pipeline-Parallel (PP) slicing: A dynamic balancing strategy distributes MoE layers and dense layers across pipeline stages to keep per-stage workloads even (see the sketch after this list).
Communication hiding: Techniques such as DeepEP and batch-level parallelism overlap the massive MoE communication with computation, reducing GPU idle time.
These optimizations raise training throughput by over 50%.
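As a concrete illustration of the dynamic balancing idea referenced above, the sketch below partitions a layer list into pipeline stages using per-layer cost estimates, so that stages holding heavier MoE layers receive fewer layers. The cost numbers and the greedy heuristic are assumptions for illustration, not Baige's actual scheduler.

```python
# Cost-balanced pipeline-stage assignment (illustrative heuristic, not Baige's scheduler).
def balance_pipeline_stages(layer_costs, num_stages):
    """Contiguous partition: each stage takes layers until it reaches its fair
    share of the remaining cost, leaving at least one layer per later stage."""
    stages, start = [], 0
    remaining_cost = sum(layer_costs)
    for stage in range(num_stages):
        stages_left = num_stages - stage
        if stages_left == 1:                           # last stage takes the rest
            stages.append(list(range(start, len(layer_costs))))
            break
        target = remaining_cost / stages_left
        acc, end = 0.0, start
        while (end < len(layer_costs) - (stages_left - 1)
               and acc + layer_costs[end] <= target + 1e-9):
            acc += layer_costs[end]
            end += 1
        end = max(end, start + 1)                      # every stage gets >= 1 layer
        stages.append(list(range(start, end)))
        remaining_cost -= sum(layer_costs[start:end])
        start = end
    return stages

# Example: 2 cheaper dense layers followed by 14 heavier MoE layers, 4 pipeline stages.
costs = [1.0, 1.0] + [1.8] * 14
for s, layers in enumerate(balance_pipeline_stages(costs, num_stages=4)):
    print(f"stage {s}: layers {layers[0]}..{layers[-1]}, cost {sum(costs[i] for i in layers):.1f}")
```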
Inference Innovations
Inference has been transformed from a single‑card setup to a distributed system spanning dozens of machines. Key advances include:
PD (Prefill-Decode) separation: Fully separating the visual encoder (ViT) from the language model, as well as attention from MLP (expert) computation, allows each part to run on the most suitable hardware.
Dynamic DP load balancing (DPLB): Token-level scheduling eliminates queueing delays and balances workload across GPUs.
Expert load balancing (EPLB): Real-time statistics guide dynamic routing of tokens to experts, mitigating bottlenecks caused by uneven expert activation (see the sketch after this list).
Combined, these techniques deliver more than a 50% inference throughput increase and keep first‑token latency under 0.5 s for 16 K inputs and around 3 s for 128 K inputs.
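To make the EPLB idea concrete, the sketch below plans expert replicas from observed per-expert token counts, giving hot experts extra copies so that no single replica becomes the serving bottleneck. The counts, slot budget, and greedy policy are illustrative assumptions, not Baige's implementation.

```python
# Statistics-driven expert replica planning (EPLB-style, illustrative only).
from collections import Counter

def plan_expert_replicas(expert_load, num_slots):
    """Assign num_slots replica slots so hot experts receive extra copies,
    roughly proportional to their observed token share."""
    replicas = {e: 1 for e in expert_load}             # every expert keeps one copy
    spare = num_slots - len(expert_load)
    for _ in range(spare):
        # Give the next spare slot to the expert whose replicas are most loaded.
        hottest = max(expert_load, key=lambda e: expert_load[e] / replicas[e])
        replicas[hottest] += 1
    return replicas

# Example: 8 experts with skewed routing statistics and 12 replica slots.
load = Counter({0: 90, 1: 80, 2: 100, 3: 640, 4: 70, 5: 410, 6: 95, 7: 85})
print(plan_expert_replicas(load, num_slots=12))
# -> experts 3 and 5 get extra replicas; the rest keep a single copy
```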
Three‑Tier Distributed KV‑Cache
To alleviate HBM capacity limits, Baidu builds a three‑level KV‑Cache hierarchy (HBM → DRAM → SSD) with a real‑time global index that pre‑fetches cache entries, dramatically improving hit rates for repeated system prompts in agent applications.
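A minimal sketch of such a tiered lookup appears below, assuming KV blocks are keyed by a prefix hash, promoted to HBM on a hit, and demoted down the hierarchy on eviction. The tier sizes, class layout, and LRU policy are illustrative assumptions, not Baige's KV-Cache service.

```python
# Three-tier KV-cache sketch (HBM -> DRAM -> SSD) with LRU demotion; illustrative only.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_entries=8, dram_entries=64, ssd_entries=4096):
        self.tiers = [OrderedDict() for _ in range(3)]    # each tier: prefix_hash -> KV blocks
        self.capacity = [hbm_entries, dram_entries, ssd_entries]

    def get(self, prefix_hash):
        for tier in self.tiers:
            if prefix_hash in tier:
                blocks = tier.pop(prefix_hash)
                self._insert(0, prefix_hash, blocks)      # promote the hit back to HBM
                return blocks
        return None                                       # miss: prefill must be recomputed

    def put(self, prefix_hash, blocks):
        self._insert(0, prefix_hash, blocks)

    def _insert(self, level, prefix_hash, blocks):
        if level >= len(self.tiers):
            return                                        # evicted past SSD: dropped
        tier = self.tiers[level]
        tier[prefix_hash] = blocks
        tier.move_to_end(prefix_hash)
        if len(tier) > self.capacity[level]:
            old_key, old_blocks = tier.popitem(last=False)
            self._insert(level + 1, old_key, old_blocks)  # demote the LRU entry one level down

# Repeated system prompts hash to the same prefix, so their KV blocks are reused
# instead of being recomputed for every agent request.
cache = TieredKVCache()
cache.put("system_prompt_v1", ["kv_block_0", "kv_block_1"])
assert cache.get("system_prompt_v1") == ["kv_block_0", "kv_block_1"]
```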
Hardware Collaboration: Kunlun Chip
The Kunlun P800 chip features independent tensor and general cores, enabling compute‑communication parallelism ideal for MoE workloads. Its high tensor‑core density pushes GEMM and attention efficiencies above 80%, making it one of the best chips for MoE inference.
For training, Baidu also offers a 32-card Kunlun super-node with a fully meshed interconnect, providing easy deployment, only a modest cost increase over traditional 8-card servers, and enough performance to support all mainstream MoE models.
Performance Results
MoE training throughput improves by 30% (single‑card throughput nearly doubles).
Inference throughput doubles.
Industry Impact
Customers such as an education provider with massive multimodal workloads (photo‑based problem solving, homework grading) benefit from Baige 5.0’s hardware adaptation, mixed‑load resource sharing, and performance acceleration, achieving lower inference costs, higher resource utilization, and faster model iteration.
Conclusion
Baige 5.0 reconstructs AI computing infrastructure around MoE models, covering FP8 quantization, distributed parallelism, PD‑separated inference, adaptive scheduling, and Kunlun‑chip co‑design. Each optimization directly addresses the mixed‑load reality of modern AI, enabling enterprises to harness AI at scale with reduced cost and latency.