Baidu Geek Talk
May 10, 2023 · Artificial Intelligence
Baidu's AI Infrastructure for Large-Scale LLM Training: Architecture, Challenges, and Optimization
Baidu's AI infrastructure combines a massive InfiniBand-linked GPU cluster, Kunlun chips, the PaddlePaddle framework, and the Wenxin model suite. It uses 4D hybrid parallelism, elastic fault tolerance, and a two-stage training pipeline to overcome the computation, memory, and communication walls of large-scale LLM training, and it reports world-leading MLPerf performance.
AI Infrastructure · GPU Cluster · InfiniBand