How to Build and Accelerate Multi‑Chip AI Clusters for Large‑Model Training
As AI training demand outgrows single-chip GPU clusters, this article explains how to build and accelerate heterogeneous AI clusters that combine GPUs with Kunlun and Ascend chips, addressing interconnect, distributed parallel strategies, and specialized acceleration suites to achieve high MFU and efficient large-model training.
Due to changes in the external environment, the overall scale of GPUs used for large‑model training can no longer keep growing. Existing GPU clusters remain the main source of AI compute, while domestic AI chips are being deployed at scale, leading to a multi‑chip landscape in data centers.
The largest AI training clusters have already expanded from thousands of GPUs to tens of thousands, and many current GPU clusters (tens to hundreds of servers) cannot meet future large‑model demands. Consequently, mixing GPUs with chips such as Kunlun and Ascend in a single cluster becomes a natural choice.
1 How to Build and Accelerate a GPU Cluster
To illustrate the three key aspects of building and accelerating an AI cluster, we first use a GPU cluster as an example.
1.1 Achieve GPU Interconnect
Within a single server, eight GPUs are linked via NVLink. Between servers, GPUs are connected through an RDMA network. After the network is built, NVIDIA’s NCCL library enables GPUs to communicate and synchronize data, allowing training to progress step by step.
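As a minimal sketch, the following shows how one training process per GPU uses NCCL through PyTorch's torch.distributed to synchronize a gradient tensor; the launcher, tensor shape, and averaging step are illustrative assumptions rather than details from any specific cluster:

```python
# Minimal sketch: gradient synchronization over NCCL with torch.distributed.
# Assumes one process per GPU, launched e.g. via `torchrun --nproc_per_node=8 train.py`,
# which supplies RANK/WORLD_SIZE and the rendezvous address.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # NCCL rides on NVLink intra-node, RDMA inter-node
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

grad = torch.randn(1024, 1024, device="cuda")  # stand-in for a gradient tensor
dist.all_reduce(grad, op=dist.ReduceOp.SUM)    # one synchronization step per training iteration
grad /= dist.get_world_size()                  # average gradients across all GPUs

dist.destroy_process_group()
```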
1.2 Define Distributed Parallel Strategy
Training tasks are split across all GPUs in the cluster. Common strategies include data parallelism (splitting the training data) and pipeline parallelism (splitting model layers). The optimal strategy depends on the cluster topology and model size; the goal is that all GPUs execute each step in lockstep, overlapping computation with communication so that no GPU sits idle.
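A toy sketch of how such a strategy maps onto process ranks (the cluster size, stage count, and layer assignment below are invented for illustration; production frameworks derive them from the actual topology and model):

```python
# Toy sketch of combining data parallelism and pipeline parallelism on a
# homogeneous cluster; all sizes here are illustrative assumptions.
WORLD_SIZE = 32                    # total GPUs in the cluster
PP_SIZE = 4                        # pipeline stages (model layers split 4 ways)
DP_SIZE = WORLD_SIZE // PP_SIZE    # 8 data-parallel replicas of the pipeline

def placement(rank: int, num_layers: int = 48):
    """Map a global rank to its data-parallel replica and pipeline stage."""
    dp_rank = rank // PP_SIZE             # which copy of the model this rank belongs to
    pp_stage = rank % PP_SIZE             # which slice of layers this rank executes
    layers_per_stage = num_layers // PP_SIZE
    first = pp_stage * layers_per_stage
    return dp_rank, pp_stage, range(first, first + layers_per_stage)

for r in (0, 5, 31):
    print(r, placement(r))
```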
1.3 Deploy AI Acceleration Suite
The partitioned model and data are deployed as operators on the GPUs. An AI acceleration suite, covering data loading, CUDA libraries, and NCCL communication, optimizes the compute pipeline: techniques such as data prefetching overlap I/O with GPU computation, and optimized NVIDIA operators or newer kernels improve throughput.
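To illustrate one of these techniques, here is a minimal prefetching sketch that stages the next batch's host-to-device copy on a side CUDA stream while the current batch computes; the loader, model, and batch format are placeholders, and real suites add operator- and kernel-level tuning on top:

```python
# Minimal prefetching sketch: overlap the next batch's H2D copy with compute.
import torch

copy_stream = torch.cuda.Stream()

def prefetch(batch):
    with torch.cuda.stream(copy_stream):       # issue the copy off the default stream
        return batch.cuda(non_blocking=True)   # requires pin_memory=True in the DataLoader

def train(loader, model, opt):
    it = iter(loader)
    nxt = prefetch(next(it))
    for batch in it:
        torch.cuda.current_stream().wait_stream(copy_stream)  # prefetched copy must finish
        cur, nxt = nxt, prefetch(batch)        # overlap the next copy with this step
        loss = model(cur).mean()
        loss.backward()
        opt.step()
        opt.zero_grad()
    # (processing of the final prefetched batch omitted for brevity)
```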
2 Differences When Building Clusters with Different Chips
Most data‑center deployments still use a single‑chip‑per‑cluster approach, requiring custom designs for each chip. Comparing Kunlun and Ascend 910B:
Interconnect: Kunlun servers use XPU Link internally and standard RDMA between servers, with XCCL as the communication library. Ascend 910B uses HCCS internally and a proprietary RDMA, with HCCL for communication.
Parallel strategy: NVIDIA GPUs and Kunlun chips adopt an 8-accelerator-per-node layout, while an Ascend 910B node holds 16 chips split into two groups of 8, requiring tailored parallel strategies.
Acceleration suite: Because of differences in compute capability, memory size, I/O bandwidth, and communication libraries, each chip needs its own optimized operator library and acceleration tactics.
3 Challenges and Solutions for Multi‑Chip Heterogeneous Clusters
The goal is a single cluster that supports mixed‑chip training for a large model.
3.1 Cross-Chip Interconnect and Cluster Construction
Different chips are conventionally considered hard to interconnect. Baidu Baige achieves cross-chip connectivity by using CPU forwarding between Ascend 910B sub-clusters and GPU sub-clusters, and its self-developed BCCL library enables optimal RDMA-based communication among GPUs, Kunlun, and other chips.
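BCCL's internals are not public. Purely to illustrate the CPU-forwarding idea, the sketch below relays an all-reduce through host memory using PyTorch's CPU-capable Gloo backend; the backend choice and staging copies are assumptions for illustration, not BCCL's actual mechanism, which keeps traffic on RDMA:

```python
# Illustrative CPU-forwarding relay between heterogeneous sub-clusters: each rank
# stages its tensor in host memory and reduces over a CPU-capable backend, so chips
# with incompatible fabrics still reach a common result.
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")    # CPU backend reachable from any chip's host

def cross_chip_all_reduce(t: torch.Tensor) -> torch.Tensor:
    host = t.cpu()                         # device -> host staging copy
    dist.all_reduce(host, op=dist.ReduceOp.SUM)
    return host.to(t.device)               # host -> device on whatever chip owns `t`
```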
3.2 Adaptive Parallel‑Strategy Search to Boost Overall Efficiency
Conventional parallel strategies assume uniform chip performance, which leads to idle high‑performance chips in heterogeneous clusters. Baidu Baige’s AIAK‑LLM suite performs an adaptive parallel‑strategy search: it evaluates compute, memory, and communication costs for each chip using an AI‑chip performance matrix, then determines a non‑uniform distribution of workload (data, model layers) that maximizes overall throughput.
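The search itself is proprietary; the toy sketch below captures only the core idea of capacity-proportional partitioning, reducing the performance matrix to a single relative-throughput number per chip type (all numbers are invented, and a real search also weighs memory capacity and communication cost):

```python
# Toy sketch of non-uniform workload partitioning from per-chip throughput figures.
perf = {"A800": 1.0, "chip_B": 0.6}     # relative throughput per chip (invented)
counts = {"A800": 100, "chip_B": 80}    # chips of each type in the mixed cluster
num_layers = 96                         # model layers to distribute

total = sum(perf[c] * n for c, n in counts.items())
plan, assigned = {}, 0
chips = list(counts)
for i, chip in enumerate(chips):
    if i == len(chips) - 1:
        plan[chip] = num_layers - assigned          # remainder goes to the last chip type
    else:
        share = perf[chip] * counts[chip] / total   # capacity-proportional share
        plan[chip] = round(num_layers * share)
        assigned += plan[chip]

print(plan)   # the faster sub-cluster takes proportionally more layers
```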
3.3 Accelerator Abstraction Layer to Mask Hardware Differences
Instead of chip‑specific optimizations, an “Accelerator” abstraction layer decouples operators from hardware. Chip vendors only need to tune their own operators, while Baidu’s higher‑level strategies (communication overlap, tensor parallelism, etc.) are automatically applied across all chips, achieving high MFU values on both GPU and domestic chips.
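A hypothetical sketch of such an abstraction layer (the interface and class names are illustrative, not Baidu's actual API): training and strategy code calls only the abstract interface, while each vendor implements it over its own kernels and communication library.

```python
# Sketch of an accelerator abstraction layer: everything above this interface is
# chip-agnostic strategy code; everything below it is vendor-specific.
from abc import ABC, abstractmethod
import torch
import torch.distributed as dist
import torch.nn.functional as F

class Accelerator(ABC):
    @abstractmethod
    def device(self) -> torch.device: ...
    @abstractmethod
    def all_reduce(self, t: torch.Tensor) -> torch.Tensor: ...
    @abstractmethod
    def fused_attention(self, q, k, v) -> torch.Tensor: ...

class CudaAccelerator(Accelerator):
    def device(self) -> torch.device:
        return torch.device("cuda")
    def all_reduce(self, t):
        dist.all_reduce(t)    # NCCL underneath on NVIDIA GPUs
        return t
    def fused_attention(self, q, k, v):
        return F.scaled_dot_product_attention(q, k, v)

# A Kunlun or Ascend backend would implement the same three methods over XCCL/HCCL
# and its own fused kernels; code above the interface never branches on chip type.
```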
3.4 Technical Metrics for Mixed‑Chip Training
After addressing the three challenges, mixed-chip training retains up to 97% of the sub-clusters' combined standalone throughput at hundred-card scale, and 95% at thousand-card scale. For example, if an NVIDIA A800 cluster delivers 100 units of training throughput on its own and a domestic-chip cluster delivers 80 units, running them as one mixed cluster at 97% efficiency yields (100 + 80) × 0.97 = 174.6 units. Per-chip utilization is measured by MFU (Model FLOPs Utilization), defined as the observed model throughput divided by the theoretical peak FLOPs throughput.
4 Unified Compute Power for Future Growth
Baige’s solution hides the complexity of heterogeneous environments, unifying diverse compute resources into a single large cluster that can scale with business needs. The service is available both on Baidu Cloud public offerings and the ABC Stack private cloud.
5 Glossary
MFU – Model FLOPs Utilization
MFU = (actual observed model throughput) / (theoretical maximum throughput at peak FLOPs). A higher MFU indicates more efficient use of GPU floating‑point capacity, leading to faster training.
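A worked example under common approximations (the token throughput, parameter count, and the 6·N FLOPs-per-token rule of thumb for dense transformers are illustrative assumptions, not figures from this article):

```python
# Worked MFU example: a dense transformer spends roughly 6 * num_params FLOPs per
# trained token (forward + backward). All numbers below are invented.
num_params   = 13e9           # 13B-parameter model
tokens_per_s = 2.0e5          # observed cluster throughput, tokens/second
peak_flops   = 128 * 312e12   # 128 GPUs x 312 TFLOPS (A800 BF16 peak)

model_flops_per_s = 6 * num_params * tokens_per_s
mfu = model_flops_per_s / peak_flops
print(f"MFU = {mfu:.1%}")     # about 39% with these assumed numbers
```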
