How to Build a Super‑Scale AI Cluster: From GPU Power to DPU‑Driven Architecture
This article analyzes the technical roadmap for upgrading very large AI GPU clusters to support trillion‑parameter multimodal models. It covers single‑chip performance, super‑node scaling, DPU‑based compute fusion, energy‑efficient design, converged storage, high‑throughput networking, and fault‑tolerant checkpoint strategies.
Background
As AI models scale from hundred‑billion‑parameter language models to trillion‑parameter multimodal models, data‑center clusters must increase compute throughput, memory bandwidth, and energy efficiency. The upgrades span single‑chip performance, super‑node scaling, DPU‑enabled multi‑compute fusion, and extreme performance‑to‑power ratios.
Single‑Chip Capability
Improving a GPU’s compute and memory performance involves:
Adding more parallel processing units and raising clock rates while staying within power budgets.
Optimizing cache hierarchies to lower memory latency.
Adopting lower‑precision formats (e.g., FP8) via domain‑specific architectures (DSA) to boost arithmetic density.
Integrating custom accelerators for workload‑specific kernels.
Using high‑bandwidth memory (HBM) with 2.5D/3D stacking to provide large capacity and short data paths.
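To make the effect of lower‑precision formats concrete, the following minimal sketch estimates the weight‑only memory footprint of a trillion‑parameter model at different precisions, and how many devices are needed just to hold the weights. The 80 GB per‑device HBM capacity and the function names are assumptions chosen for illustration; activations, gradients, and optimizer state are deliberately ignored.

```python
# Bytes per parameter for common numeric formats.
BYTES_PER_PARAM = {"FP32": 4, "FP16/BF16": 2, "FP8": 1}


def weight_footprint_gb(num_params: int, bytes_per_param: int) -> float:
    """Return the weight-only memory footprint in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_devices(num_params: int, bytes_per_param: int, hbm_gb: float) -> int:
    """Smallest number of devices whose combined HBM can hold the weights."""
    total_bytes = num_params * bytes_per_param
    per_device_bytes = int(hbm_gb * 1e9)
    return -(-total_bytes // per_device_bytes)  # ceiling division


if __name__ == "__main__":
    params = 10**12   # one trillion parameters
    hbm_gb = 80.0     # assumed per-device HBM capacity, illustration only
    for name, width in BYTES_PER_PARAM.items():
        print(name, weight_footprint_gb(params, width), "GB,",
              min_devices(params, width, hbm_gb), "devices")
```

Halving the width of each parameter halves both the capacity needed and the bytes moved per operation, which is why precision and HBM bandwidth are usually discussed together.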
Super‑Node Computing
Move beyond traditional 8‑GPU servers by building super‑node servers that extend the GPUs' southbound scale‑up interconnect to more devices, improving tensor‑parallel and MoE (expert‑parallel) efficiency.
Embed scale‑up‑capable switch chips inside nodes to raise point‑to‑point (P2P) bandwidth.
Redesign GPU‑to‑GPU interconnect protocols: new packet formats, support for co‑packaged and near‑packaged optics (CPO/NPO), higher SerDes rates, enhanced congestion control, and multi‑chip chip‑to‑chip (C2C) packaging, all aimed at raising all‑to‑all efficiency and reducing latency.
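The link between scale‑up bandwidth and parallel efficiency can be illustrated with a simple bandwidth‑only cost model. The sketch below uses the standard ring all‑reduce volume (each device sends and receives 2(N-1)/N of the message) and an all‑to‑all model in which each device keeps 1/N of its buffer and sends the rest. Per‑hop latency, protocol overhead, and congestion are ignored, and the bandwidth figures in the example are assumptions rather than measurements.

```python
def ring_allreduce_seconds(message_bytes: float, num_gpus: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only time for a ring all-reduce of one message per GPU."""
    if num_gpus < 2:
        return 0.0
    return 2 * (num_gpus - 1) / num_gpus * message_bytes / link_bytes_per_s


def all_to_all_seconds(buffer_bytes: float, num_gpus: int,
                       link_bytes_per_s: float) -> float:
    """Bandwidth-only time for an all-to-all where each GPU sends (N-1)/N of its buffer."""
    if num_gpus < 2:
        return 0.0
    return (num_gpus - 1) / num_gpus * buffer_bytes / link_bytes_per_s


if __name__ == "__main__":
    one_gb = 1e9
    # Assumed per-GPU bandwidths: a fast scale-up link vs. a slower scale-out link.
    for label, bw in [("scale-up, 100 GB/s", 1e11), ("scale-out, 50 GB/s", 5e10)]:
        print(label,
              ring_allreduce_seconds(one_gb, 8, bw),
              all_to_all_seconds(one_gb, 8, bw))
```

Under this model the communication time is inversely proportional to per‑GPU bandwidth, so keeping tensor‑parallel and MoE traffic inside the scale‑up domain shortens every collective.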
Multi‑Compute Fusion via DPU
Offloading network‑related processing from CPUs to programmable DPUs reduces data‑movement overhead and provides hierarchical low‑latency networking with unified control. The proposed five‑engine DPU architecture includes:
Compute Engine: Offloads I/O data and control paths, exposing standardized virtio‑net and virtio‑blk interfaces.
Storage Engine: Provides TCP/IP or RDMA‑based back‑ends for block, object, and file storage clusters.
Network Engine: Implements virtual switching with line‑rate forwarding; integrates RDMA to achieve ~400 Gbps inter‑node bandwidth.
Security Engine: Uses root‑of‑trust and IPsec‑like encryption for multi‑tenant traffic protection.
Control Engine: Unifies management of bare‑metal, VM, and container compute units for end‑to‑end orchestration.
China Mobile began development of the proprietary Panshi DPU in 2020, released the first version in 2021, and migrated the FPGA implementation to ASIC in 2024.
Extreme Performance‑to‑Power Ratio
Cooling: Deploy high‑density liquid‑cooled cabinets that house multiple liquid‑cooled GPU servers, improving space utilization over traditional air‑cooled racks.
Chip‑Level Optimizations: Use advanced process nodes (7 nm or smaller) to lower transistor power, redesign on‑chip buses, refine pipelines, apply fine‑grained voltage/frequency scaling, and employ clock gating. Software‑level monitoring and workload balancing further improve energy efficiency.
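The benefit of fine‑grained voltage/frequency scaling follows from the usual first‑order model of CMOS dynamic power, P_dyn ≈ α·C·V²·f. The sketch below applies that proportionality to a hypothetical operating point; it ignores static (leakage) power, and the 10% scaling factors are assumptions for illustration.

```python
def relative_dynamic_power(voltage_scale: float, frequency_scale: float) -> float:
    """Dynamic power relative to baseline, using P_dyn proportional to C * V^2 * f."""
    return voltage_scale ** 2 * frequency_scale


def relative_energy_per_op(voltage_scale: float) -> float:
    """Energy per operation relative to baseline (power / frequency, so V^2 only)."""
    return voltage_scale ** 2


if __name__ == "__main__":
    # Assumed operating point: 10% lower voltage and 10% lower clock.
    print(round(relative_dynamic_power(0.9, 0.9), 3))
    print(round(relative_energy_per_op(0.9), 3))
```

Because voltage enters quadratically, a modest voltage reduction saves more power than the accompanying loss in clock rate costs in throughput, which is the rationale for scaling both together.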
High‑Performance Converged Storage
Adopt multi‑protocol, auto‑tiered storage that serves the same data through NFS, S3, and POSIX interfaces over zero‑copy, zero‑conversion data paths, so that stages of the AI workflow can hand data off to one another without waiting on copies between storage systems.
High‑Throughput Cluster Storage
With a global file‑system architecture, the system can scale to more than 3,000 nodes and provide hundreds of petabytes of flash storage. Target metrics include 10 TB/s aggregate bandwidth, billions of IOPS, a >20% improvement in compute utilization, checkpoint restore times reduced from minutes to seconds, and 99.9999% reliability with strong consistency.
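A quick calculation shows what these targets imply per node and for checkpoint restore times. In the sketch below, the 14 bytes of checkpoint state per parameter (BF16 weights plus FP32 master weights and two FP32 Adam moments) is an assumption about the training setup, not a figure from the source, and the model assumes the full aggregate bandwidth is available to a single job.

```python
def per_node_bandwidth_gb_per_s(aggregate_tb_per_s: float, nodes: int) -> float:
    """Average bandwidth each storage node must sustain, in GB/s."""
    return aggregate_tb_per_s * 1000 / nodes


def transfer_seconds(size_tb: float, bandwidth_tb_per_s: float) -> float:
    """Ideal time to read or write `size_tb` terabytes at the given bandwidth."""
    return size_tb / bandwidth_tb_per_s


if __name__ == "__main__":
    params = 10**12                 # one trillion parameters
    bytes_per_param = 14            # assumed: 2 (BF16) + 4 + 4 + 4 (FP32 master + Adam)
    checkpoint_tb = params * bytes_per_param / 1e12
    print("checkpoint size (TB):", checkpoint_tb)
    print("restore at 10 TB/s (s):", transfer_seconds(checkpoint_tb, 10.0))
    print("restore at 0.1 TB/s (s):", transfer_seconds(checkpoint_tb, 0.1))
    print("per-node GB/s:", round(per_node_bandwidth_gb_per_s(10.0, 3000), 2))
```

Under these assumptions a full checkpoint moves in seconds at the target bandwidth but takes minutes at one hundredth of it, which is consistent with the minutes‑to‑seconds goal stated above.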
Large‑Scale Inter‑Node Reliable Network
The cluster comprises four logical networks: parameter, data, business, and management. The parameter plane demands the highest bandwidth and must be lossless. Mature solutions include InfiniBand (IB) and RoCE; emerging technologies such as Global‑Scheduled Ethernet (GSE) and Ultra Ethernet (from the Ultra Ethernet Consortium, UEC) aim to overcome the limits of conventional Ethernet for AI workloads.
Large‑Scale Topology
Two common topologies are recommended:
Spine‑Leaf (two‑layer): Leaf switches are organized in groups of eight, with the eight GPUs of a server each attached to one leaf in the group; full‑mesh links between spine and leaf switches give a 1:1 uplink‑to‑downlink convergence ratio.
Fat‑Tree (three‑layer): Leaf, spine, and core switches are fully meshed between adjacent tiers, providing uniform bandwidth and redundancy.
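The choice between the two topologies is largely a matter of scale. As a rough guide, the sketch below uses the textbook capacity formulas for non‑blocking (1:1) networks built from identical switches with k ports: k²/2 endpoints for two tiers and k³/4 for a three‑tier fat‑tree. It assumes one network port per GPU and an even port count; the function names are illustrative.

```python
def _check_radix(radix: int) -> None:
    if radix < 2 or radix % 2:
        raise ValueError("radix must be an even number of ports >= 2")


def two_tier_max_endpoints(radix: int) -> int:
    """Max endpoints of a non-blocking leaf-spine fabric: each leaf splits its
    ports half down, half up, and each spine port serves one leaf."""
    _check_radix(radix)
    return radix * radix // 2


def three_tier_max_endpoints(radix: int) -> int:
    """Max endpoints of a classic three-tier fat-tree built from radix-port switches."""
    _check_radix(radix)
    return radix ** 3 // 4


def tiers_needed(num_gpus: int, radix: int) -> int:
    """Smallest number of tiers (2 or 3) that fits `num_gpus` at 1:1 convergence."""
    if num_gpus <= two_tier_max_endpoints(radix):
        return 2
    if num_gpus <= three_tier_max_endpoints(radix):
        return 3
    raise ValueError("cluster too large for a three-tier fabric at this radix")


if __name__ == "__main__":
    for gpus in (1024, 10_000):
        print(gpus, "GPUs with 64-port switches ->", tiers_needed(gpus, 64), "tiers")
```

With 64‑port switches, for example, a two‑tier fabric tops out at 2,048 GPU ports, so clusters in the ten‑thousand‑GPU range need the third tier.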
Fault‑Tolerant High‑Performance Platform
The AI platform unifies compute, storage, and network resources under a cloud‑native container foundation, providing lifecycle management, topology‑aware scheduling, and end‑to‑end monitoring for AI tasks.
Checkpoint‑Based Fault Tolerance
Training large models relies on periodic checkpointing. A multi‑level checkpoint storage hierarchy keeps recent checkpoints in high‑speed memory and asynchronously flushes them to persistent storage. Upon failure, the system can reload in‑memory checkpoints directly, avoiding costly network I/O and reducing recovery time.
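The multi‑level idea can be sketched in a few lines of single‑process Python. This is a minimal illustration, not the platform's implementation: the class name, file name, and use of pickle are assumptions, and a production system would hold the in‑memory copy in host or peer‑node memory and write sharded checkpoints to a distributed store.

```python
import copy
import os
import pickle
import tempfile
import threading


class TwoLevelCheckpointer:
    """Keep the newest checkpoint in memory and flush it to disk in the background."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        self._memory = None      # (step, state) held in RAM
        self._flusher = None     # background thread doing the disk write

    def _path(self):
        return os.path.join(self.directory, "latest.ckpt")

    def save(self, step, state):
        snapshot = (step, copy.deepcopy(state))   # level 1: fast in-memory copy
        self._memory = snapshot
        self.wait()                               # allow one flush in flight at a time
        self._flusher = threading.Thread(target=self._flush, args=(snapshot,))
        self._flusher.start()                     # level 2: asynchronous persist

    def _flush(self, snapshot):
        tmp = self._path() + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp, self._path())             # atomic rename avoids torn files

    def wait(self):
        if self._flusher is not None:
            self._flusher.join()

    def drop_memory(self):
        """Simulate losing the in-memory copy, e.g. after a process restart."""
        self._memory = None

    def load(self):
        if self._memory is not None:              # fast path: no storage I/O
            return self._memory
        self.wait()
        if os.path.exists(self._path()):
            with open(self._path(), "rb") as f:
                return pickle.load(f)
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = TwoLevelCheckpointer(d)
        ckpt.save(100, {"weights": [0.1, 0.2]})
        print("from memory:", ckpt.load())
        ckpt.drop_memory()
        print("from disk:", ckpt.load())
```

Training only blocks for the in‑memory copy; the slower persistent write overlaps with subsequent steps, and recovery falls back to storage only when the in‑memory level is gone.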
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
