
How to Build a Super‑Scale AI Cluster: From GPU Power to DPU‑Driven Architecture

This article analyzes the technical roadmap for upgrading super‑large AI GPU clusters to support trillion‑parameter multimodal models. It covers single‑chip performance, super‑node scaling, DPU‑based compute fusion, energy‑efficient design, converged storage, high‑throughput networking, and fault‑tolerant checkpoint strategies.

Architects' Tech Alliance

Sep 15, 2024


Background

As AI models scale from hundred‑billion‑parameter language models to trillion‑parameter multimodal models, data‑center clusters must increase compute throughput, memory bandwidth, and energy efficiency. The upgrades span single‑chip performance, super‑node scaling, DPU‑enabled multi‑compute fusion, and extreme performance‑to‑power ratios.

Single‑Chip Capability

Improving a GPU’s compute and memory performance involves:

Adding more parallel processing units and raising clock rates while staying within power budgets.

Optimizing cache hierarchies to lower memory latency.

Adopting lower‑precision formats (e.g., FP8) via domain‑specific architectures (DSA) to boost arithmetic density.

Integrating custom accelerators for workload‑specific kernels.

Using high‑bandwidth memory (HBM) with 2.5D/3D stacking to provide large capacity and short data paths.
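
The interplay between compute units and memory bandwidth in the list above can be illustrated with a roofline estimate. The following is a minimal sketch, not a model of any specific GPU: the peak throughput and HBM bandwidth figures are hypothetical, and the function name is chosen for illustration only.

```python
def attainable_tflops(peak_tflops: float, hbm_tb_per_s: float, flops_per_byte: float) -> float:
    """Roofline bound: a kernel is capped by compute or by memory traffic, whichever is lower."""
    # TB/s * FLOP/byte = TFLOP/s that memory can feed to the compute units
    memory_bound_tflops = hbm_tb_per_s * flops_per_byte
    return min(peak_tflops, memory_bound_tflops)


# Hypothetical accelerator: 1000 TFLOP/s peak, 3 TB/s of HBM bandwidth (assumed values).
PEAK_TFLOPS = 1000.0
HBM_TB_PER_S = 3.0

# Arithmetic intensity (FLOP/byte) above which a kernel stops being memory bound.
ridge_point = PEAK_TFLOPS / HBM_TB_PER_S

low_intensity = attainable_tflops(PEAK_TFLOPS, HBM_TB_PER_S, 50.0)    # memory bound
high_intensity = attainable_tflops(PEAK_TFLOPS, HBM_TB_PER_S, 500.0)  # compute bound
print(f"ridge point: {ridge_point:.0f} FLOP/byte")
print(f"50 FLOP/byte kernel:  {low_intensity:.0f} TFLOP/s")
print(f"500 FLOP/byte kernel: {high_intensity:.0f} TFLOP/s")
```

Under these assumed numbers, adding parallel units only helps kernels to the right of the ridge point, while faster HBM or better caching helps those to the left, which is why the list pairs compute improvements with memory improvements.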

Super‑Node Computing

Move beyond traditional 8‑GPU servers by building super‑node servers that enlarge the GPU southbound scale‑up interconnect domain, which improves tensor‑parallel and MoE (expert)‑parallel efficiency.

Embed scale‑up‑capable switch chips inside nodes to raise point‑to‑point (P2P) bandwidth.

Redesign GPU‑to‑GPU interconnect protocols to raise all‑to‑all efficiency and reduce latency: new packet formats, support for co‑packaged and near‑packaged optics (CPO/NPO), higher SerDes rates, enhanced congestion control, and multi‑chip chip‑to‑chip (C2C) packaging.
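
To see why scale‑up bandwidth matters for tensor‑parallel and MoE traffic, the following sketch estimates collective times from bandwidth alone. It assumes a ring all‑reduce and a non‑blocking fabric for all‑to‑all, ignores latency and congestion, and uses hypothetical bandwidth and message sizes; the function names are illustrative.

```python
def ring_allreduce_seconds(message_bytes: float, gpus: int, link_bytes_per_s: float) -> float:
    """Bandwidth term of a ring all-reduce: each GPU sends 2*(N-1)/N of the message."""
    return 2 * (gpus - 1) / gpus * message_bytes / link_bytes_per_s


def all_to_all_seconds(message_bytes: float, gpus: int, link_bytes_per_s: float) -> float:
    """Bandwidth term of an all-to-all: each GPU sends (N-1)/N of its buffer to peers."""
    return (gpus - 1) / gpus * message_bytes / link_bytes_per_s


# Assumed values for illustration only.
MESSAGE = 1e9      # 1 GB exchanged per GPU per collective
SCALE_UP = 400e9   # 400 GB/s per GPU inside a super-node (hypothetical)
SCALE_OUT = 50e9   # 400 Gbps NIC = 50 GB/s per GPU across nodes

for name, bw in (("scale-up", SCALE_UP), ("scale-out", SCALE_OUT)):
    ar = ring_allreduce_seconds(MESSAGE, 8, bw) * 1e3
    a2a = all_to_all_seconds(MESSAGE, 8, bw) * 1e3
    print(f"{name}: all-reduce {ar:.2f} ms, all-to-all {a2a:.2f} ms")
```

With these assumptions the same collective takes eight times longer over the scale‑out path, which is the motivation for keeping a tensor‑parallel or expert‑parallel group inside one scale‑up domain.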

Multi‑Compute Fusion via DPU

Offloading network‑related processing from CPUs to programmable DPUs reduces data‑movement overhead and provides hierarchical low‑latency networking with unified control. The proposed five‑engine DPU architecture includes:

Compute Engine: Offloads I/O data and control paths, exposing standardized virtio‑net and virtio‑blk interfaces.

Storage Engine: Provides TCP/IP or RDMA‑based back‑ends for block, object, and file storage clusters.

Network Engine: Implements virtual switching and forwards traffic at full line rate; integrates RDMA to reach roughly 400 Gbps of inter‑node bandwidth.

Security Engine: Uses root‑of‑trust and IPsec‑like encryption for multi‑tenant traffic protection.

Control Engine: Unifies management of bare‑metal, VM, and container compute units for end‑to‑end orchestration.

China Mobile began development of the proprietary Panshi DPU in 2020, released the first version in 2021, and migrated the FPGA implementation to ASIC in 2024.

Extreme Performance‑to‑Power Ratio

Cooling: Deploy high‑density liquid‑cooled cabinets that house multiple liquid‑cooled GPU servers, improving space utilization over traditional air‑cooled racks.

Chip‑Level Optimizations: Use advanced process nodes (7 nm or smaller) to lower transistor power, redesign on‑chip buses, refine pipelines, apply fine‑grained voltage/frequency scaling, and employ clock gating. Software‑level monitoring and workload balancing further improve energy efficiency.
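
The effect of voltage/frequency scaling and clock gating can be sketched with the standard first‑order model of dynamic CMOS power, P = a · C · V² · f. The capacitance, voltage, and frequency values below are hypothetical, and leakage power is ignored.

```python
def dynamic_power_watts(c_eff_farads: float, voltage: float, freq_hz: float,
                        activity: float = 1.0) -> float:
    """First-order dynamic power: activity factor * capacitance * V^2 * f."""
    return activity * c_eff_farads * voltage ** 2 * freq_hz


# Hypothetical baseline operating point.
C_EFF = 1e-7   # effective switched capacitance in farads (assumed)
base = dynamic_power_watts(C_EFF, voltage=1.0, freq_hz=2.0e9)

# DVFS: lower voltage and frequency together by 10%.
scaled = dynamic_power_watts(C_EFF, voltage=0.9, freq_hz=1.8e9)

# Clock gating: idle blocks stop toggling, modeled here as a lower activity factor.
gated = dynamic_power_watts(C_EFF, voltage=1.0, freq_hz=2.0e9, activity=0.7)

power_ratio = scaled / base        # 0.9^2 * 0.9 = 0.729
perf_ratio = 1.8e9 / 2.0e9         # throughput of a compute-bound kernel tracks frequency
print(f"DVFS power: {power_ratio:.3f}x, perf: {perf_ratio:.2f}x, "
      f"perf/W: {perf_ratio / power_ratio:.3f}x")
print(f"clock-gated power: {gated / base:.2f}x")
```

In this model a 10% reduction in voltage and frequency cuts dynamic power by about 27% while costing about 10% of throughput for a compute‑bound kernel, so performance per watt improves.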

High‑Performance Converged Storage

Adopt multi‑protocol, auto‑tiered storage that simultaneously supports NFS, S3, and POSIX interfaces with zero‑copy and zero‑conversion data paths, so that each AI workflow stage can use the previous stage's output in place, without waiting for data to be copied or converted between stages.

High‑Throughput Cluster Storage

Using a global file‑system architecture, the system can scale to more than 3,000 nodes, delivering hundreds of petabytes of flash storage. Target metrics include 10 TB/s aggregate bandwidth, billions of IOPS, a more than 20% improvement in compute utilization, checkpoint restore times reduced from minutes to seconds, and 99.9999% reliability with strong consistency.
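
A back‑of‑the‑envelope calculation shows how aggregate bandwidth relates to the "minutes to seconds" restore target. This sketch assumes 14 bytes of checkpoint state per parameter (a common figure for mixed‑precision training with Adam, not stated in the source) and a hypothetical 100 GB/s baseline for comparison.

```python
def checkpoint_bytes(parameters: int, bytes_per_parameter: int) -> int:
    """Total checkpoint size: weights plus optimizer state."""
    return parameters * bytes_per_parameter


def restore_seconds(size_bytes: float, aggregate_bytes_per_s: float) -> float:
    """Lower bound on restore time when all nodes read in parallel at full bandwidth."""
    return size_bytes / aggregate_bytes_per_s


PARAMS = 10 ** 12            # trillion-parameter model
BYTES_PER_PARAM = 14         # assumption: 2 (fp16 weights) + 4 (fp32 master) + 8 (Adam moments)
size = checkpoint_bytes(PARAMS, BYTES_PER_PARAM)

slow = restore_seconds(size, 100e9)   # hypothetical 100 GB/s storage back-end
fast = restore_seconds(size, 10e12)   # the 10 TB/s aggregate target from the text
print(f"checkpoint size: {size / 1e12:.0f} TB")
print(f"restore at 100 GB/s: {slow:.0f} s, at 10 TB/s: {fast:.1f} s")
```

Under these assumptions a 14 TB checkpoint takes over two minutes to read at 100 GB/s and under two seconds at 10 TB/s, which matches the order of magnitude of the stated target.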

Large‑Scale Inter‑Node Reliable Network

The cluster comprises four logical networks: parameter, data, business, and management. The parameter plane demands the highest bandwidth and lossless delivery. Mature solutions include InfiniBand (IB) and RoCE; emerging technologies such as Global Scheduling Ethernet (GSE) and Ultra Ethernet (from the Ultra Ethernet Consortium, UEC) aim to overcome the limits of conventional Ethernet for AI workloads.

Large‑Scale Topology

Two common topologies are recommended:

Spine‑Leaf (two‑layer): Eight leaf switches form a group, and the eight GPUs of each server in the group connect to those eight leaf switches. Spine and leaf switches are fully meshed, and each leaf switch has a 1:1 uplink‑to‑downlink convergence ratio.

Fat‑Tree: Leaf, spine, and core switches are fully meshed between adjacent tiers (leaf to spine, spine to core), ensuring uniform bandwidth and redundancy.
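
How far each topology scales depends on the switch port count. The sketch below uses the standard sizing formulas for non‑blocking (1:1) Clos networks built from identical switches; the 64‑port radix is a hypothetical example, and real deployments may reserve ports or use different switch models per tier.

```python
def two_tier_max_endpoints(radix: int) -> int:
    """Non-blocking spine-leaf: each leaf uses half its ports down and half up.

    Each leaf has radix/2 uplinks, one per spine, so there are radix/2 spines;
    each spine has `radix` ports, so it can reach at most `radix` leaves.
    """
    leaves = radix
    downlinks_per_leaf = radix // 2
    return leaves * downlinks_per_leaf


def three_tier_max_endpoints(radix: int) -> int:
    """Non-blocking three-tier fat-tree of identical switches: radix^3 / 4 endpoints."""
    return radix ** 3 // 4


RADIX = 64  # hypothetical 64-port switch
print(f"two-tier spine-leaf: {two_tier_max_endpoints(RADIX)} GPU ports")
print(f"three-tier fat-tree: {three_tier_max_endpoints(RADIX)} GPU ports")
```

With the assumed 64‑port switches, a two‑tier design tops out at a few thousand GPU ports, so clusters beyond that size need the third (core) tier.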

Fault‑Tolerant High‑Performance Platform

The AI platform unifies compute, storage, and network resources under a cloud‑native container foundation, providing lifecycle management, topology‑aware scheduling, and end‑to‑end monitoring for AI tasks.

Checkpoint‑Based Fault Tolerance

Training large models relies on periodic checkpointing. A multi‑level checkpoint storage hierarchy keeps recent checkpoints in high‑speed memory and asynchronously flushes them to persistent storage. Upon failure, the system can reload in‑memory checkpoints directly, avoiding costly network I/O and reducing recovery time.
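
The two‑level idea can be sketched in a few lines. This is a single‑process illustration, not the platform's implementation: the class name and file layout are hypothetical, the training state is a plain dictionary, and the "in‑memory" tier lives in the same process, whereas a real system would keep it somewhere that survives the failure (for example, host memory of a peer node).

```python
import copy
import os
import pickle
import queue
import tempfile
import threading


class TwoLevelCheckpointer:
    """Keep the latest snapshot in memory and flush copies to disk in the background."""

    def __init__(self, directory: str):
        self.directory = directory
        self._latest = None                       # (step, state) held in memory
        self._lock = threading.Lock()
        self._pending = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def save(self, step: int, state: dict) -> None:
        snapshot = copy.deepcopy(state)           # decouple from the live training state
        with self._lock:
            self._latest = (step, snapshot)
        self._pending.put((step, snapshot))       # returns at once; the disk write is async

    def _flush_loop(self) -> None:
        while True:
            step, snapshot = self._pending.get()
            path = os.path.join(self.directory, f"ckpt_{step}.pkl")
            with open(path + ".tmp", "wb") as f:
                pickle.dump(snapshot, f)
            os.replace(path + ".tmp", path)       # atomic rename: no half-written files
            self._pending.task_done()

    def wait(self) -> None:
        """Block until every queued snapshot has reached persistent storage."""
        self._pending.join()

    def restore(self):
        """Prefer the in-memory snapshot; fall back to the newest file on disk."""
        with self._lock:
            if self._latest is not None:
                return self._latest
        files = [n for n in os.listdir(self.directory)
                 if n.startswith("ckpt_") and n.endswith(".pkl")]
        if not files:
            return None
        newest = max(files, key=lambda n: int(n[5:-4]))
        with open(os.path.join(self.directory, newest), "rb") as f:
            return int(newest[5:-4]), pickle.load(f)


with tempfile.TemporaryDirectory() as ckpt_dir:
    ckpt = TwoLevelCheckpointer(ckpt_dir)
    ckpt.save(100, {"weights": [0.1, 0.2]})
    ckpt.save(200, {"weights": [0.3, 0.4]})
    fast_path = ckpt.restore()                    # served from memory, no file I/O
    ckpt.wait()
    slow_path = TwoLevelCheckpointer(ckpt_dir).restore()  # new instance: reads from disk
    print(fast_path, slow_path)
```

The training loop only pays for the in‑memory copy in `save`; the slower persistent write happens on the background thread, and recovery touches storage only when no in‑memory snapshot is available.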


