Why AI Chips Need High‑Speed Networks: From Scaling Laws to DPU Evolution
This report analyzes how the convergence of Moore's law slowdown and large‑model scaling laws creates a feedback loop between compute power and intelligence, driving the emergence of AI‑specific chips, high‑speed networking, and DPU architectures that together reshape modern AI infrastructure.
1. The Big Background of Computing Power and Intelligence
We are in an era where two scaling laws collide: Moore's law, which has slowed noticeably since around 2015, and the large-model scaling law, under which more parameters, more compute, and larger datasets yield higher prediction accuracy, often summarized as "scale works miracles". The result is a closed loop in which compute and intelligence reinforce each other.
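To make the model-side half of this loop concrete, here is a minimal sketch of the power-law relationship between scale and predicted loss. The functional form follows the shape of published scaling-law fits; the constants are illustrative placeholders, not values taken from this report.

```python
def predicted_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Toy power-law scaling: loss falls as a power of parameter count.
    n_c and alpha are illustrative constants, not fitted values."""
    return (n_c / n_params) ** alpha

for n in (1e9, 1e10, 1e11, 1e12):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
# Loss keeps improving with scale ("scale works miracles"), while the compute needed
# to reach each new point grows much faster than post-2015 Moore's law can deliver.
```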
Compute originates from various forms of parallel systems, involving application-level concerns such as data sharing, synchronization, consistency, task partitioning, scheduling, and fault tolerance, as well as physical-level issues such as bandwidth, latency, network topology, protocols, distance, energy consumption, and cooling. This report focuses on the challenges and changes high-speed networks face in this "compute is intelligence" era.
2. AI Chips
AI requires "AI chips", but the term should cover more than just GPUs or NPUs. CPUs, although essential for AI infrastructure, are not called AI chips because they predate AI technology. The truly transformative chips are the compute‑intensive GPU/NPU and the high‑speed I/O DPU/IPU/NIC, whose evolution is tightly linked to AI progress.
We define an AI chip as a chip or chipset whose architecture matches the computational characteristics of specific AI algorithms: model, data, and pipeline parallelism; precision adaptability; iterative structure; probabilistic versus deterministic behavior; memory footprint; nonlinear operations; and more. On the hardware side this covers compute units, cache structure, array scale, interconnect topology, I/O bandwidth, instruction set, scalability, virtualization, latency, power, and reliability.
According to this definition, AI chips include the familiar GPUs/NPUs and also the network chips that connect them. Effective compute equals raw compute multiplied by network efficiency; neither factor is dispensable.
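A minimal sketch of that multiplication, with an assumed peak throughput and assumed network-efficiency values chosen purely for illustration:

```python
def effective_tflops(peak_tflops: float, network_efficiency: float) -> float:
    """Peak accelerator throughput discounted by time spent waiting on communication."""
    return peak_tflops * network_efficiency

peak = 1000.0  # hypothetical GPU/NPU peak, in TFLOPS
for eff in (0.9, 0.5, 0.2):
    print(f"network efficiency {eff:.0%} -> {effective_tflops(peak, eff):.0f} effective TFLOPS")
# The same silicon delivers very different useful compute depending on the network,
# which is why this report counts network chips as AI chips in their own right.
```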
Analogy: CPU is the brain, GPU the muscle, and DPU the neural center. The CPU provides a universal platform, the GPU supplies massive precision‑specific compute, and the DPU ensures efficient data flow between CPUs and GPUs, acting as the root node of the data network.
3. Ultra‑Parallel GPU/NPU Architecture
Consider a simple processing element (PE) that performs 128 INT8 MAC operations per clock. 128 PEs form a Group, 8 Groups form a Cluster, and a chip contains 4 Clusters, for 4,096 PEs in total. At a 1 GHz clock, peak performance reaches about 512 TOPS @ INT8 (4,096 PEs × 128 MACs × 1 GHz), on par with current flagship AI chips.
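The peak figure follows directly from the hierarchy just described; the sketch below simply reproduces the arithmetic.

```python
macs_per_pe_per_cycle = 128        # INT8 MACs each PE completes per clock
pes_per_group = 128
groups_per_cluster = 8
clusters_per_chip = 4
clock_hz = 1e9                     # 1 GHz

total_pes = pes_per_group * groups_per_cluster * clusters_per_chip      # 4,096 PEs
macs_per_second = total_pes * macs_per_pe_per_cycle * clock_hz          # 524,288 GMAC/s
print(f"{total_pes} PEs -> {macs_per_second / 1e12:.0f} TMAC/s peak")
# 524,288 GMAC/s is exactly 512 "T" if 1 T is taken as 1,024 G (about 524 TMAC/s in
# decimal units); counting a MAC as two operations would double the TOPS figure.
```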
Actual performance falls below this peak because of algorithmic variation and data dependencies. Designing an ideal PE is hard: the PE must match the instruction streams of applications that are themselves still evolving, a classic chicken-and-egg problem whose uncertainty also shapes the memory hierarchy and bandwidth design.
Higher compute demands more PEs, larger on‑chip memory (HBM), and greater I/O bandwidth. A practical rule of thumb is the "10× rule": HBM bandwidth is roughly ten times the I/O bandwidth, and the inter‑PE network bandwidth is ten times the HBM bandwidth. For example, an 800 Gbps I/O link implies an HBM bandwidth on the order of 8 Tbps.
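Applying the 10x rule to the 800 Gbps example (the multipliers are the rule of thumb itself, not measured figures):

```python
io_bandwidth_gbps = 800                                # per-chip high-speed I/O
hbm_bandwidth_gbps = 10 * io_bandwidth_gbps            # ~8 Tbps of HBM bandwidth
pe_fabric_bandwidth_gbps = 10 * hbm_bandwidth_gbps     # ~80 Tbps across the on-chip PE network
print(f"I/O {io_bandwidth_gbps} Gbps -> HBM ~{hbm_bandwidth_gbps / 1000:.0f} Tbps "
      f"-> PE fabric ~{pe_fabric_bandwidth_gbps / 1000:.0f} Tbps")
```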
4. AI Networks Drive DPU Growth
AI chips without data are useless; data has to flow in and out through high-speed I/O. Over the past decade, I/O performance has grown faster than CPU performance but slower than GPU compute, which makes high-bandwidth I/O a key driver of further GPU advancement.
Mixture-of-Experts (MoE) models introduce intensive all-to-all communication. Under expert parallelism, tokens are dispatched to the GPUs hosting their selected experts and later gathered back, consuming about 47% of inference time in Qwen2-style models. This communication load fuels the need for Scale-Up networks.
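A toy sketch of the dispatch/gather pattern under expert parallelism. The round-robin expert-to-rank mapping and the stand-in for the all-to-all collective are assumptions for illustration; the point is that every MoE layer needs two all-to-all phases.

```python
from collections import defaultdict


def all_to_all(send_buffers: dict) -> dict:
    """Toy stand-in for an all-to-all collective: bucket i is delivered to rank i."""
    return send_buffers


def moe_layer(tokens: list, n_ranks: int) -> list:
    # Dispatch: bucket each token by the rank assumed to host its selected expert.
    dispatch = defaultdict(list)
    for expert_id, tok in tokens:
        dispatch[expert_id % n_ranks].append(tok)
    routed = all_to_all(dispatch)                                  # all-to-all #1: dispatch
    processed = {r: [f"{t}@rank{r}" for t in toks] for r, toks in routed.items()}
    gathered = all_to_all(processed)                               # all-to-all #2: gather
    return [t for toks in gathered.values() for t in toks]


# Three tokens routed to experts 0, 3, and 5 across 4 expert-parallel ranks.
print(moe_layer([(0, "t0"), (3, "t1"), (5, "t2")], n_ranks=4))
```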
Various high‑speed links (NVLink, UALink, EtherLink, ALink) have emerged because traditional networking cannot meet AI’s qualitative demands.
5. Scale‑Out vs. Scale‑Up
When LLM parameters reach trillions, a single GPU/NPU cannot hold a full model or its training state, making multi‑GPU/NPU collaboration essential. Networks become a hard requirement.
The AI/HPC network landscape can be divided into three layers: Frontend (data‑center networking), Backend Scale‑Out (server‑to‑server or node‑to‑node interconnects, typically Ethernet or InfiniBand), and Backend Scale‑Up (short‑distance, ultra‑high‑bandwidth interconnects like NVLink, Infinity Fabric, UALink). Scale‑Up provides bandwidth an order of magnitude higher than Scale‑Out and sub‑microsecond latency.
Scale-Out and Scale-Up are both crucial and are not interchangeable. LLM training efficiency depends heavily on communication overhead, especially the All-Reduce and All-to-All primitives, which require predictable low-latency, high-bandwidth paths.
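As a back-of-envelope illustration, the sketch below uses the standard ring All-Reduce volume formula with a hypothetical 70B-parameter model holding BF16 gradients; the figures are assumptions, not measurements.

```python
def ring_allreduce_bytes_per_worker(grad_bytes: float, n_workers: int) -> float:
    """Each worker sends (and receives) about 2*(N-1)/N times the gradient size per step."""
    return 2 * (n_workers - 1) / n_workers * grad_bytes

grad_bytes = 70e9 * 2  # hypothetical 70B parameters, 2 bytes each in BF16
for n in (8, 64, 1024):
    gb = ring_allreduce_bytes_per_worker(grad_bytes, n) / 1e9
    print(f"{n} workers: ~{gb:.0f} GB sent per worker per synchronization")
# The per-worker volume is nearly constant as the cluster grows, so step time is set by
# per-link bandwidth and latency predictability, exactly what these networks must provide.
```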
6. Common Technologies of Scale‑Up
Ultra‑high bandwidth (Tbps) and ultra‑low latency (sub‑microsecond).
Memory‑semantic access allowing XPU‑to‑XPU load/store/atomic operations directly on remote memory.
Single-hop, fully connected or mesh topologies to avoid multi-hop latency spikes (see the link-count sketch after this list).
Built‑in reliability mechanisms (FEC, link‑level retransmission, flow control, ACK/NAK, CRC).
Tight hardware integration with the compute chip, making the interconnect part of the SoC.
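To see why the single-hop, fully connected requirement keeps Scale-Up domains small, here is a quick link-count calculation; the XPU counts are arbitrary examples.

```python
def full_mesh_links(n_xpus: int) -> int:
    """A full mesh over N endpoints needs N*(N-1)/2 point-to-point links."""
    return n_xpus * (n_xpus - 1) // 2

for n in (8, 16, 64, 256):
    print(f"{n} XPUs -> {full_mesh_links(n)} direct links")
# 8 XPUs need 28 links, feasible on a single board; 256 would need 32,640, which is why
# larger Scale-Up domains lean on switch chips while still presenting single-hop latency.
```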
7. The Evolution of DPU
The DPU should be seen as a network-side carrier that connects resources of every kind, whether physical or virtual. Three variants can be identified:
DPU-Endpoint: a stand-alone device on a server's high-speed bus (PCIe) that extends NIC functionality with extensive offload capabilities.
DPU-Switch: a central switching element that, together with DPU-Endpoints, builds a fully connected, lossless network topology (the "Smart Edge, Dumb Core" model).
DPU-Phy: integrated directly with the compute chip, providing a native high-speed network interface that can connect to external switches, exemplified by Broadcom's SUE protocol.
8. Product Landscape and Outlook
The Chinese company 中科驭数 (Zhongke Yushu) focuses on high-speed network data-processing chips, aiming to unify remote resource access, hardware virtualization, data security, and system operation. Its product lines include:
High‑performance NICs compatible with domestic CPUs (e.g., FlexFlow 2200T).
Ultra‑low‑latency cards for financial trading (e.g., Swift 2200N, 2502N, NDPP X500).
Cloud‑native data‑flow offload cards (e.g., Conflux 2200E, 2200P).
AI‑cluster backend cards supporting both Scale‑Out RDMA and Scale‑Up high‑performance links (e.g., FlexFlow 2200R).
These products span 25 Gbps to 200 Gbps of bandwidth, support x86 and domestic CPUs, and target cloud data centers, AI clusters, financial computing, 5G edge, HPC, and high-speed storage. While the company is still closing the gap with foreign competitors, demand is expected to remain strong, positioning DPU technology as a pivotal component of next-generation computing architecture.