How ZCube Redefines 20‑Year‑Old Networking Logic to Boost GPU Throughput by 15%

ZCube, a flat networking architecture that Zhipu has deployed in its GLM‑5.1 inference cluster, targets the congestion caused by the topology itself. Zhipu reports a 15% throughput gain, a 40.6% reduction in TTFT P99 latency, and one‑third lower switch and optical‑module cost without adding GPUs, a sign that AI infrastructure is shifting its focus from raw compute to system efficiency.

Machine Heart

May 21, 2026

ZCube Overview and Motivation

AI development has entered a phase where scaling hardware alone is insufficient; network links become a critical performance variable once GPU clusters reach large sizes. On May 5, 2026, OpenAI, together with NVIDIA, AMD, Intel, Microsoft, and Broadcom, released the Multipath Reliable Connection (MRC) protocol via the Open Compute Project, targeting ultra‑large AI clusters such as the NVIDIA GB200 supercomputer used for training ChatGPT.

Zhipu was the first to implement the next‑generation networking architecture, ZCube, in its production GLM‑5.1 inference cluster, achieving a 15% throughput increase, a 40.6% reduction in TTFT P99 latency, and a one‑third reduction in switch and optical module costs, all without adding GPUs or changing application code.

Traditional Networking vs. Inference Traffic

Conventional data‑center traffic is close to statistically uniform, which is why Fat‑Tree/Clos architectures that rely on ECMP (equal‑cost multipath) hashing for load balancing became the default. These designs work well for training workloads, whose communication patterns are relatively fixed.
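A minimal sketch of why hash‑based ECMP can load links unevenly when traffic consists of a few large flows. The flow tuples, the uplink count, and the use of CRC32 as the hash function are illustrative assumptions, not details of any particular switch.

```python
import zlib
from collections import Counter

def ecmp_pick(flow, n_paths):
    """Pick an uplink by hashing the flow 5-tuple (illustrative hash: CRC32)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % n_paths

def link_loads(flows, n_paths):
    """Sum flow sizes per uplink; flows are (5-tuple, size) pairs."""
    loads = Counter({p: 0 for p in range(n_paths)})
    for five_tuple, size in flows:
        loads[ecmp_pick(five_tuple, n_paths)] += size
    return loads

# Hypothetical workload: 8 large flows of 100 units over 8 equal-cost uplinks.
flows = [(("10.0.0.%d" % i, "10.0.1.%d" % i, 4791, 4791, "UDP"), 100)
         for i in range(8)]
loads = link_loads(flows, n_paths=8)
print(sorted(loads.values(), reverse=True))
```

An even split would put 100 units on every uplink, but nothing in the hash guarantees it: any uplink that receives two flows carries 200 units while another sits idle.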

Inference workloads, however, separate Prefill and Decode stages, creating highly asymmetric and dynamic traffic. KV‑Cache data flows between Prefill and Decode nodes with widely varying context lengths, causing unpredictable load distribution across GPUs and resulting in two types of congestion.

Two Congestion Categories

Unavoidable congestion: Multiple GPUs target the same destination, causing contention at the final hop. This is a physical limitation mitigated by congestion control and traffic shaping.

Avoidable congestion: Caused by topology and traffic mapping that funnel traffic to a few leaf switches, creating hotspots even when total bandwidth is sufficient. This stems from architectural design and cannot be solved by parameter tuning alone.
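The distinction between the two categories can be sketched as a simple check over a set of flows. This is a simplified model under stated assumptions: each flow is described only by its destination GPU, the single leaf switch it crosses, and its rate, and the capacities are hypothetical values in the same units.

```python
from collections import defaultdict

def classify_congestion(flows, nic_capacity, leaf_capacity, n_leaves):
    """flows: list of (dst_gpu, leaf_id, rate) in the same units as the capacities."""
    per_dst = defaultdict(float)
    per_leaf = defaultdict(float)
    for dst, leaf, rate in flows:
        per_dst[dst] += rate
        per_leaf[leaf] += rate

    # Last-hop contention: demand toward one GPU exceeds what its NIC can absorb.
    unavoidable = sorted(d for d, load in per_dst.items() if load > nic_capacity)

    # Hotspot: a leaf is overloaded although the fabric as a whole has headroom.
    total = sum(per_leaf.values())
    has_headroom = total <= leaf_capacity * n_leaves
    avoidable = sorted(l for l, load in per_leaf.items()
                       if load > leaf_capacity and has_headroom)
    return unavoidable, avoidable

# Three flows to different GPUs, all funneled through leaf 0.
hotspot = [(0, 0, 200.0), (1, 0, 200.0), (2, 0, 200.0)]
# Three flows through different leaves, all targeting GPU 7.
incast = [(7, 0, 200.0), (7, 1, 200.0), (7, 2, 200.0)]

print(classify_congestion(hotspot, 400, 400, 4))  # ([], [0])
print(classify_congestion(incast, 400, 400, 4))   # ([7], [])
```

In the first case the total load (600) fits comfortably in the fabric (1,600), so a different mapping would remove the hotspot; in the second, no mapping helps because the destination NIC is the limit.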

ZCube Design Logic

ZCube eliminates the root cause of avoidable congestion through three architectural layers.

Layer 1 – Flat Topology

A traditional Clos network separates spine and leaf layers, so traffic between leaves must traverse a spine switch, which adds latency and congestion risk. ZCube removes the spine layer and connects the leaf switches as a complete bipartite graph: every odd‑indexed leaf links to every even‑indexed leaf. The network diameter becomes two hops, a middle ground between a single‑layer network (1 hop, limited scale) and a two‑layer Clos (3 hops, higher latency).
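A small sketch of the leaf‑level graph described here, assuming one link between each odd/even leaf pair. The leaf count of 8 is arbitrary, and how link counts map onto the two‑hop figure depends on the hop‑counting convention, which is not spelled out.

```python
from collections import deque

def build_leaf_graph(n_leaves):
    """Complete bipartite graph: every odd-indexed leaf links to every even-indexed leaf."""
    odd = [i for i in range(n_leaves) if i % 2 == 1]
    even = [i for i in range(n_leaves) if i % 2 == 0]
    adj = {i: set() for i in range(n_leaves)}
    for o in odd:
        for e in even:
            adj[o].add(e)
            adj[e].add(o)
    return adj

def leaf_distance(adj, src, dst):
    """Number of leaf-to-leaf links on a shortest path, found by BFS."""
    seen = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return seen[node]
        for nxt in adj[node]:
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return None  # unreachable

adj = build_leaf_graph(8)
print(leaf_distance(adj, 1, 4))  # 1: odd leaf to even leaf, direct link
print(leaf_distance(adj, 1, 3))  # 2: odd leaf to odd leaf, via any even leaf
```

No pair of leaves is more than two links apart, and no third switching layer is needed to achieve that.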

Layer 2 – Mixed Single‑Track and Multi‑Track Access

Each GPU NIC has two ports: one follows a “multi‑track” pattern (identical‑indexed GPUs connect to the same odd leaf), and the other follows a “single‑track” pattern (consecutive GPUs connect to the same even leaf). This results in a unique optimal path between any two GPUs, removing the need for multi‑path routing and its associated load‑balancing errors.
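One way to picture the two access patterns is as a mapping from a GPU to the pair of leaves it attaches to. The indexing formulas, the 8 GPUs per server, and the 128 downlinks per leaf below are assumptions made for illustration; the article does not specify the actual wiring rule.

```python
from typing import NamedTuple

class Attachment(NamedTuple):
    odd_leaf: int   # reached through the multi-track port
    even_leaf: int  # reached through the single-track port

def attach(gpu_id, gpus_per_server=8, downlinks_per_leaf=128):
    """Hypothetical wiring rule for one GPU, identified by a global id."""
    server, local = divmod(gpu_id, gpus_per_server)
    # Multi-track: the same local index across a block of servers shares one odd leaf.
    block = server // downlinks_per_leaf
    odd_leaf = 2 * (block * gpus_per_server + local) + 1
    # Single-track: runs of consecutive GPU ids share one even leaf.
    even_leaf = 2 * (gpu_id // downlinks_per_leaf)
    return Attachment(odd_leaf, even_leaf)

def shared_leaves(a, b, **kw):
    """Leaves that both GPUs attach to directly (empty set if none)."""
    return set(attach(a, **kw)) & set(attach(b, **kw))

print(attach(0), attach(1), attach(128))
print(shared_leaves(0, 1))     # neighbors in one server share an even leaf
print(shared_leaves(0, 128))   # same local index, different server: odd leaf
print(shared_leaves(0, 1024))  # no common leaf: traffic crosses an odd-even link
```

Under this assumed rule, GPUs in the same server meet on an even leaf, same‑indexed GPUs in nearby servers meet on an odd leaf, and everything else crosses exactly one leaf‑to‑leaf link.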

Layer 3 – Cost‑Effective Scalability and Fault Tolerance

By eliminating the spine, ZCube reduces the number of switches and optical modules by roughly one‑third for the same cluster size. With a 400 Gb/s network, a single‑layer ZCube can interconnect 16 384 GPUs; using next‑generation 102.4 Tb/s switches and four‑port ConnectX‑8 NICs, the architecture can scale to 65 536 GPUs. Fault tolerance improves because the flat topology lacks hard isolation planes, lowering the probability of an unreachable GPU by over 50% compared to dual‑plane Clos networks.
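The scale limit of a spine‑less fabric follows from port arithmetic. The sketch below assumes that each leaf splits its ports evenly between GPU‑facing downlinks and leaf‑to‑leaf links, and that every odd leaf has exactly one link to every even leaf; the radix of 256 (for example, a 51.2 Tb/s switch run as 256 ports of 200 Gb/s) is also an assumption. None of these parameters is stated in the article.

```python
def max_gpus(switch_radix, ports_per_gpu):
    """Upper bound on GPUs for a spine-less, complete-bipartite leaf fabric.

    Assumptions (not from the article): ports are split evenly between
    downlinks and peer links, with one link per odd/even leaf pair.
    """
    downlinks = switch_radix // 2
    peer_links = switch_radix - downlinks
    leaves_per_side = peer_links      # one link to each leaf on the other side
    total_leaves = 2 * leaves_per_side
    return total_leaves * downlinks // ports_per_gpu

print(max_gpus(switch_radix=256, ports_per_gpu=2))  # 16384
```

Under these assumed parameters the bound coincides with the 16,384 figure quoted above, although the match may be coincidental. The 65,536 figure depends on switch and NIC port configurations that are not detailed, so it is not derived here.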

Experimental Validation

The GLM‑5.1 coding inference cluster was used as a clean testbed: GPU models, software stack, and application code remained unchanged; only the networking topology switched from traditional ROFT to ZCube.

Throughput increase: Over 15% more API requests per second on identical hardware.

Latency reduction: TTFT P99 dropped by 40.6%, improving user‑perceived responsiveness.

Hardware cost saving: Switch and optical module expenses fell by one‑third, translating to an estimated 210–640 million CNY reduction for a ten‑thousand‑GPU deployment.

The upgrade incurs minimal marginal cost because it replaces only the networking layer, a crucial advantage when GPU supply remains tight and prices stay high.
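For readers unfamiliar with the metric, TTFT P99 is the time‑to‑first‑token value that 99% of requests stay at or below. A minimal sketch of how such a figure and its relative reduction can be computed, using the nearest‑rank method and synthetic samples that are for illustration only and are not measurements from the cluster:

```python
import math

def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def relative_reduction(before, after):
    """Fractional drop from before to after, e.g. 0.25 means 25% lower."""
    return (before - after) / before

# Synthetic TTFT samples in milliseconds (illustrative, not measured data).
baseline = [100] * 98 + [400, 800]
candidate = [100] * 98 + [300, 600]

p99_before = percentile_nearest_rank(baseline, 99)
p99_after = percentile_nearest_rank(candidate, 99)
print(p99_before, p99_after, relative_reduction(p99_before, p99_after))  # 400 300 0.25
```

Because P99 is set by the slowest 1% of requests, it is sensitive to exactly the kind of hotspot‑induced tail that topology changes aim to remove, even when the median barely moves.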

Broader Industry Implications

ZCube’s deployment, together with OpenAI’s MRC protocol, signals a shift in AI infrastructure focus from raw compute to overall system efficiency. As GPU supply tightens, procurement costs rise, and marginal returns from adding GPUs diminish, architectural innovations that unlock additional compute without new hardware become strategically valuable.

Historically, NVIDIA’s acquisition of Mellanox, announced in 2019 and completed in 2020, cemented InfiniBand’s dominance in AI data‑center networking. Emerging forces are now reshaping the landscape: the Ultra Ethernet Consortium’s standards, rapid growth in AI‑specific optical transceivers (TrendForce projects a 57% market increase from 2025 to 2026), and ASIC‑centric designs favoring open Ethernet.

Because ZCube reduces reliance on high‑end spine switches, it demands higher port density on leaf switches and encourages a move from a “few high‑end + many mid‑range” switch pyramid to a “many high‑density + faster optics” flat topology.

Conclusion

Network‑level innovations like ZCube can deliver a larger return on investment than intuition suggests: a 15% throughput gain and a one‑third reduction in switch and optical‑module cost without additional GPUs. As inference clusters scale to tens or hundreds of thousands of GPUs, network bottlenecks grow with them, making such architectural advances increasingly decisive in the AI compute race.
