Why AI Training Hits a Network Wall and the Five Protocols Fighting for the Next‑Gen AI Interconnect
As AI models scale from billions to trillions of parameters and GPU clusters grow from dozens to hundreds of thousands of cards, traditional data‑center networking can no longer handle exabyte‑level traffic. The result is a fierce battle among five open scale‑up protocols (ESUN, SUE, ETH‑X, OISA, and ETH+), each offering a different trade‑off among latency, compatibility, performance, and scalability.
1. Why Scale‑up Is Mandatory
Training a model like GPT‑4 requires tens of thousands of GPUs running for weeks, generating communication volumes at the exabyte (EB) level. Conventional data‑center networks were designed for CPU‑centric, cost‑sensitive workloads and prioritize elasticity rather than the deterministic, low‑latency, lossless, ultra‑efficient data transfer that AI clusters require. This gap forces a redesign of inter‑GPU communication, extending the scale‑up domain from a few cards per node to super‑nodes of 64, 128, or even 512 GPUs.
2. The Five Competing Players
Five major industry groups have released open scale‑up specifications:
ESUN v1.0 – a joint effort by Meta and Microsoft, built on the OCP Ethernet standard with a 4‑byte extension.
SUE (AFH Gen1/Gen2) – driven by Broadcom, offering two versions: Gen1 for compatibility and Gen2 with an aggressive 12‑byte header.
ETH‑X – Tencent’s ODCC‑backed solution that adds a 10‑byte custom field for future GPU extensions.
OISA – China Mobile’s carrier‑centric design targeting full‑duplex, point‑to‑point, non‑blocking links up to 1024 GPUs.
ETH+ – a joint Alibaba‑Chinese Academy of Sciences effort that uniquely supports both scale‑up and scale‑out in a unified architecture.
3. Detailed Protocol Dissection
ESUN v1.0
Meta and Microsoft leave the existing Ethernet ecosystem untouched, adding only a 4‑byte extension to the standard 14‑byte Ethernet header. Advantages: high compatibility, low risk, and reuse of existing supply chains. Drawback: the resulting 18 bytes of per‑frame header overhead impose a performance ceiling.
SUE (Broadcom)
Broadcom’s “double‑track” approach offers Gen1 (compatibility‑first) and Gen2 (aggressive). Gen2 compresses the header to 12 bytes using address compression and SLAP addressing, yielding high payload efficiency. It delivers the strongest raw performance but requires building a new ecosystem.
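To get a feel for how tight a 12‑byte header budget is, the sketch below packs a hypothetical Gen2‑style header with Python's `struct` module. The field names and widths are illustrative assumptions for this sketch, not the actual SUE specification.

```python
import struct

# Hypothetical 12-byte compressed header, loosely inspired by SUE Gen2's
# budget. Field names and widths are assumptions for illustration only:
#   2 B destination endpoint ID (compressed, SLAP-style local address)
#   2 B source endpoint ID
#   2 B flags / opcode
#   2 B sequence number
#   4 B memory offset fragment
HDR_FMT = "!HHHHI"  # network byte order, 12 bytes total

def pack_header(dst: int, src: int, flags: int, seq: int, offset: int) -> bytes:
    """Pack one header; struct raises an error if a field overflows its width."""
    return struct.pack(HDR_FMT, dst, src, flags, seq, offset)

hdr = pack_header(dst=0x0042, src=0x0007, flags=0x0001, seq=1234, offset=0x1000)
print(len(hdr))  # 12 -- two bytes under Ethernet's 14-byte MAC header alone
```

Compressing both endpoint addresses to 2 bytes each is what makes this budget possible; full 6‑byte MAC addresses alone would consume the entire 12 bytes.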
ETH‑X (Tencent)
Tencent introduces a mixed‑header format with a reserved 10‑byte custom space, enabling GPU vendors to add proprietary fields. It provides flexibility and strong domestic adoption, but lacks native end‑to‑end reliable transport and caps at roughly 512 GPUs.
OISA (China Mobile)
OISA flattens the packet header and tightly couples the network and transaction layers, guaranteeing non‑blocking, point‑to‑point links up to 1024 GPUs. Its design emphasizes manageability, operability, and large‑scale scheduling for compute‑intensive workloads.
ETH+ (Alibaba + CAS)
ETH+ is the most comprehensive entry, supporting both scale‑up and scale‑out. It introduces a link‑bypass mechanism and a 1‑byte preamble, achieving 85.3 % payload efficiency. It also implements IFEC, an in‑network computing capability comparable to NVLink SHARP, and claims scalability to clusters of 100 k GPUs.
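Taking the quoted 85.3 % figure at face value, and assuming it was measured on a 128‑byte micro‑packet payload with all framing overhead counted (an assumption, since the article does not state the measurement conditions), we can back out the implied per‑frame overhead:

```python
PAYLOAD = 128       # bytes: one AI micro-packet (a single cache line)
EFFICIENCY = 0.853  # payload share of total bytes on the wire, as quoted

# total_frame = PAYLOAD / EFFICIENCY, so overhead = total_frame - PAYLOAD
implied_overhead = PAYLOAD * (1 / EFFICIENCY - 1)
print(round(implied_overhead, 1))  # ~22.1 bytes of header + framing per frame
```

That implied ~22 bytes would cover a compact header plus residual framing such as the 1‑byte preamble, which is plausible for a design that strips most of Ethernet's classic per‑frame costs.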
4. Performance Metrics and Load Ratio
In AI micro‑packet scenarios (128 B cache line), each byte shaved from the header yields a noticeable performance jump. The top‑ranked protocol outperforms the worst by more than 4 % in load ratio; when extrapolated to a 100 k‑GPU cluster, this translates into substantial differences in training speed, cost, and power consumption.
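The arithmetic behind that spread is easy to check. Using the header sizes quoted earlier (18 B for ESUN, 12 B for SUE Gen2) and counting only that header as overhead (a simplification that ignores preamble, FCS, and inter‑packet gap), a 128‑byte payload gives:

```python
PAYLOAD = 128  # bytes: one cache line per micro-packet

def load_ratio(header_bytes: int) -> float:
    """Fraction of bytes on the wire that carry payload."""
    return PAYLOAD / (PAYLOAD + header_bytes)

esun = load_ratio(18)      # 128/146, about 87.7%
sue_gen2 = load_ratio(12)  # 128/140, about 91.4%
print(f"ESUN: {esun:.1%}, SUE Gen2: {sue_gen2:.1%}")
print(f"relative gain: {sue_gen2 / esun - 1:.1%}")  # ~4.3% more payload per wire-byte
```

A roughly 4 % relative gain per frame is small in isolation, but at 100 k GPUs every frame carries that tax, which is why header bytes translate directly into training speed, cost, and power.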
The decisive capabilities identified are:
Reliable transmission – ability to sustain millions of GPUs without packet loss.
In‑network computing – acceleration of collective operations.
Ecosystem openness – avoidance of vendor lock‑in.
Only ETH+ and SUE currently provide full end‑to‑end reliable transport, and ETH+ has already released the IFEC standard.
5. Future Trends (2‑3 Years Outlook)
Open ecosystems will dominate – Ethernet‑based solutions are expected to capture ~80 % of the market, while NVLink remains a niche.
Memory‑semantic networking becomes standard – remote GPU access will behave like local memory access, a core requirement for scale‑up.
Header compression will continue – the industry is moving from 18 B → 12 B → even smaller headers.
Scale‑up and scale‑out will converge – ETH+ already demonstrates a hybrid approach.
Standard unification will accelerate – IEEE, OCP, and ODCC are actively merging specifications, ending the current fragmentation.
Ultimately, the protocol that combines strong performance, broad compatibility, open licensing, massive scalability, and full‑scenario support will define the AI interconnect standard for the upcoming era of hundred‑thousand‑GPU clusters.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
