How SUE Ethernet Redefines AI Cluster Interconnects for Scale‑Up Performance
This article examines Broadcom's Scale Up Ethernet (SUE) framework, detailing how it addresses rack‑scale AI/HPC interconnect challenges by delivering ultra‑high bandwidth, microsecond‑level latency, memory‑semantic operations, and seamless compatibility with existing Ethernet infrastructure for large XPU clusters.
It focuses on the "Scale‑Up" (single‑rack) interconnect needs of AI and high‑performance computing (HPC) clusters, and proposes the Ethernet‑based SUE framework to overcome traditional Ethernet's shortcomings in latency, bandwidth efficiency, and semantics when connecting XPU accelerators.
Background: New Interconnect Needs and Ethernet Pain Points
Core Requirements: Scalable XPU Interconnect
High bandwidth density: each XPU requires 6.4–12.8 Tbps of interconnect bandwidth for real‑time parameter synchronization.
Low latency: memory sharing and atomic operations between XPU units need microsecond‑level latency.
Ecosystem compatibility: reuse existing Ethernet cables, switches, and management software to keep deployment costs low.
Traditional Ethernet Pain Points
Integration difficulty: conventional NICs are too large to embed 8–16 instances per XPU package.
Semantic mismatch: Ethernet operates on packet‑level network semantics, while XPU workloads rely on memory‑level load/store semantics, requiring complex middleware.
Bandwidth inefficiency: IP/UDP headers consume 40 bytes, reducing payload efficiency to 64%–89% for the small transfers common in AI/HPC.
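The efficiency penalty is simple arithmetic. A minimal sketch, charging only the 40 bytes of IP/UDP headers against each transfer (the article's exact 64%–89% range depends on which overheads and payload sizes are counted):

```python
def payload_efficiency(payload_bytes: int, overhead_bytes: int = 40) -> float:
    """Fraction of transmitted bytes that carry actual payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)

# Small transfers pay proportionally more for the fixed 40 B of IP/UDP headers:
for payload in (64, 128, 256, 512):
    print(f"{payload:4d} B payload -> {payload_efficiency(payload):.1%} efficient")
```

The fixed header cost is why efficiency climbs with payload size, and why AI/HPC traffic dominated by small messages suffers most.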
SUE Core Positioning and Architecture
Core Positioning: Ethernet‑Based XPU Interconnect Optimization Framework
SUE (Scale Up Ethernet) is Broadcom's standard framework for single‑rack XPU interconnect. Its goal is to retain the advantages of the Ethernet ecosystem while solving XPU interconnect pain points, offering three key characteristics:
XPU‑native adaptation: direct memory‑semantic load/store/atomic operations between XPU units without CPU mediation.
High‑density scaling: each XPU can host 8–16 SUE instances, providing 6.4–12.8 Tbps per XPU and supporting up to 1,024 XPUs in a single rack.
Ethernet ecosystem reuse: compatible with standard Ethernet switches, cables, connectors, and retimers, avoiding new hardware investment.
Technical Architecture: Layered Optimization Balancing Performance and Compatibility
SUE builds on the traditional Ethernet layered model but adds custom optimizations for XPU interconnect, forming a complete stack: XPU‑semantic, transport, data‑link, and physical layers.
Core Technical Features and Advantages
Four Major Technical Breakthroughs
High‑Density Integration and Bandwidth
Multi‑instance design: a SUE instance is roughly one fifth the size of a traditional NIC; a single XPU can integrate 8–16 instances, delivering 6.4–12.8 Tbps of bandwidth.
Switch compatibility: works with Broadcom Tomahawk‑series switches, achieving <250 ns latency and line‑rate forwarding for 64‑byte packets.
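The headline totals are consistent with an 800 Gb/s per‑instance rate; the article states only the aggregate figures, so the per‑instance number below is an inference, not a published spec:

```python
PER_INSTANCE_GBPS = 800  # inferred: the article gives only the 6.4-12.8 T totals

for n_instances in (8, 16):
    total_tbps = n_instances * PER_INSTANCE_GBPS / 1000
    print(f"{n_instances} instances x {PER_INSTANCE_GBPS} Gbps = {total_tbps} Tbps")
```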
Memory‑Semantic Adaptation and Low Latency
Direct memory operations: the SUE NIC splits AXI4 write requests, encapsulates them into SUE frames with AFH headers, and forwards them over Ethernet; the destination NIC directly issues AXI4 writes, enabling end‑to‑end XPU‑to‑XPU memory access without middleware.
Latency optimization: a simplified protocol stack, link‑layer retransmission (LLR), and fixed‑size frames reduce end‑to‑end latency by 46%–75% compared with conventional Ethernet, approaching PCIe‑switch levels.
Bandwidth Efficiency via AI Fabric Header (AFH)
Lightweight header: AFH adds only 6–12 bytes on top of the Ethernet MAC header, raising small‑payload bandwidth efficiency to ~93.5%.
Flexible address adaptation: the address field can be 2 bytes (rack‑scale) or 4 bytes (cross‑rack), supporting device IDs, local IDs, or IP addresses.
Switch‑transparent forwarding: switches inspect only key AFH fields and ignore extended fields, ensuring full compatibility with existing Ethernet switches.
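The article does not publish the AFH wire layout, so the field split below is purely illustrative; it only demonstrates how a 2‑byte address keeps the header at the 6‑byte floor while 4‑byte addresses push it toward the 12‑byte ceiling:

```python
import struct

def pack_afh_compact(dst: int, src: int, opcode: int, flags: int) -> bytes:
    """Hypothetical rack-scale AFH: 2 B dst + 2 B src + 1 B opcode + 1 B flags = 6 B."""
    return struct.pack("!HHBB", dst, src, opcode, flags)

def pack_afh_extended(dst: int, src: int, opcode: int, flags: int, seq: int) -> bytes:
    """Hypothetical cross-rack AFH: 4 B dst + 4 B src + op/flags + 2 B seq = 12 B."""
    return struct.pack("!IIBBH", dst, src, opcode, flags, seq)

compact = pack_afh_compact(dst=0x0042, src=0x0007, opcode=0x01, flags=0x00)
extended = pack_afh_extended(dst=0x0A000042, src=0x0A000007, opcode=0x01, flags=0x00, seq=1)
assert len(compact) == 6 and len(extended) == 12  # matches the 6-12 B range above
```

A switch that understands only the leading key fields can still forward both variants, which is the "switch‑transparent" property described above.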
Reliability and Flow Control
Link‑level retransmission (LLR): retransmits immediately on packet loss at the link layer, avoiding high‑latency transport‑layer recovery.
Credit‑based flow control (CBFC): dynamically adjusts the sending rate based on available credits, preventing congestion under bursty XPU traffic.
End‑to‑end encryption: supports transport‑layer encryption and authentication for secure transmission of sensitive AI model parameters.
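The CBFC mechanism can be reduced to a counter: a sender may only transmit while it holds credits, and the receiver returns a credit each time it frees a buffer. A minimal sketch (units and credit granularity are assumptions, not SUE specifics):

```python
class CreditFlowControl:
    """Minimal credit-based flow control: one credit == one receive buffer cell."""

    def __init__(self, initial_credits: int):
        self.credits = initial_credits

    def can_send(self) -> bool:
        return self.credits > 0

    def on_send(self) -> None:
        # Consume a credit for each frame put on the wire.
        if self.credits <= 0:
            raise RuntimeError("sent without credit")
        self.credits -= 1

    def on_credit_return(self, n: int = 1) -> None:
        # Receiver freed n buffer cells and returned the credits.
        self.credits += n

fc = CreditFlowControl(initial_credits=2)
fc.on_send()
fc.on_send()
assert not fc.can_send()   # sender stalls instead of overrunning the receiver
fc.on_credit_return()
assert fc.can_send()       # credit return unblocks the next frame
```

Because the sender stalls itself, a burst from an XPU can never overflow the peer's buffers, which is how CBFC prevents congestion without dropping frames.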
Typical Architecture and Workflow
Two Core Topologies for Different XPU Scales
Mesh topology: suited to small clusters (8–64 XPUs); XPU units interconnect directly via SUE instances, achieving the lowest latency.
Switched topology: suited to large clusters (64–1,024 XPUs); standard Ethernet switches provide flexible expansion with <250 ns latency and line‑rate forwarding.
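The crossover between the two topologies follows from link-count growth. Assuming a full mesh (every XPU pair directly linked, which the article implies but does not state explicitly):

```python
def mesh_links(n_xpus: int) -> int:
    # Full mesh: one direct link per XPU pair, n*(n-1)/2 total
    return n_xpus * (n_xpus - 1) // 2

print(mesh_links(8))   # 28 links: tractable at small scale
print(mesh_links(64))  # 2016 links: why larger clusters move to switches
```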
Remote Store Workflow (XPU‑to‑XPU Memory Write)
Source XPU processing element (PE) issues a store instruction.
Source XPU’s SUE NIC splits the AXI4 write request, packages it into a SUE frame (including AFH), and sends it over Ethernet.
Destination XPU’s SUE NIC receives the frame, parses it, and initiates an AXI4 write to local HBM memory.
HBM completes the write and returns an AXI completion signal to the destination SUE NIC.
Destination SUE NIC wraps the completion into a SUE response frame and sends it back to the source.
Source XPU’s SUE NIC forwards the completion to its PE, finishing the store operation. End‑to‑end latency stays within 350–400 ns, far lower than traditional Ethernet (650 ns–1.4 µs).
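The six steps above can be sketched as a toy end‑to‑end model. All names are illustrative; a real SUE NIC splits AXI4 bursts and builds frames in hardware:

```python
class SueNic:
    """Toy model of a SUE NIC attached to one XPU and its local HBM."""

    def __init__(self, device_id: int):
        self.device_id = device_id
        self.hbm = {}  # stand-in for local HBM: address -> bytes

    def remote_store(self, peer: "SueNic", addr: int, data: bytes) -> str:
        # Steps 2-3: split the AXI4 write, wrap it in a SUE frame with an
        # AFH header, and "send" it over Ethernet (a method call here).
        frame = {"afh": {"dst": peer.device_id, "op": "store", "addr": addr},
                 "payload": data}
        return peer.receive(frame)

    def receive(self, frame: dict) -> str:
        # Steps 4-5: parse the frame and issue the AXI4 write into local HBM.
        self.hbm[frame["afh"]["addr"]] = frame["payload"]
        # Step 6: wrap the AXI completion into a SUE response frame.
        return "completion"

src, dst = SueNic(device_id=0), SueNic(device_id=1)
status = src.remote_store(dst, addr=0x1000, data=b"\xde\xad")
assert status == "completion"
assert dst.hbm[0x1000] == b"\xde\xad"  # the store landed in the peer's memory
```

The point of the model is step pairing: every store carries enough AFH context for the destination NIC to complete the write locally and answer with a completion, so no CPU or middleware sits on the path.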
Performance Advantages and Ecosystem Value
Performance Comparison with Traditional Ethernet
SUE outperforms conventional Ethernet across latency, bandwidth efficiency, and packet handling, making it especially suitable for AI/HPC workloads that demand microsecond‑level latency and high‑density bandwidth.
Ecosystem Value: Reusing Ethernet to Lower Deployment Barriers
Hardware reuse: utilizes existing Ethernet cables (e.g., QSFP‑DD), connectors, and retimers without new infrastructure.
Software compatibility: works with current Ethernet management tools (traffic monitoring, fault diagnosis), so operations staff need no retraining.
Future Outlook
Future work will extend SUE to cross‑rack "Scale‑Out" by supporting 4‑byte AFH addresses, enhancing end‑to‑end encryption, and adding dynamic load balancing, positioning SUE as a key interconnect technology for next‑generation AI/HPC clusters.
Architects' Tech Alliance
Sharing project experience and insights into cutting‑edge architectures, with a focus on cloud computing, microservices, big data, hyper‑convergence, storage, data protection, artificial intelligence, and industry practices and solutions.
