Why Rubin288’s Orthogonal CLOS Architecture Beats Traditional Designs
This article analyzes NVIDIA's Rubin288 high‑density GPU cabinet, compares its orthogonal CLOS architecture with older non‑orthogonal designs, and explains how the new layout improves reliability, bandwidth, scalability, and cooling for modern data‑center HPC deployments.
Introduction
Rubin288 is NVIDIA's next‑generation high‑density GPU rack. It accommodates up to 288 GPUs in a single cabinet, four times the GPU count of its predecessor, the 72‑GPU NVL72.
Problems with Legacy Cable‑Tray Architecture
Previous designs used a cable‑tray (non‑orthogonal) architecture in which compute and switch boards sit parallel to each other and connect through a backplane. Routing every link through the backplane introduces signal interference, limits bandwidth upgrades, and requires complex PCB routing, which in turn lowers manufacturing yield and hurts reliability.
The non‑orthogonal CLOS architecture is typical in campus‑core networking equipment such as Huawei S12700E‑08, H3C S10500X series, and Ruijie RG‑N18010‑E series.
Orthogonal CLOS Architecture
In the orthogonal design, compute and switch boards are mounted at 90° to each other and mate directly through high‑speed orthogonal connectors, which eliminates the backplane. The shorter signal path reduces attenuation, raises usable bandwidth, and lets capacity scale to hundreds of Tbps.
Hardware using this zero‑backplane approach includes the Huawei CloudEngine 16800 series, H3C S12500X series, and Ruijie RG‑N18010‑XH.
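The direct board‑to‑board mating described above can be sketched as a small connectivity model. This is an illustrative sketch only: it assumes every compute board mates with every switch board (a full mesh), and the board counts and function names are hypothetical, not taken from any vendor documentation.

```python
from itertools import combinations


def build_single_layer_fabric(num_compute, num_switch):
    """Map each compute board to the set of switch boards it plugs into.

    Assumption: every compute board mates directly with every switch board,
    which is what a full orthogonal (zero-backplane) mesh would provide.
    """
    return {c: set(range(num_switch)) for c in range(num_compute)}


def direct_links(fabric):
    """Count board-to-board connections (one per compute/switch pair)."""
    return sum(len(switches) for switches in fabric.values())


def all_pairs_one_switch_hop(fabric):
    """True if any two compute boards share at least one switch board."""
    return all(fabric[a] & fabric[b] for a, b in combinations(fabric, 2))


# Hypothetical small chassis: 4 compute boards, 2 switch boards.
fabric = build_single_layer_fabric(num_compute=4, num_switch=2)
print(direct_links(fabric))              # 8
print(all_pairs_one_switch_hop(fabric))  # True
```

In this model the number of connectors grows as compute boards times switch boards, while any two compute boards stay one switch hop apart, which is the property a single‑stage design relies on.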
Comparison with NVL72
NVL72 employed a non‑orthogonal layout with 18 compute nodes and 9 switch nodes per rack. Each compute node held four GPUs, and each switch node contained two 28.8 Tb/s NVLink switch chips. All GPUs were interconnected through a copper‑cable backplane, which was difficult to maintain and raised reliability concerns.
These issues motivated the shift to an orthogonal architecture for Rubin288.
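As a quick sanity check, the NVL72 figures quoted above can be multiplied out. The constants come from the text; the function and variable names are only illustrative.

```python
# NVL72 per-rack figures as stated in the article.
COMPUTE_NODES = 18
GPUS_PER_COMPUTE_NODE = 4
SWITCH_NODES = 9
CHIPS_PER_SWITCH_NODE = 2


def nvl72_totals():
    """Return (total GPUs, total NVLink switch chips) for one rack."""
    gpus = COMPUTE_NODES * GPUS_PER_COMPUTE_NODE
    switch_chips = SWITCH_NODES * CHIPS_PER_SWITCH_NODE
    return gpus, switch_chips


gpus, chips = nvl72_totals()
print(f"GPUs per rack: {gpus}")         # GPUs per rack: 72
print(f"NVLink switch chips: {chips}")  # NVLink switch chips: 18
```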
Rubin288 Orthogonal Design
Rubin288 replaces the backplane with direct copper connectors between compute and switch nodes, secured by locking mechanisms. Each compute node spans two standard rack widths, fitting eight Rubin GPUs and two CPUs.
ScaleOut NICs are not PCIe‑attached to the compute boards, although a small number of FrontEnd NICs may be present there. The compute section occupies 36U and holds all 288 GPUs; the switch tray uses next‑generation CX9/10 NICs.
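The Rubin288 numbers above imply a node count that the text does not spell out. The sketch below derives it from the stated totals. The reading that the result lines up with the 36U compute section (one node per U) is an inference, not something the source states, and all names are illustrative.

```python
# Figures stated in the article.
TOTAL_GPUS = 288
GPUS_PER_COMPUTE_NODE = 8
CPUS_PER_COMPUTE_NODE = 2
NVL72_GPUS = 72


def rubin288_layout():
    """Derive node and CPU counts from the stated per-node figures."""
    nodes, remainder = divmod(TOTAL_GPUS, GPUS_PER_COMPUTE_NODE)
    if remainder:
        raise ValueError("GPU total is not a whole number of compute nodes")
    return {
        "compute_nodes": nodes,
        "cpus": nodes * CPUS_PER_COMPUTE_NODE,
        "gpu_ratio_vs_nvl72": TOTAL_GPUS / NVL72_GPUS,
    }


print(rubin288_layout())
# {'compute_nodes': 36, 'cpus': 72, 'gpu_ratio_vs_nvl72': 4.0}
```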
Advantages of the Orthogonal Architecture
Reliability and Maintainability: Both compute and switch boards support N+1 redundancy and hot‑swap replacement, so a failed board takes down only a fraction of the system rather than the whole rack.
Network Optimization: A single‑layer switch topology avoids the complexity of multi‑layer designs, and placing ScaleOut NICs directly in the ScaleUp domain simplifies traffic flow.
Higher Yield and Simpler Cabling: Eliminating the backplane removes a major source of manufacturing defects, which the source credits with improving yield by more than an order of magnitude.
Thermal and Power Considerations: The design demands megawatt‑scale power and liquid‑cooling solutions to handle the dense GPU deployment.
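To put the reliability point in the list above into numbers, the sketch below estimates how much of the rack goes offline when one compute node is pulled, and checks the basic N+1 condition. The 288 GPUs and eight GPUs per node come from the text; the function names and the example of nine required units are hypothetical.

```python
from fractions import Fraction


def offline_fraction(total_gpus, gpus_per_node, failed_nodes=1):
    """Fraction of GPUs unavailable while the failed nodes are being swapped."""
    return Fraction(gpus_per_node * failed_nodes, total_gpus)


def n_plus_one_survives(required_units, failed_units):
    """N+1 means one spare beyond the units required to carry full load."""
    installed_units = required_units + 1
    return installed_units - failed_units >= required_units


print(offline_fraction(288, 8))                                  # 1/36
print(n_plus_one_survives(required_units=9, failed_units=1))     # True
print(n_plus_one_survives(required_units=9, failed_units=2))     # False
```

Under these assumptions, hot‑swapping a single compute node affects roughly 2.8% of the GPUs, and an N+1 group tolerates exactly one failed unit before capacity drops below what is required.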
Conclusion
The orthogonal CLOS architecture of Rubin288 addresses the reliability, scalability, and thermal challenges of previous backplane‑based systems, representing a clear trend toward zero‑backplane, high‑density HPC solutions for future data‑center deployments.
Architects' Tech Alliance
