How AI Model Training Is Redefining Data Center Scaling Strategies
Large‑scale AI model training now demands unprecedented bandwidth and latency performance, forcing data centers to adopt three scaling approaches—Scale‑up, Scale‑out, and Scale‑Across—while leveraging optical I/O, CPO, and optical circuit switching to overcome power, distance, and bandwidth limits.
AI Training Puts New Pressure on Data Center Interconnects
In recent years, massive AI models have come to demand bandwidth and latency far beyond what traditional data-center networks were designed for. While single-GPU compute performance (TFLOPS) continues to rise, interconnect performance lags behind, making the network the "Amdahl bottleneck" of AI infrastructure: the faster the GPUs get, the larger the share of each training step spent waiting on communication.
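A back-of-the-envelope Amdahl's-law view makes the point concrete. The sketch below (Python, with purely illustrative numbers) treats communication as the part of each training step that does not speed up when GPUs get faster:

```python
# Minimal sketch (illustrative numbers, not measurements): an Amdahl's-law
# view of why a lagging interconnect caps end-to-end training speedup.

def amdahl_speedup(compute_fraction: float, compute_speedup: float) -> float:
    """Overall speedup when only the compute share of step time accelerates.

    compute_fraction: share of an iteration spent in GPU math (0..1);
                      the remainder is communication over the interconnect.
    compute_speedup:  how much faster the new GPU generation runs that math.
    """
    comm_fraction = 1.0 - compute_fraction
    return 1.0 / (comm_fraction + compute_fraction / compute_speedup)

if __name__ == "__main__":
    # Hypothetical case: 70% of step time is compute, GPUs get 4x faster,
    # the network stays the same.
    print(f"Overall speedup: {amdahl_speedup(0.70, 4.0):.2f}x")  # ~2.1x, not 4x
```

Even a 4x jump in raw compute yields barely a 2x training speedup once the fixed communication share takes over, which is why the three scaling strategies below all revolve around the interconnect.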
Three Paths to Expand Compute and Interconnect
To meet growing AI compute needs, data centers must scale in three ways:
Scale‑up (Single‑Rack Expansion)
NVIDIA's NVL72, for example, integrates 18 compute trays and 9 switch trays in a single rack, effectively treating the whole rack as one giant GPU with shared memory. The chips inside the rack are fully interconnected through switch chips, which requires extremely high-bandwidth, low-latency links. The challenges are soaring power consumption and the bandwidth-distance trade-off of copper interconnects, which at 448 Gbps can only transmit reliably over less than 1 m. Optical I/O, such as silicon-photonic transceivers integrated into the chip package, is seen as the future path to dramatically increasing "beachfront density" (bandwidth per millimeter of package edge) and breaking through electrical I/O limits.
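To see why electrical I/O runs out of room at this scale, here is a rough sketch. The per-GPU scale-up bandwidth and SerDes lane rate are assumptions chosen only to illustrate orders of magnitude, not vendor specifications:

```python
# Hedged back-of-envelope for the scale-up domain. All figures are
# illustrative assumptions, not product specifications.

N_GPUS = 72                     # GPUs in one NVL72-class rack
PER_GPU_SCALEUP_TBPS = 14.4     # assumed per-GPU scale-up bandwidth, Tbit/s
LANE_RATE_GBPS = 224            # assumed electrical SerDes lane rate, Gbit/s

aggregate_tbps = N_GPUS * PER_GPU_SCALEUP_TBPS
lanes_per_gpu = PER_GPU_SCALEUP_TBPS * 1000 / LANE_RATE_GBPS

print(f"Aggregate scale-up bandwidth in the rack: {aggregate_tbps:.0f} Tbit/s")
print(f"Electrical lanes needed per GPU package:  {lanes_per_gpu:.0f}")

# Doubling the lane rate to 448 Gbit/s would halve the lane count, but per
# the copper reach limit noted above it also shrinks reliable cable length
# to under ~1 m, which is the pressure pushing I/O into the optical domain.
```

Dozens of high-speed lanes per package, all squeezed onto a few millimeters of die edge, is exactly the "beachfront" problem that in-package optical I/O is meant to relieve.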
Scale‑out (Cluster Expansion)
By adding more server nodes and interconnecting them with high-speed networks, larger AI clusters are built within a single data center. NVIDIA's DGX SuperPOD, for example, links hundreds to thousands of GPUs via InfiniBand or Ethernet. This approach demands switches with high port counts, high bandwidth density, and microsecond-level latency. Optical modules (400 Gbps, 800 Gbps, 1.6 Tbps) are now standard, but scaling to thousands of links strains traditional backplanes and pluggable optics in both power and faceplate density. Co-packaged optics (CPO) integrates the optical engines directly alongside the switch ASIC, raising bandwidth density and cutting per-bit power.
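The following sketch gives a feel for how link counts and optics power grow in such a fabric. The switch radix, NIC count, and energy-per-bit figures are assumptions for illustration, not measured values:

```python
# Hedged sketch: how quickly optical link counts and optics power add up in
# a scale-out fabric. Radix, port speed, and pJ/bit figures are assumptions.

GPUS = 4096                 # GPUs in the cluster
NICS_PER_GPU = 1            # one 800 Gbit/s NIC per GPU (assumed)
PORT_GBPS = 800
SWITCH_RADIX = 64           # 64 x 800G ports per switch (assumed)

# Two-tier leaf/spine, non-blocking: every GPU-facing port is matched by an
# uplink port, so optical links are roughly 2x the number of GPU ports.
gpu_ports = GPUS * NICS_PER_GPU
optical_links = 2 * gpu_ports
leaf_switches = -(-gpu_ports // (SWITCH_RADIX // 2))  # half down, half up

PLUGGABLE_PJ_PER_BIT = 15   # assumed energy for pluggable modules
CPO_PJ_PER_BIT = 5          # assumed energy for co-packaged optics

def optics_power_kw(pj_per_bit: float) -> float:
    total_bits_per_s = optical_links * PORT_GBPS * 1e9
    return total_bits_per_s * pj_per_bit * 1e-12 / 1e3

print(f"Leaf switches: {leaf_switches}, optical links: {optical_links}")
print(f"Optics power, pluggable: {optics_power_kw(PLUGGABLE_PJ_PER_BIT):.0f} kW")
print(f"Optics power, CPO:       {optics_power_kw(CPO_PJ_PER_BIT):.0f} kW")
```

With thousands of links, even a few picojoules per bit saved by moving the optical engine next to the ASIC translates into tens of kilowatts per cluster.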
Scale‑Across (Cross‑Data‑Center Expansion)
When a single data center reaches its power or space limits, multiple sites are linked into a "super-cluster". This requires long-distance, ultra-high-bandwidth, low-latency optical links. NVIDIA's Spectrum-XGS Ethernet, for example, introduces this third scaling dimension, enabling an inter-city AI fabric in which coherent DWDM links deliver 800 Gbps to 1.6 Tbps per wavelength.
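A quick latency and buffering estimate shows why Scale-Across links behave very differently from in-building ones; the distances below are illustrative assumptions:

```python
# Hedged sketch of why cross-data-center links are latency- and
# buffer-dominated. Distances and rates are illustrative assumptions.

FIBER_DELAY_US_PER_KM = 5.0      # ~5 microseconds per km in silica fiber
WAVELENGTH_GBPS = 1600           # one coherent DWDM wavelength, 1.6 Tbit/s

for distance_km in (10, 80, 300):        # campus, metro, inter-city (assumed)
    one_way_us = distance_km * FIBER_DELAY_US_PER_KM
    rtt_ms = 2 * one_way_us / 1000
    # Data "in flight" on the fiber that flow control and buffers must cover:
    in_flight_mb = WAVELENGTH_GBPS * 1e9 * (one_way_us * 1e-6) / 8 / 1e6
    print(f"{distance_km:>4} km: RTT {rtt_ms:6.2f} ms, "
          f"{in_flight_mb:8.1f} MB in flight per wavelength")
```

At inter-city distances the round-trip time is milliseconds and hundreds of megabytes are in flight per wavelength, so congestion control and scheduling, not raw bandwidth, become the hard part of the problem.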
Optical Circuit Switching (OCS) as a Game‑Changer
OCS replaces electronic packet switching with purely optical paths, eliminating O-E-O conversion and its associated latency and power cost. Because an optical switch matrix (e.g., an N×N crossbar) simply redirects light, OCS is data-rate agnostic: it can carry 400 Gbps, 800 Gbps, or faster signals without the per-port rate limits of electronic switches. Path latency drops to tens of nanoseconds, and the absence of queues removes jitter, which benefits latency-sensitive collectives such as All-Reduce in AI training.
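The standard ring All-Reduce cost model shows where per-hop switch latency enters; the sketch below uses illustrative numbers to compare a packet-switched fabric with an optical circuit path:

```python
# Hedged sketch: the classic ring All-Reduce cost model, showing where
# per-hop switching latency enters. All numbers are illustrative assumptions.

def ring_allreduce_time_s(msg_bytes: float, nodes: int,
                          link_gbytes_per_s: float,
                          per_hop_latency_s: float) -> float:
    """Ring All-Reduce: 2*(N-1) steps, each moving msg_bytes/N per link."""
    steps = 2 * (nodes - 1)
    bytes_per_step = msg_bytes / nodes
    return steps * (per_hop_latency_s +
                    bytes_per_step / (link_gbytes_per_s * 1e9))

GRADIENTS = 4e9          # 4 GB of gradients per iteration (assumed)
NODES = 1024
LINK = 100.0             # 100 GB/s per link (assumed)

packet_switched = ring_allreduce_time_s(GRADIENTS, NODES, LINK, 2e-6)   # ~2 us/hop
optical_circuit = ring_allreduce_time_s(GRADIENTS, NODES, LINK, 50e-9)  # ~50 ns/hop
print(f"Packet-switched fabric: {packet_switched * 1e3:.1f} ms per All-Reduce")
print(f"Optical circuit fabric: {optical_circuit * 1e3:.1f} ms per All-Reduce")
```

The latency term is multiplied by 2*(N-1), so at thousands of nodes even microseconds per hop, and especially the jitter around them, show up in every iteration; a queue-free optical path removes that variability.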
Google's TPU v4 supercomputer adopted OCS, using 3-D MEMS micro-mirror arrays that reconfigure links in milliseconds. Each OCS provides 136×136 ports, 8 of them kept as spares. The 4096 TPU chips are organized into 64 cubes of 64 chips each (a 4×4×4 arrangement); each cube connects through 96 optical links to 48 shared OCS units, forming a programmable 3-D torus network.
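The cube geometry explains where those numbers come from; the short consistency check below re-derives them from a 4×4×4 block (the grouping of links onto OCS units follows the published design, the arithmetic is just a sanity check):

```python
# Sketch that reproduces the TPU v4 figures cited above from the 4x4x4
# cube geometry; purely a consistency check of the numbers in the text.

CUBE_EDGE = 4
CHIPS_PER_CUBE = CUBE_EDGE ** 3                  # 64 chips per cube
CUBES = 4096 // CHIPS_PER_CUBE                   # 64 cubes in the pod

links_per_face = CUBE_EDGE * CUBE_EDGE           # 16 torus links leave each face
optical_links_per_cube = 6 * links_per_face      # 96 external optical links

# Opposite faces pair up to close the 3-D torus, so each cube contributes 48
# link pairs; one OCS unit handles the same link pair from every cube.
ocs_units = optical_links_per_cube // 2          # 48 OCS units
ports_used_per_ocs = 2 * CUBES                   # 128 of 136 ports (8 spare)

print(CUBES, optical_links_per_cube, ocs_units, ports_used_per_ocs)
```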