Evolution and Forecast of Nvidia NVLink, NVLink C2C, and B100/X100 GPU Architectures
The article analyses the historical evolution of Nvidia's NVLink and NVLink C2C interconnect technologies, compares them with PCIe, Ethernet and InfiniBand, and uses these trends to predict future AI‑chip architectures such as the B100 and X100 GPUs, highlighting design trade‑offs and packaging challenges.
Building on previous analyses of Nvidia AI‑chip roadmaps, this piece examines how interconnect technologies shape the physical architecture of chips and systems. By tracing the development of interconnects and considering layout and process constraints, the author forecasts Nvidia's future AI‑chip designs and identifies emerging interconnect requirements.
NVLink and NVLink C2C Evolution – Interconnect technology evolves gradually, with bandwidth, modulation, and encoding following stable physical laws. Rather than delving into low‑level details, the article discusses NVLink and NVLink C2C from a macro‑logic perspective, outlining their historical generations and projecting future directions.
NVLink has progressed through four generations, while NVLink C2C is in its first. Comparing NVLink's speed evolution with competing protocols (PCIe, Ethernet, CXL, InfiniBand) shows that NVLink originally targeted GPU-GPU links while retaining the PCIe interface for GPU-CPU connections. NVLink's per-lane SerDes rates have consistently sat between contemporary PCIe and Ethernet rates, letting it borrow mature Ethernet SerDes technology for higher speed at lower cost.
Key technical distinctions are highlighted: at 50 Gbps per lane, NVLink 3.0 uses NRZ modulation rather than Ethernet's PAM4, achieving error-free transmission without FEC and therefore lower latency, whereas InfiniBand of the same generation follows Ethernet onto PAM4 and gives up that latency advantage. In the 100 Gbps and 200 Gbps generations, all three interfaces converge on essentially the same SerDes technology, but NVLink remains a private ecosystem with no cross-generation compatibility requirement, which gives it design flexibility and premium pricing potential.
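To make these per-lane figures concrete, the sketch below computes each generation's aggregate per-GPU NVLink bandwidth from the commonly cited lane rate, lanes per link, and link count (reference values for P100/V100/A100/H100, not official specifications):

```python
# Aggregate NVLink bandwidth per GPU, derived from per-lane rate, lanes per
# link, and links per GPU. Values are the commonly cited figures for each
# generation and are used here only as reference points.
GENERATIONS = [
    # name,                Gbps/lane, lanes/link, links/GPU, modulation
    ("NVLink 1.0 (P100)",  20,  8,  4, "NRZ"),
    ("NVLink 2.0 (V100)",  25,  8,  6, "NRZ"),
    ("NVLink 3.0 (A100)",  50,  4, 12, "NRZ"),
    ("NVLink 4.0 (H100)", 100,  2, 18, "PAM4"),
]

for name, gbps, lanes, links, modulation in GENERATIONS:
    # bidirectional GB/s = Gbit/s per lane * lanes * links * 2 directions / 8 bits
    total_gbytes = gbps * lanes * links * 2 / 8
    print(f"{name}: {modulation}, {total_gbytes:.0f} GB/s per GPU")
# -> 160, 300, 600 and 900 GB/s respectively
```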
NVLink’s development can be divided into two stages: generations 1.0‑3.0 focus on intra‑box and intra‑rack GPU interconnects, competing directly with PCIe by adopting faster Ethernet‑derived SerDes and introducing NVSwitch for bus‑domain networking. From NVLink 4.0 onward, the technology moves beyond the box, targeting InfiniBand and Ethernet networks, likely employing lightweight FEC with link‑level retransmission to achieve low latency and high reliability for memory‑semantic networking.
B100 GPU Architecture Forecast – Using the H100 layout as a baseline, the B100 is envisioned as a dual-die GPU with two possible stitching approaches: joining the dies along their HBM edges (which doubles the IO edge length but not HBM capacity) or along their IO edges (which doubles the HBM edge length but not IO width). Given the need for more than 2× improvements across memory, compute, and interconnect, the IO-edge stitching option is deemed more probable, though it demands higher IO density. The B100 may consist of heterogeneous dies, occupy roughly 3.3-3.9× the reticle area (still within current CoWoS limits), and reuse NVLink C2C for die-to-die communication.
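A back-of-the-envelope area check for the dual-die scenario is sketched below; the die size, HBM footprint, reticle size, and packing-overhead factor are all illustrative assumptions, not disclosed figures:

```python
# Rough interposer-area estimate for a dual-die B100, expressed as a multiple
# of the lithography reticle. All constants are assumptions for illustration.
RETICLE_MM2 = 858   # approximate reticle limit
DIE_MM2 = 814       # near-reticle compute die, similar to H100
HBM_MM2 = 110       # assumed footprint of one HBM3 stack
OVERHEAD = 1.2      # assumed allowance for die spacing and interposer routing

def reticle_multiple(compute_dies: int, hbm_stacks: int) -> float:
    """Total packaged silicon area as a multiple of the reticle."""
    silicon = compute_dies * DIE_MM2 + hbm_stacks * HBM_MM2
    return silicon * OVERHEAD / RETICLE_MM2

print(f"B100 (2 dies, 8 HBM stacks): ~{reticle_multiple(2, 8):.1f}x reticle")
# -> ~3.5x, consistent with the 3.3-3.9x range cited above and within CoWoS limits
```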
Separating the IO die from the compute die standardizes the compute die, enabling volume production of a single design and flexible configuration. Two architectures are then possible: co-packaging homogeneous compute and IO dies, or packaging them separately and linking them with C2C interconnect.
X100 GPU Architecture Forecast – A single-socket, four-die X100 would exceed 6× the reticle area, surpassing the 2025 advanced-packaging target. To stay within that constraint, HBM would have to be 3D-stacked directly on the compute dies, which poses significant thermal challenges. Alternatively, a SuperChip approach that extends the B100 dual-die design, with three architecture options (heterogeneous die co-packaging, homogeneous compute-IO die co-packaging, or separate packaging linked by C2C), can sidestep the area limit and requires only enhanced NVLink C2C capabilities.
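The same rough estimate, extended to a four-die configuration, illustrates why the single-socket X100 runs past the packaging target (again using assumed die and HBM footprints):

```python
# Extending the area estimate to a single-socket, four-die X100 with twelve
# HBM stacks. All constants remain illustrative assumptions.
RETICLE_MM2, DIE_MM2, HBM_MM2, OVERHEAD = 858, 814, 110, 1.2

silicon = 4 * DIE_MM2 + 12 * HBM_MM2   # four compute dies, twelve HBM stacks
multiple = silicon * OVERHEAD / RETICLE_MM2
print(f"X100 (4 dies, 12 HBM stacks): ~{multiple:.1f}x reticle")
# -> ~6.4x, beyond the ~6x target, motivating the 3D-HBM or SuperChip alternatives
```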
In summary, the analysis predicts that NVLink 5.0 will likely reach 200 Gbps per lane with up to 32 links per GPU, while NVSwitch 4.0 could scale port counts and total bandwidth dramatically, supporting the next generation of AI accelerators.
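A quick sanity check of that prediction, assuming each link keeps NVLink 4.0's two lanes per direction (an assumption, since the lane count per link is not stated):

```python
# Projected NVLink 5.0 bandwidth per GPU under the prediction above.
lane_gbps = 200        # predicted per-lane rate
lanes_per_link = 2     # assumed, carried over from NVLink 4.0
links = 32             # predicted upper bound of links per GPU

per_link_gbytes = lane_gbps * lanes_per_link * 2 / 8   # bidirectional GB/s per link
total_tbytes = per_link_gbytes * links / 1000
print(f"per link: {per_link_gbytes:.0f} GB/s, per GPU: {total_tbytes:.1f} TB/s")
# -> 100 GB/s per link and 3.2 TB/s per GPU, versus 900 GB/s on the H100
```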
— Author: Lu Yuchun | Source: https://www.chaspark.com/#/hotspots/950120945305616384