Inside the High‑Performance GPU Server: A Deep Dive into A100/A800 & H100 Topologies

This article provides a detailed technical analysis of multi‑GPU server architectures, covering component breakdowns, NVSwitch networking, bandwidth calculations, and the differences between NVIDIA A100, A800, and H100 configurations for large‑scale AI workloads.

Architects' Tech Alliance

Dec 14, 2024

8‑GPU NVIDIA A100/A800 Server Node

The node consists of the following components:

Two CPU sockets (NUMA) with attached memory – general‑purpose processing.

Two storage network adapters – PCIe‑attached NICs that connect the node to distributed storage systems.

Four PCIe Gen4 switch chips – enable high‑speed PCIe routing between CPUs, GPUs and NICs.

Six NVSwitch chips – create a fully‑connected GPU‑to‑GPU fabric.

Eight NVIDIA A100 (or A800) GPUs – primary AI/ML compute units.

Eight GPU‑dedicated network adapters – carry inter‑node GPU communication.

Storage network adapter

Connected to the CPU via PCIe, the storage adapter handles:

High‑throughput read/write of distributed training data and checkpoint files.

Node‑level management tasks such as remote SSH access, performance monitoring and data collection.

The BlueField‑3 (BF3) DPU is the commonly recommended adapter, but any NIC that meets the bandwidth requirement will work: RoCE is the cost‑effective choice, while InfiniBand offers the highest performance.

NVSwitch network structure

Eight GPUs are interconnected through six NVSwitch chips, forming a full‑mesh topology. The NVLink bandwidth available to each GPU is calculated as n × bw_per_nvlink_lane, where n is the number of NVLinks on the GPU. For NVLink 3 on the A100 (50 GB/s bidirectional per link), each GPU gets 12 × 50 GB/s = 600 GB/s bidirectional (300 GB/s one‑way). The A800 cuts the link count to eight, yielding 8 × 50 GB/s = 400 GB/s bidirectional (200 GB/s one‑way).
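The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function and constant names are made up for this example, and the 50 GB/s per-link figure is the NVLink 3 value quoted in the text.

```python
# Bidirectional bandwidth of one NVLink 3 link in GB/s (figure quoted in the text).
NVLINK3_GB_PER_S = 50

def nvlink_bandwidth(n_links, per_link_bidir=NVLINK3_GB_PER_S):
    """Return (bidirectional, one_way) aggregate NVLink bandwidth per GPU in GB/s."""
    bidir = n_links * per_link_bidir
    return bidir, bidir / 2

# A100 exposes 12 links per GPU, A800 exposes 8.
for name, links in (("A100", 12), ("A800", 8)):
    bidir, one_way = nvlink_bandwidth(links)
    print(f"{name}: {bidir} GB/s bidirectional, {one_way} GB/s one-way")
```

Changing only the link count reproduces both the A100 and the A800 figures, which makes it clear that the A800 differs in the number of links rather than in per-link speed.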

The nvidia‑smi topology diagram for an 8 × A800 configuration shows:

GPU‑to‑GPU links are labeled NV8, indicating a bonded set of eight NVLinks.

NIC‑to‑NIC connections are marked NODE when both NICs sit on the same CPU socket (no NUMA crossing) and SYS when the path crosses NUMA domains.

GPU‑to‑NIC links are PXB (same CPU and under the same PCIe switch), NODE (same CPU but different PCIe switches), or SYS (different CPUs).
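When reading such a topology matrix, it can help to decode each cell programmatically. The sketch below is a hypothetical helper, not part of any NVIDIA tool; the descriptions paraphrase the legend that nvidia-smi prints under its topology matrix, and the wording here is an approximation of that legend rather than a verbatim copy.

```python
import re

# Paraphrased from the legend printed with the nvidia-smi topology matrix.
LEGEND = {
    "X": "self",
    "SYS": "PCIe plus the interconnect between NUMA nodes (CPU sockets)",
    "NODE": "PCIe plus the link between PCIe host bridges inside one NUMA node",
    "PHB": "PCIe plus a PCIe host bridge (typically the CPU)",
    "PXB": "multiple PCIe bridges, without crossing a PCIe host bridge",
    "PIX": "at most a single PCIe bridge",
}

def describe(label):
    """Translate one topology-matrix cell into a short description."""
    label = label.strip()
    match = re.fullmatch(r"NV(\d+)", label)
    if match:
        return f"bonded set of {int(match.group(1))} NVLinks"
    return LEGEND.get(label, f"unknown label {label!r}")

for cell in ("NV8", "NODE", "SYS"):
    print(cell, "->", describe(cell))
```

As a rule of thumb, the further a label sits from NV# and PIX toward SYS, the more hops the traffic takes and the lower the bandwidth to expect on that path.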

GPU node interconnect architecture

The overall inter‑node connectivity consists of a compute network and a storage network. Both rely on RDMA (Remote Direct Memory Access) to meet AI workload latency and throughput requirements. When choosing an RDMA technology, RoCE v2 offers a cost‑effective solution, while InfiniBand provides the highest raw bandwidth and lowest latency.

Bandwidth bottlenecks

GPU‑to‑GPU via NVLink: 600 GB/s bidirectional (A100) or 400 GB/s bidirectional (A800).

GPU‑to‑NIC via PCIe Gen4 switch (x16 link): 64 GB/s bidirectional (32 GB/s one‑way).

Inter‑host GPU‑to‑GPU via NIC: NICs are typically provisioned at 100 Gbps (12.5 GB/s) or 200 Gbps (25 GB/s) per direction, both of which fit within the 32 GB/s one‑way limit of PCIe Gen4 x16. A 400 Gbps NIC (50 GB/s) exceeds that limit and requires PCIe Gen5 for full utilization.
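To see how these tiers compare, the following sketch converts NIC line rates to GB/s and takes the minimum of the NIC and PCIe one-way figures as the usable inter-host rate. The helper names are hypothetical, and the model is deliberately simplified: it assumes the one-way figures listed above and ignores protocol overhead.

```python
# One-way bandwidth figures in GB/s, taken from the list above.
NVLINK_ONE_WAY = {"A100": 300.0, "A800": 200.0}
PCIE_GEN4_ONE_WAY = 32.0

def gbps_to_gbytes(gbps):
    """Convert a line rate in Gbit/s to GB/s (8 bits per byte)."""
    return gbps / 8

def inter_host_limit(nic_gbps, pcie_one_way=PCIE_GEN4_ONE_WAY):
    """One-way GB/s a GPU can send to a remote host: the slower of NIC and PCIe."""
    return min(gbps_to_gbytes(nic_gbps), pcie_one_way)

for nic in (100, 400):
    limit = inter_host_limit(nic)
    ratio = NVLINK_ONE_WAY["A100"] / limit
    print(f"{nic} Gbps NIC: {limit} GB/s usable, A100 NVLink is {ratio:.1f}x faster")
```

Under these assumptions a 400 Gbps NIC behind PCIe Gen4 is capped at the PCIe rate, and even then intra-node NVLink remains roughly an order of magnitude faster than the inter-host path.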

8‑GPU NVIDIA H100/H800 Server Node

Internal hardware topology

The H100 node keeps the eight‑GPU layout of the A100 design. The main differences are the number of NVSwitch chips (four instead of six) and the higher link bandwidth.

H100 GPUs are fabricated on a 4 nm process and feature 18 Gen4 NVLink connections per GPU, delivering 18 × 50 GB/s = 900 GB/s bidirectional bandwidth (25 GB/s per direction on each link).

H100 GPU chip details

Manufactured with a 4 nm process, providing high transistor density and power efficiency.

Bottom row contains 18 Gen4 NVLink ports, giving a total bidirectional bandwidth of 900 GB/s.

Central blue region is the L2 cache for fast temporary data storage.

Side regions integrate HBM memory stacks for high‑bandwidth data access.

Source: https://community.fs.com/cn/article/unveiling-the-foundations-of-gpu-computing1.html
