Inside the GPU Server: Architecture of A100/A800 and H100/H800 Nodes

This article provides a detailed technical breakdown of modern multi‑GPU server nodes, covering component composition, storage network cards, NVSwitch interconnects, bandwidth calculations, and the architectural differences between NVIDIA A100/A800 and H100/H800 configurations for AI training workloads.

Architects' Tech Alliance

Apr 10, 2024

GPU Server Node Overview

Large‑scale model training typically uses clusters where each server hosts multiple GPUs. This summary describes the hardware composition and interconnect topology of common 8‑GPU nodes based on NVIDIA A100/A800 and H100/H800 GPUs.

8‑GPU A100/A800 Node Architecture

Two CPU sockets (NUMA) with attached memory : General‑purpose compute.

Two storage network adapter cards : Access to distributed storage.

Four PCIe Gen4 switch chips : Provide high‑speed PCIe routing.

Six NVSwitch chips : Enable full‑mesh GPU‑to‑GPU communication.

Eight GPUs (A100 or A800) : Parallel AI processing units.

Eight GPU‑dedicated NICs : Carry GPU traffic between nodes (inter‑node data transfer).

[Figure: typical topology of an 8‑GPU A100/A800 node]

Storage Network Card Role

Efficient read/write to distributed storage, essential for feeding training data and checkpointing.

Supports node management functions such as remote SSH access, performance monitoring, and data collection.

NVIDIA recommends the BlueField‑3 (BF3) DPU for this role, but any adapter that delivers the required bandwidth works: RoCE where cost matters most, InfiniBand where performance matters most.

NVSwitch Network Structure

The NVSwitch chips give the node full‑mesh connectivity: every GPU can reach every other GPU through the switches rather than over PCIe. An 8‑GPU A100 node uses six NVSwitch chips.

Bandwidth (NVLink 3, 50 GB/s bidirectional per link):

12 NVLink links per GPU → 12 × 50 GB/s = 600 GB/s bidirectional (300 GB/s unidirectional) for A100.

8 NVLink links per GPU → 8 × 50 GB/s = 400 GB/s bidirectional (200 GB/s unidirectional) for A800.
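
The per‑GPU totals above are simple products, and a short sketch makes the arithmetic easy to re‑run for other link counts. The function name and constant below are illustrative only; the 50 GB/s per‑link figure is taken from the text and treated as a bidirectional rate.

```python
# Per-link NVLink 3 bandwidth in GB/s, both directions combined (figure from the text).
NVLINK3_PER_LINK_BIDIR = 50


def nvlink_aggregate(links: int, per_link_bidir: float = NVLINK3_PER_LINK_BIDIR):
    """Return (bidirectional, unidirectional) aggregate bandwidth in GB/s."""
    bidir = links * per_link_bidir
    return bidir, bidir / 2


for name, links in {"A100": 12, "A800": 8}.items():
    bidir, unidir = nvlink_aggregate(links)
    print(f"{name}: {bidir:.0f} GB/s bidirectional, {unidir:.0f} GB/s unidirectional")
```

With the link counts given above, this prints 600/300 GB/s for the A100 and 400/200 GB/s for the A800.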

Connection Types in the Topology

GPU‑to‑GPU (NV8) : Eight NVLink connections per GPU pair.

NIC‑to‑NIC :

NODE : Within the same CPU socket, no NUMA crossing.

SYS : Across CPU sockets, crossing NUMA.

GPU‑to‑NIC :

PXB : Same CPU socket and same PCIe switch.

NODE : Same CPU socket but different PCIe switch.

SYS : Different CPU sockets, crossing NUMA and PCIe switches.
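
One practical use of these distinctions is choosing, for each GPU, the NIC that sits closest to it. The sketch below ranks GPU‑to‑NIC paths by the two properties the list describes (shared CPU socket, shared PCIe switch). The `Device` fields, the `Locality` tier names, and the sample devices are hypothetical; they are not labels printed by any NVIDIA tool, and real placement data would come from the system's topology report.

```python
from dataclasses import dataclass
from enum import IntEnum


class Locality(IntEnum):
    """Hypothetical tiers, lower is closer."""
    SAME_SWITCH = 0   # same CPU socket, same PCIe switch
    SAME_SOCKET = 1   # same CPU socket, different PCIe switch
    CROSS_SOCKET = 2  # different CPU sockets (crosses NUMA)


@dataclass(frozen=True)
class Device:
    name: str
    socket: int       # CPU socket / NUMA node the device hangs off
    pcie_switch: int  # index of the PCIe switch chip above the device


def locality(gpu: Device, nic: Device) -> Locality:
    if gpu.socket != nic.socket:
        return Locality.CROSS_SOCKET
    if gpu.pcie_switch != nic.pcie_switch:
        return Locality.SAME_SOCKET
    return Locality.SAME_SWITCH


def closest_nic(gpu: Device, nics: list) -> Device:
    """Pick the NIC with the lowest locality tier for this GPU."""
    return min(nics, key=lambda nic: locality(gpu, nic))


gpu0 = Device("gpu0", socket=0, pcie_switch=0)
nics = [
    Device("nic_far", socket=1, pcie_switch=2),
    Device("nic_near", socket=0, pcie_switch=0),
    Device("nic_mid", socket=0, pcie_switch=1),
]
print(closest_nic(gpu0, nics).name)  # nic_near
```

Keeping GPU traffic on the NIC under the same PCIe switch avoids both the CPU's PCIe root and the inter‑socket link.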

GPU Node Interconnect Architecture

Compute and Storage Networks

The compute network connects GPU nodes for parallel computation, data exchange, and coordinated execution. The storage network links GPU nodes to distributed storage systems for massive data ingest and result output.

RDMA Importance

Remote Direct Memory Access (RDMA) is critical for high‑performance AI workloads. Choosing between RoCEv2 (cost‑effective) and InfiniBand (peak performance) depends on budget and performance requirements.

Bandwidth Bottlenecks

Intra‑host GPU‑GPU via NVLink: 600 GB/s bidirectional (300 GB/s unidirectional).

GPU‑to‑NIC within the same host (PCIe Gen4 switch): 64 GB/s bidirectional (32 GB/s unidirectional).

Inter‑host GPU‑GPU via NIC: typical NIC provides 100 Gbps (12.5 GB/s) unidirectional, far lower than intra‑host bandwidth.

A 400 Gbps NIC (50 GB/s unidirectional) exceeds the 32 GB/s that the PCIe Gen4 path delivers in one direction, so it yields little benefit unless the rest of the system supports PCIe Gen5 speeds.
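
The bottleneck reasoning above amounts to taking the minimum along the path from GPU to wire. The following is a rough sketch under simplifying assumptions: it uses the 32 GB/s unidirectional PCIe Gen4 figure from the list, ignores protocol overhead, and the function and constant names are made up for illustration.

```python
PCIE_GEN4_UNIDIR_GBYTES = 32.0  # GB/s, GPU-to-NIC path, one direction (from the text)


def gbps_to_gbytes(gbps: float) -> float:
    """Convert a line rate in gigabits per second to gigabytes per second."""
    return gbps / 8


def inter_host_unidir(nic_gbps: float, pcie_unidir: float = PCIE_GEN4_UNIDIR_GBYTES) -> float:
    """Usable one-way GPU-to-remote-GPU bandwidth: the slower of NIC and PCIe path."""
    return min(gbps_to_gbytes(nic_gbps), pcie_unidir)


for nic_gbps in (100, 200, 400):
    usable = inter_host_unidir(nic_gbps)
    print(f"{nic_gbps} Gbps NIC: line rate {gbps_to_gbytes(nic_gbps):.1f} GB/s, usable {usable:.1f} GB/s")
```

Under these assumptions a 400 Gbps NIC is capped at 32 GB/s by the PCIe Gen4 path, which is why the upgrade only pays off alongside PCIe Gen5.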

8‑GPU H100/H800 Node Architecture

H100 Node Hardware Topology

Each H100 host contains four NVSwitch chips (two fewer than the six in the A100 configuration).

H100 chips are fabricated on a 4 nm process and feature 18 fourth‑generation NVLink (NVLink 4) links per chip, delivering 900 GB/s of bidirectional bandwidth.

H100 GPU Chip Details

Manufactured with 4 nm technology.

Bottom row hosts 18 fourth‑generation NVLink links: 18 × 25 GB/s per direction = 450 GB/s unidirectional, or 900 GB/s bidirectional.

Central blue region is the L2 cache for fast temporary storage.

Side regions integrate HBM (high‑bandwidth memory) stacks, which serve as the GPU's memory.
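
The 900 GB/s figure is easiest to check by separating the per‑direction rate from the bidirectional total. This sketch assumes 25 GB/s is the rate of one NVLink 4 link in one direction, and compares the result with the 600 GB/s A100 total quoted earlier; the variable names are illustrative.

```python
H100_LINKS = 18
NVLINK4_PER_LINK_PER_DIRECTION = 25  # GB/s, assumed to be a one-way rate

h100_unidir = H100_LINKS * NVLINK4_PER_LINK_PER_DIRECTION  # one direction
h100_bidir = 2 * h100_unidir                               # both directions

A100_BIDIR = 12 * 50  # GB/s, from the A100 NVLink 3 figures earlier in the article

print(f"H100: {h100_unidir} GB/s unidirectional, {h100_bidir} GB/s bidirectional")
print(f"H100 / A100 NVLink bandwidth ratio: {h100_bidir / A100_BIDIR:.1f}x")
```

On these numbers the H100 offers 1.5 times the per‑GPU NVLink bandwidth of the A100.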

Source: https://community.fs.com/cn/article/unveiling-the-foundations-of-gpu-computing1.html
