Evolution of NVIDIA GPU Architectures from Fermi to Ampere
This article provides a comprehensive overview of NVIDIA's GPU architecture evolution—covering Fermi, Kepler, Maxwell, Pascal, Volta, Turing, and Ampere—detailing compute capabilities, SM structures, specialized units such as Tensor Cores, and their impact on AI and high‑performance computing workloads.
Evolution of NVIDIA GPU Architectures
NVIDIA has continuously refined its GPU designs, releasing a series of architectures—Fermi, Kepler, Maxwell, Pascal, Volta, Turing, and Ampere—each introducing new compute capabilities, SM configurations, and specialized hardware (e.g., Tensor Cores) that significantly boost AI training, inference, and HPC performance.
Fermi (Compute Capability 2.0, 2.1)
Each SM comprises 2 warp schedulers, 32 CUDA cores (arranged in two 16-core lanes), 16 LD/ST units, and 4 SFUs. Each CUDA core contains one FP32 FMA unit and one integer ALU; double-precision work runs at half the single-precision rate, yielding up to 16 FP64 FMAs per SM per cycle.
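As a quick sanity check on these unit counts, peak arithmetic throughput follows from units × 2 ops per FMA × clock × SM count. The 14 active SMs and 1.15 GHz shader clock below are assumptions modeled on a Tesla C2050-class board, not figures from the article:

```python
# Peak-throughput sketch for a Fermi-class GPU. The SM count (14) and
# 1.15 GHz shader clock are assumed, Tesla C2050-style values.
CORES_PER_SM = 32        # FP32 CUDA cores per SM
FP64_FMA_PER_CYCLE = 16  # double-precision FMAs per SM per cycle
FLOPS_PER_FMA = 2        # one FMA counts as a multiply plus an add

def peak_gflops(sms: int, clock_ghz: float, units_per_sm: int) -> float:
    """Peak GFLOP/s = SMs x units x 2 ops/FMA x clock (GHz)."""
    return sms * units_per_sm * FLOPS_PER_FMA * clock_ghz

print(f"FP32: {peak_gflops(14, 1.15, CORES_PER_SM):.0f} GFLOP/s")
print(f"FP64: {peak_gflops(14, 1.15, FP64_FMA_PER_CYCLE):.0f} GFLOP/s")
```

At these assumed clocks the result lands near the ~1.03 TFLOP/s FP32 figure commonly quoted for Fermi-era Tesla boards.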
Kepler (Compute Capability 3.0, 3.2, 3.5, 3.7)
SM (named SMX) expands to 4 Warp Schedulers, 8 Dispatch Units, 192 CUDA cores, and 64 dedicated double‑precision units, improving FP64 performance over previous generations.
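Putting the per-SM unit counts side by side makes Kepler's double-precision jump concrete (a small illustrative calculation using only the figures given above):

```python
# Per-SM FMA issue rates taken from the unit counts in the text.
fermi = {"fp32": 32, "fp64": 16}    # FP64 via half-rate CUDA cores
kepler = {"fp32": 192, "fp64": 64}  # 64 dedicated FP64 units

for prec in ("fp32", "fp64"):
    factor = kepler[prec] // fermi[prec]
    print(f"{prec}: Kepler SMX issues {factor}x the FMAs/cycle of a Fermi SM")
```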
Maxwell (Compute Capability 5.0, 5.2, 5.3)
The SM (renamed SMM) shrinks to 128 CUDA cores split across four 32-core partitions, with 4 warp schedulers, 8 dispatch units, 32 SFUs, and 32 LD/ST units. Dedicated FP64 hardware is cut back sharply (to a 1/32 rate), trading double-precision throughput for power efficiency and roughly 1.4× higher per-core performance.
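One way to see why the smaller SM helps utilization: Maxwell aligns each scheduler's share of cores with the 32-lane warp width, so a scheduler can keep its partition fully busy every cycle. A sketch of that arithmetic (the framing is ours; core and scheduler counts are from the text):

```python
WARP_SIZE = 32  # lanes per warp in every architecture discussed here

# (cores, schedulers) per SM, from the counts in the text
configs = {"Kepler SMX": (192, 4), "Maxwell SMM": (128, 4)}

for name, (cores, schedulers) in configs.items():
    per_sched = cores // schedulers
    print(f"{name}: {per_sched} cores/scheduler = "
          f"{per_sched / WARP_SIZE:g} warp-widths")
```

A non-integer ratio (Kepler's 1.5) means some cores can only be fed by dual-issuing a second instruction per warp; Maxwell's exact 1.0 removes that dependence.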
Pascal (Compute Capability 6.0, 6.1, 6.2)
The SM contains 2 warp schedulers, 4 dispatch units, 64 CUDA cores, 32 FP64 units (marking the return of strong double precision), 16 LD/ST units, and 16 SFUs. Key innovations include FP16 half-precision arithmetic (at twice the FP32 rate on GP100), NVLink, HBM2 memory, and Unified Memory, targeting deep-learning and HPC workloads.
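To see what FP16 costs in accuracy, a value can be round-tripped through IEEE 754 half precision with Python's `struct` module (a pure-Python sketch, not NVIDIA code; the sample values are arbitrary):

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float through IEEE 754 half precision ('e' format)."""
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_fp16(1.0001))  # 1.0 -- the delta is below FP16 resolution near 1
print(to_fp16(0.1))     # 0.0999755859375 -- nearest representable value
```

With only 10 mantissa bits (about 3 decimal digits), FP16 doubles throughput and halves memory traffic at the cost of precision, the trade-off that motivated mixed-precision deep-learning training.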
Volta (Compute Capability 7.0, 7.2)
Introduces the Tensor Core, a dedicated matrix multiply-accumulate unit. Each SM houses 4 warp schedulers, 64 FP32 cores, 64 INT32 cores, 32 FP64 cores, 8 Tensor Cores, 32 LD/ST units, and 4 SFUs. The architecture splits the FP32 and INT32 pipelines so the two instruction types can issue concurrently, reduces FMA issue latency, and adds per-thread program counters for independent thread scheduling within a warp.
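The Tensor Core's basic operation, D = A×B + C on 4×4 tiles with FP16 inputs and higher-precision accumulation, can be emulated in a few lines (a behavioral sketch only; real hardware performs this across a warp in a single operation):

```python
import struct

def fp16(x: float) -> float:
    """Round through IEEE 754 half precision, mimicking Tensor Core inputs."""
    return struct.unpack('e', struct.pack('e', x))[0]

def mma_4x4(A, B, C):
    """D = A*B + C: FP16-rounded inputs, full-precision accumulation."""
    return [[sum(fp16(A[i][k]) * fp16(B[k][j]) for k in range(4)) + C[i][j]
             for j in range(4)] for i in range(4)]

I4 = [[float(i == j) for j in range(4)] for i in range(4)]
Z4 = [[0.0] * 4 for _ in range(4)]
print(mma_4x4(I4, I4, Z4)[0])  # [1.0, 0.0, 0.0, 0.0]
```

Accumulating in a wider format than the FP16 inputs is what lets mixed-precision training keep model quality while using half-precision multiplies.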
Turing (Compute Capability 7.5)
Combines dedicated ray-tracing (RT) cores with second-generation Tensor Cores. Each SM includes 64 CUDA cores, 8 Tensor Cores, 1 RT core, a 256 KB register file, and 4 texture units, with up to 96 KB of combined shared memory/L1 cache. The design supports concurrent FP32 and INT32 execution.
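The 256 KB register file is a hard budget that caps how many warps an SM can keep resident. A simplified occupancy calculation (the registers-per-thread values are hypothetical, and real hardware imposes additional caps, such as a maximum warp count per SM, not modeled here):

```python
REGISTER_FILE_BYTES = 256 * 1024  # per-SM register file
BYTES_PER_REGISTER = 4            # 32-bit registers
WARP_SIZE = 32

def max_warps_by_registers(regs_per_thread: int) -> int:
    """Warps that fit in the register file alone (other limits ignored)."""
    total_regs = REGISTER_FILE_BYTES // BYTES_PER_REGISTER  # 65,536
    return total_regs // (regs_per_thread * WARP_SIZE)

print(max_warps_by_registers(32))   # lean kernel
print(max_warps_by_registers(128))  # register-heavy kernel fits 4x fewer
```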
Ampere (Compute Capability 8.0)
Features the GA100 GPU with 108 SMs enabled on the A100 (128 on the full die), 64 FP32 CUDA cores per SM, 4 third-generation Tensor Cores per SM, and up to 6 HBM2 stacks. The architecture delivers up to 20× AI training speed-up (TF32 on A100 versus FP32 on V100) and 2.5× FP64 performance for scientific computing, along with Multi-Instance GPU (MIG) partitioning, third-generation NVLink, and 2:4 structural sparsity support.
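Ampere's structural sparsity relies on a 2:4 pattern: in every group of four weights, at most two are non-zero, which the Tensor Cores can then skip at runtime. A pruning sketch (our illustration of the pattern, not NVIDIA's tooling; the weight values are made up):

```python
def prune_2_4(weights):
    """Zero the two smallest-magnitude values in each group of four."""
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        keep = set(sorted(range(len(group)), key=lambda j: abs(group[j]))[-2:])
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

print(prune_2_4([0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.0, 0.3]))
# [0.9, 0.0, 0.4, 0.0, -0.7, 0.0, 0.0, 0.3]
```

Because the pattern is fixed (2 of every 4), the hardware can store the surviving weights densely plus a small index, doubling effective math throughput on sparse layers.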
All architectures share a common SIMT programming model, with each new generation adding more SMs, increasing on‑chip resources, and enhancing specialized units to meet the growing demands of AI, graphics, and high‑performance computing.
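The shared SIMT model can be illustrated with a toy warp: when lanes diverge at a branch, the hardware serializes both paths and uses an active mask so each lane commits only its own result. A simplified simulation (an 8-lane warp for readability; real warps have 32 lanes, and real hardware uses predication and reconvergence mechanisms):

```python
def simt_branch(lane_values):
    """Execute both sides of a branch; a mask selects each lane's result."""
    mask = [v % 2 == 0 for v in lane_values]       # per-lane condition
    if_path = [v // 2 for v in lane_values]        # all lanes run the "if"
    else_path = [3 * v + 1 for v in lane_values]   # ...and the "else"
    return [t if m else e for m, t, e in zip(mask, if_path, else_path)]

print(simt_branch([1, 2, 3, 4, 5, 6, 7, 8]))
# [4, 1, 10, 2, 16, 3, 22, 4]
```

This is why divergent branches within a warp cost roughly the sum of both paths, and why Volta's per-thread program counters mattered: they relax, but do not eliminate, the penalty of divergence.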
Architects' Tech Alliance
Sharing project experience and insights into cutting-edge architectures, with a focus on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, and industry practices and solutions.