From Tesla to Hopper: How NVIDIA GPU Architectures Powered the AI Revolution
This article traces the evolution of NVIDIA GPU architectures—from the early Tesla series through Fermi, Kepler, Maxwell, Pascal, Volta, Turing, Ampere, Hopper, and up to the upcoming Blackwell—explaining their hardware innovations, CUDA programming model, and how each generation enabled breakthroughs in high‑performance computing, deep learning, and AI applications.
Preface
This article walks through NVIDIA's GPU generations in chronological order, from Tesla through Fermi, Kepler, Maxwell, Pascal, Volta, Turing, Ampere, and Hopper to the upcoming Blackwell, focusing on the hardware features of each generation and on how the CUDA programming model evolved alongside the silicon.
NVIDIA GPGPU and the Birth of CUDA
In 2001, Stanford professor Bill Dally’s team introduced the Stream Processor architecture, which attracted NVIDIA’s attention. Stream processors process large data streams in parallel, emphasizing data flow, parallel computation, and high throughput.
By 2003, Bill Dally had become a consultant for NVIDIA, contributing to the Tesla architecture. In 2004, Ian Buck published "Brook for GPUs" and later joined NVIDIA to lead the CUDA project, which opened the GPU to general-purpose C/C++ programming.
CUDA 1.0 was introduced alongside the 2006 GeForce 8800 GTX (G80), which delivered 345.6 GFLOPS of peak single-precision performance.
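To make the programming model concrete, here is a minimal CUDA C++ sketch in the style CUDA introduced with G80; the kernel and variable names are illustrative, not from the original article:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread computes one output element: the SIMT model at the heart of CUDA.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMalloc(&a, bytes); cudaMalloc(&b, bytes); cudaMalloc(&c, bytes);
    // Launch enough 256-thread blocks to cover all n elements.
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The `<<<blocks, threads>>>` launch syntax maps a grid of lightweight threads onto the GPU's streaming multiprocessors, which is what lets the same code scale across generations.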
GPU/CUDA in HPC and AI
GPU hardware and CUDA software became the foundation of many supercomputers (e.g., Tianhe‑1A and the Dawning 6000 "Nebulae"). The 2012 ImageNet competition marked a turning point: AlexNet, trained with CUDA on two Fermi‑generation GTX 580 GPUs, dramatically reduced the error rate and launched the modern AI boom.
2008 Tesla Architecture
Tesla was NVIDIA's first unified-shader architecture, merging the previously separate vertex and pixel shader hardware into a single programmable processor array serving the full graphics pipeline.
1. DSA Chip Architecture
Key chip-level components include the host CPU, system memory, PCIe bridge, host interface, input assembler, vertex/pixel/compute work distribution units, and high‑definition video processors.
Each TPC (texture/processor cluster) contains a geometry controller, an SM controller, two SMs, and texture units with their L1 cache; an interconnection network links the TPCs to the raster operation pipelines, the shared L2 cache, the DRAM interfaces, and the display interface.
SM (Stream Multi‑Processor)
An SM integrates scalar and vector processing, containing an I‑Cache, instruction issue logic, a constant cache, eight streaming processors (SPs), special function units (SFUs), shared memory, and a register file.
SP (Streaming Processor)
Each SP is a scalar unit capable of FP32 add, multiply‑add, and INT32 operations.
2. New Features of Tesla
Distinct programmer-visible memory spaces: global, shared, and local.
Support for atomic read-modify-write operations (e.g., atomicAdd) on shared and global memory, as sketched below.
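A minimal sketch of how such atomics are used; the histogram kernel below is an illustrative example, not taken from the article:

```cuda
#include <cuda_runtime.h>

// Illustrative histogram: many threads may increment the same bin at once,
// so the update must be an atomic read-modify-write on global memory.
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&bins[data[i]], 1u);  // global-memory atomic (compute capability 1.1+)
    }
}
```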
2010 Fermi Architecture
Built on a 40 nm process with 512 CUDA cores, Fermi introduced a GigaThread Engine (global scheduler), multiple GPCs, dual warp schedulers and larger register files per SM, an L1 cache per SM, and a shared L2 cache.
New Features
GigaThread Engine for concurrent kernel execution (see the stream sketch after this list).
Raster and PolyMorph engines for graphics.
FMA (fused multiply‑add) instruction.
L1/L2 caches and ECC support.
Unified virtual memory.
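Concurrent kernel execution is exposed to programmers through CUDA streams; a minimal sketch, with illustrative kernel names:

```cuda
#include <cuda_runtime.h>

__global__ void kernelA() { /* independent work */ }
__global__ void kernelB() { /* independent work */ }

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    // From Fermi onward, kernels launched into different streams
    // may execute concurrently when SM resources allow.
    kernelA<<<32, 256, 0, s1>>>();
    kernelB<<<32, 256, 0, s2>>>();
    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```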
2012 Kepler Architecture
Kepler (28 nm) introduced Dynamic Parallelism, Hyper‑Q (multiple hardware queues), Grid Management Unit, Shuffle instructions for intra‑warp communication, GPU Direct RDMA, and Nsight profiling tools.
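As an example of intra-warp communication, here is a warp-level sum reduction built on shuffle; the `_sync` spelling is the modern (CUDA 9+) form of the shuffle instructions Kepler introduced:

```cuda
// Sum a value across the 32 lanes of a warp without touching shared memory:
// each step pulls a value from the lane `offset` positions higher.
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;  // lane 0 ends up holding the warp's total
}
```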
2014 Maxwell Architecture
Maxwell focused on power‑efficient graphics, restructuring the SM into the partitioned SMM design, cutting FP64 hardware back to a token rate, and improving the raster engines.
2016 Pascal Architecture
Pascal split into two product lines: the high‑performance GP100 for HPC/AI (60 SMs, 3840 CUDA cores) and the consumer‑oriented GP104. Key innovations:
HBM2 memory (up to 720 GB/s bandwidth).
FP16 support and mixed‑precision capabilities.
NVLink 1.0 (4 × 40 GB/s per GPU) and the DGX‑1 topology.
Unified Virtual Memory (UVM) for seamless CPU‑GPU memory sharing (sketched after this list).
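A minimal sketch of Pascal-era unified memory: a single `cudaMallocManaged` allocation is visible to both CPU and GPU, with pages migrated on demand (names illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float* x;
    cudaMallocManaged(&x, n * sizeof(float));     // one pointer, valid on CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;      // CPU touches the pages first...
    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);  // ...the GPU page-faults them over
    cudaDeviceSynchronize();
    printf("x[0] = %f\n", x[0]);                  // pages migrate back on CPU access
    cudaFree(x);
    return 0;
}
```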
2017 Volta Architecture
Volta (12 nm) added Tensor Cores, a larger L2 cache, and a new SM design (GV100) with separate FP32 and INT32 pipelines, enabling simultaneous execution of floating‑point and integer instructions.
New Features
Tensor Cores for matrix‑multiply‑accumulate (see the WMMA sketch after this list).
Improved warp scheduling with an independent program counter per thread.
Cooperative Groups for fine‑grained thread synchronization.
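Tensor Cores are exposed to CUDA C++ through the `nvcuda::wmma` API (CUDA 9+); this sketch has one warp multiply a single 16×16×16 FP16 tile, accumulating in FP32:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp cooperatively computes D = A * B for a 16x16x16 tile.
__global__ void wmmaTile(const half* A, const half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);    // zero the accumulator
    wmma::load_matrix_sync(a, A, 16);  // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);    // the Tensor Core matrix-multiply-accumulate
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}
```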
2018 Turing Architecture
Turing (12 nm) combined RT cores for real‑time ray tracing, second‑generation Tensor Cores adding INT8/INT4 modes, and a unified SM design; CUDA cores, Tensor Cores, and RT cores cannot all be used simultaneously.
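Low-precision integer math of the kind Turing accelerates is also exposed to ordinary CUDA code through the `__dp4a` intrinsic (a 4-way INT8 dot product with 32-bit accumulation, available since compute capability 6.1); a minimal sketch with illustrative names:

```cuda
#include <cuda_runtime.h>

// Each 32-bit word packs four signed 8-bit values; __dp4a computes
// their 4-way dot product and adds it to a 32-bit accumulator.
__global__ void dotInt8(const int* a, const int* b, int n, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(out, __dp4a(a[i], b[i], 0));
    }
}
```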
2020 Ampere Architecture
Ampere (7 nm) introduced third‑generation Tensor Cores, the TF32 data type, NVLink 3.0 (12 × 50 GB/s), MIG (Multi‑Instance GPU) hardware partitioning, and CUDA 11 asynchronous programming features (see the memcpy_async sketch after the list below).
New Features
TF32 for AI‑optimized single‑precision.
NVLink 3.0 and NVSwitch 2.0 for higher inter‑GPU bandwidth.
Warp‑level sync/reduce operations.
L2 cache residency control for kernel‑to‑kernel communication.
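One of the CUDA 11 asynchronous features noted above is `cooperative_groups::memcpy_async`, which stages global-to-shared copies without routing the data through registers (hardware-accelerated on Ampere); a minimal sketch with illustrative names:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Assumes a launch with 256-thread blocks.
__global__ void asyncStage(const float* in, float* out) {
    __shared__ float tile[256];
    auto block = cg::this_thread_block();

    // Kick off an asynchronous copy of one tile into shared memory,
    // then wait for it to land before computing on it.
    cg::memcpy_async(block, tile, in + blockIdx.x * 256, sizeof(float) * 256);
    cg::wait(block);

    out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x] * 2.0f;
}
```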
2022 Hopper Architecture
Hopper (4 nm) targets large‑scale transformer models with a dedicated Transformer Engine, an FP8 data type (up to roughly 4,000 TFLOPS with sparsity), fourth‑generation Tensor Cores (4 × 8 × 16), TMA (Tensor Memory Accelerator) for asynchronous bulk data movement, DSMEM (distributed shared memory) for direct SM‑to‑SM communication, NVLink 4.0 (18 × 50 GB/s), and NVSwitch 3.0.
New Features
FP8 for AI‑optimized precision (see the conversion sketch after this list).
Transformer Engine for fused attention/MHA operations.
DSMEM for fast SM‑to‑SM communication.
NVLink‑C2C in the Grace Hopper Superchip (GH200), enabling a unified CPU‑GPU memory space.
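CUDA 11.8 and later expose Hopper's FP8 formats through `cuda_fp8.h`; this sketch round-trips a value through the e4m3 format to show the quantization step (values illustrative):

```cuda
#include <cuda_fp8.h>
#include <cstdio>

int main() {
    float x = 3.14159f;
    __nv_fp8_e4m3 q(x);             // quantize: 4 exponent bits, 3 mantissa bits
    float back = float(q);          // dequantize back to FP32
    printf("%f -> %f\n", x, back);  // the 3-bit mantissa makes the round-trip lossy
    return 0;
}
```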
2024 Blackwell Architecture
Details are forthcoming.