From Tesla to Hopper: How NVIDIA GPU Architectures Powered the AI Revolution

Preface

This article reviews the full history of NVIDIA GPU architectures, their technical features, and the CUDA programming model, covering Tesla, Fermi, Kepler, Maxwell, Pascal, Volta, Turing, Ampere, Hopper, and the upcoming Blackwell.

GPU architecture overview

NVIDIA GPGPU and the Birth of CUDA

In 2001, Stanford professor Bill Dally’s team introduced the Stream Processor architecture, which attracted NVIDIA’s attention. Stream processors process large data streams in parallel, emphasizing data flow, parallel computation, and high throughput.

By 2003, Bill Dally had become a consultant to NVIDIA, contributing to what became the Tesla architecture. In 2004, Ian Buck published "Brook for GPUs" and went on to lead the CUDA project, which brought general-purpose C/C++ programming to GPUs.

CUDA was announced in 2006 alongside the GeForce 8800 GTX, which delivered 345.6 GFLOPS of single-precision performance; CUDA 1.0 shipped the following year.
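The core of the CUDA model has stayed the same since then: a kernel runs as a grid of thread blocks, with blocks scheduled onto SMs and threads onto their cores. A minimal sketch in modern CUDA (the `vecAdd` kernel and the managed-memory allocation are illustrative conveniences, not period-accurate CUDA 1.0 code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one element; blocks map onto SMs, threads onto SPs/cores.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));  // managed memory keeps the sketch short
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);  // grid of 256-thread blocks
    cudaDeviceSynchronize();
    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
}
```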

GPU/CUDA in HPC and AI

GPU hardware and CUDA software became the foundation of many supercomputers (e.g., Tianhe‑1, Sunway 6000). The 2012 ImageNet competition marked a turning point: AlexNet, trained with CUDA on two GeForce GTX 580 GPUs, cut the error rate dramatically and launched the modern AI boom.

2008 Tesla Architecture

Tesla, NVIDIA's first unified‑shader architecture, replaced separate vertex and pixel shaders with a single programmable processor array that executes the full graphics pipeline.

1. DSA Chip Architecture

Tesla block diagram

Key components include Host CPU, System Memory, PCIe Bridge, Host Interface, Input Assembler, Vertex Work Distribution, Pixel Work Distribution, Compute Work Distribution, High‑Definition Video Processors, and others.

Each TPC (Texture/Processor Cluster) contains a geometry controller, an SM controller, two SMs, and a texture unit with L1 cache; at the chip level sit the interconnection network, raster operation pipelines, shared L2 cache, DRAM interfaces, and the display interface.

SM (Streaming Multiprocessor)

An SM integrates scalar and vector processing, containing an I‑Cache, instruction issue logic, constant cache, 8 streaming processors (SP), special function units (SFU), shared memory, and registers.

SP (Streaming Processor)

Each SP is a scalar unit capable of FP32 add, multiply‑add, and INT32 operations.

2. New Features of Tesla

Unified memory address space (global, shared, local).

Support for atomic read‑modify‑write operations (e.g., atomic add, exchange, compare‑and‑swap) on shared and global memory.
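Atomic operations are what make data‑parallel accumulation safe when many threads hit the same address. A histogram is the classic case; this sketch (the `histogram` kernel is illustrative) uses shared‑memory atomics per block and global atomics to merge results:

```cuda
#include <cuda_runtime.h>

// atomicAdd serializes concurrent updates to the same bin — the class of
// read-modify-write operation Tesla first supported in hardware.
__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    __shared__ unsigned int local[256];
    for (int b = threadIdx.x; b < 256; b += blockDim.x) local[b] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&local[data[i]], 1u);   // shared-memory atomic
    __syncthreads();

    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);           // global-memory atomic
}
```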

2010 Fermi Architecture

Built on a 40 nm process with 512 CUDA cores, Fermi introduced a GigaThread Engine (global scheduler), multiple GPCs, and per‑SM resources such as dual warp schedulers, larger register files, and L1/L2 caches.

New Features

GigaThread Engine for concurrent kernel execution.

Raster and PolyMorph engines for graphics.

FMA (fused multiply‑add) instruction.

L1/L2 caches and ECC support.

Unified virtual memory.
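FMA computes a·b + c with a single rounding step, improving both throughput and accuracy over a separate multiply and add. A small hedged sketch (the `axpy` kernel is illustrative):

```cuda
// fmaf compiles to one fused multiply-add instruction: a*b + c is rounded
// once, versus twice for separate multiply and add.
__global__ void axpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaf(a, x[i], y[i]);  // y = a*x + y with fused rounding
}
```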

2012 Kepler Architecture

Kepler (28 nm) introduced Dynamic Parallelism, Hyper‑Q (multiple hardware queues), Grid Management Unit, Shuffle instructions for intra‑warp communication, GPU Direct RDMA, and Nsight profiling tools.
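Shuffle instructions let threads in a warp exchange register values directly, with no shared‑memory staging. The standard use is a warp‑level reduction; Kepler introduced `__shfl_down`, and the `_sync` variants shown here are the modern (CUDA 9+) form:

```cuda
// Warp-level sum via shuffle: each step halves the number of active lanes.
__inline__ __device__ float warpReduceSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;  // lane 0 ends up holding the warp's total
}
```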

2014 Maxwell Architecture

Maxwell focused on power‑efficient graphics, introducing the leaner SMM design, reducing FP64 units to a token few, and improving the raster engines.

2016 Pascal Architecture

Pascal split into two product lines: the high‑performance GP100 for HPC/AI (60 SMs, 3840 CUDA cores) and the consumer‑oriented GP104. Key innovations:

HBM2 memory (up to 720 GB/s bandwidth).

FP16 support and mixed‑precision capabilities.

NVLink 1.0 (4 × 40 GB/s per GPU) and the DGX‑1 topology.

Unified Virtual Memory (UVM) for seamless CPU‑GPU memory sharing.
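With Pascal's page‑migration engine, a single `cudaMallocManaged` pointer is valid on both CPU and GPU, and pages migrate on demand at first touch. A minimal sketch (the `scale` kernel is illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));  // one pointer, visible to CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // CPU writes...
    scale<<<(n + 255) / 256, 256>>>(x, n);     // ...GPU reads; pages migrate on demand
    cudaDeviceSynchronize();
    printf("x[0] = %f\n", x[0]);
    cudaFree(x);
}
```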

2017 Volta Architecture

Volta (12 nm) added Tensor Cores, larger L2 cache, and a new SM design (GV100) with separate FP32 and INT32 pipelines, enabling simultaneous execution of floating‑point and integer instructions.

New Features

Tensor Cores for matrix‑multiply‑accumulate.

Improved warp scheduling with independent PC per thread.

Cooperative Groups for flexible, fine‑grained thread synchronization.
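Tensor Cores are exposed to CUDA through the warp‑level `wmma` API: a whole warp cooperatively computes one matrix tile. A hedged sketch of a single 16×16×16 tile, FP16 inputs with FP32 accumulation (the `tileMma` kernel is illustrative):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16x16 tile: D = A*B + C.
__global__ void tileMma(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(acc, fa, fb, acc);    // executes on the Tensor Cores
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
```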

2018 Turing Architecture

Turing (12 nm) combined RT cores for real‑time ray tracing, Tensor Cores (2.0) supporting INT8/INT4, and a unified SM design. CUDA cores, Tensor Cores, and RT cores cannot be used simultaneously.

2020 Ampere Architecture

Ampere (7 nm) introduced Tensor Core 3.0, the TF32 data type, NVLink 3.0 (12 × 50 GB/s), MIG (Multi‑Instance GPU) for hardware partitioning, and CUDA 11 asynchronous programming features.

New Features

TF32 for AI‑optimized single‑precision.

NVLink 3.0 and NVSwitch 2.0 for higher inter‑GPU bandwidth.

Warp‑level sync/reduce operations.

L2 cache residency control for kernel‑to‑kernel communication.
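The warp‑level reduce operations collapse the shuffle loop needed on earlier architectures into a single hardware instruction (compute capability 8.0+). A short sketch (the `blockSums` kernel is illustrative):

```cuda
// __reduce_add_sync sums a value across the warp in one instruction;
// every participating lane receives the result.
__global__ void blockSums(const int *x, int *out) {
    int v = x[blockIdx.x * blockDim.x + threadIdx.x];
    int warpTotal = __reduce_add_sync(0xffffffffu, v);
    if ((threadIdx.x & 31) == 0)          // one lane per warp commits the sum
        atomicAdd(&out[blockIdx.x], warpTotal);
}
```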

2022 Hopper Architecture

Hopper (4 nm) targets large‑scale transformer models with a dedicated Transformer Engine, an FP8 data type (up to ~4,000 TFLOPS with sparsity on H100), Tensor Core 4.0 (4 × 8 × 16), TMA (Tensor Memory Accelerator) for asynchronous bulk data movement, DSMEM (distributed shared memory) over a direct SM‑to‑SM network, NVLink 4.0 (18 × 50 GB/s), and NVSwitch 3.0.

New Features

FP8 for AI‑optimized precision.

Transformer Engine for fused attention/MHA operations.

DSMEM (distributed shared memory) for fast SM‑to‑SM communication within a thread block cluster.

NVLink‑C2C in the Grace‑Hopper SuperChip (GH200) enabling unified CPU‑GPU memory space.
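DSMEM is programmed through Hopper's thread block clusters: a block can directly read a peer block's shared memory. A hedged sketch using the Cooperative Groups cluster API (the `exchange` kernel is illustrative; a real launch must set a cluster dimension, e.g. via `cudaLaunchKernelEx` or the `__cluster_dims__` attribute shown):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two-block cluster: each block publishes its rank in shared memory and
// reads its neighbor's copy through the DSMEM window.
__global__ void __cluster_dims__(2, 1, 1) exchange(int *out) {
    __shared__ int val;
    cg::cluster_group cluster = cg::this_cluster();
    if (threadIdx.x == 0) val = (int)cluster.block_rank();
    cluster.sync();  // make every block's shared memory visible cluster-wide

    int peer = (cluster.block_rank() + 1) % cluster.num_blocks();
    int *remote = cluster.map_shared_rank(&val, peer);  // pointer into peer's SMEM
    if (threadIdx.x == 0) out[cluster.block_rank()] = *remote;
    cluster.sync();  // keep peer shared memory alive until all reads finish
}
```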

2024 Blackwell Architecture

Details are forthcoming.


Written by AI Cyberspace (AI, big data, cloud computing, and networking).