How NVIDIA’s Fermi Architecture Revolutionized GPU Computing: Key Improvements Explained
Fermi, NVIDIA’s 2010 GPU architecture, introduced major upgrades over the Tesla line—including a 40 nm process, vastly increased transistor count, GDDR5 memory, L2 cache, enhanced FP64 performance, ECC support, and unified CPU‑GPU addressing—making it the first truly complete GPU computing platform.
1. Fermi Architecture Improvements
1.1 Process Technology
The original Tesla G80 used TSMC 90 nm, limiting transistor count to about 700 million and SM count to 16. GT200 (2008) moved to 65 nm (55 nm in the GT200b refresh), doubling transistors to ~1.4 billion and increasing CUDA cores from 128 to 240. Fermi (GF100, 2010) adopted a 40 nm process, raising transistor count to ~3 billion and CUDA cores to 512.
1.2 Memory System
Fermi upgraded the memory interface to GDDR5, raising effective data rates from 2.2 Gbps (GDDR3) to 3.1 Gbps and adding ECC support for greater reliability in scientific computing.
1.3 Cache System
Unlike the first‑generation Tesla, which lacked L2 cache, Fermi introduced a 768 KB L2 cache shared by all SMs, reducing DRAM accesses. Per‑SM on‑chip memory grew from 16 KB of shared memory to 64 KB that can be split between shared memory and L1 cache (48 KB shared + 16 KB L1, or 16 KB shared + 48 KB L1).
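On Fermi the split is chosen per kernel from host code via the CUDA runtime call cudaFuncSetCacheConfig. A minimal sketch (the kernel body and launch parameters are placeholders):

```cuda
__global__ void kernel(float *data) { /* ... */ }

// Request the 16 KB shared + 48 KB L1 configuration for this kernel;
// cudaFuncCachePreferShared requests the opposite 48 KB + 16 KB split.
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferL1);
kernel<<<blocks, threads>>>(data);
```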
1.4 Compute Units
Fermi replaced the original streaming processors (SPs) with CUDA cores that support fused multiply‑add (FMA), which rounds once after the combined multiply and add rather than twice. Each core sustains one FP32 FMA (two floating‑point operations) per cycle and one FP64 operation per cycle, dramatically improving double‑precision performance.
1.5 Scheduler Units
Each SM contains two Warp Schedulers and two Dispatch Units, allowing two warps to be issued concurrently, which boosts parallel execution efficiency.
2. Tesla M2070 Performance
2.1 FP32 Performance
FP32 throughput = core clock × number of CUDA cores × FP32 operations per CUDA core per cycle. With a shader clock of 1.15 GHz and 14 active SMs (448 CUDA cores), the M2070 achieves 1.15 GHz × 448 × 2 ≈ 1,030.4 GFLOPS, about three times the performance of the previous G80‑based GeForce 8800 Ultra.
2.2 FP64 Performance
FP64 throughput = core clock × number of CUDA cores × FP64 operations per CUDA core per cycle. The same configuration yields 1.15 GHz × 448 × 1 ≈ 515.2 GFLOPS of double‑precision throughput.
2.3 Memory Bandwidth
Memory bandwidth = bus width × data rate ÷ 8. G80's 384‑bit GDDR3 at 2.2 Gbps provides 105.6 GB/s, while the M2070's 384‑bit GDDR5 at 3.1 Gbps raises the bandwidth to 148.8 GB/s.
3. CUDA Optimizations
Fermi introduced a unified virtual address space that places host and device memory in a single address space. This is the foundation on which CUDA later built managed memory: with cudaMallocManaged() (available since CUDA 6), developers allocate memory accessible from both CPU and GPU without explicit copies.
float *data;
cudaMallocManaged(&data, N * sizeof(float)); // unified allocation
for (int i = 0; i < N; i++) data[i] = i; // CPU initializes
kernel<<<blocks, threads>>>(data);
cudaDeviceSynchronize(); // wait for the kernel; the data transfer is implicit
4. Conclusion
Double‑Precision Acceleration: FP64 performance is dramatically increased, making GPUs competitive with CPUs for scientific workloads.
ECC Memory: Hardware error correction provides data‑center‑grade reliability.
Unified Memory Addressing: Eliminates manual data transfers between host and device.
L2 Cache: Improves memory access latency and bandwidth.
Refining Core Development Skills
Fei has over 10 years of development experience at Tencent and Sogou. Through this account, he shares his deep insights on performance.
