How NVIDIA’s Fermi Architecture Revolutionized GPU Computing: Key Improvements Explained
Fermi, NVIDIA’s 2010 GPU architecture, introduced major upgrades over the Tesla line—including a 40 nm process, vastly increased transistor count, GDDR5 memory, L2 cache, enhanced FP64 performance, ECC support, and unified CPU‑GPU addressing—making it the first truly complete GPU computing platform.
1. Fermi Architecture Improvements
1.1 Process Technology
The original Tesla G80 used TSMC 90 nm, limiting transistor count to about 700 million and SM count to 16. GT200 (2008) moved to 65 nm (55 nm in the GT200b refresh), doubling transistors to ~1.4 billion and increasing CUDA cores from 128 to 240. Fermi (GF100, 2010) adopted a 40 nm process, raising transistor count to ~3 billion and CUDA cores to 512.
1.2 Memory System
Fermi upgraded the memory interface to GDDR5, raising effective data rates from 2.2 Gbps (GDDR3) to 3.1 Gbps and adding ECC support for greater reliability in scientific computing.
1.3 Cache System
Unlike the first‑generation Tesla, which lacked L2 cache, Fermi introduced a 768 KB L2 cache shared by all SMs, reducing DRAM accesses. Per‑SM on‑chip memory grew from 16 KB of shared memory to 64 KB that can be split between shared memory and L1 cache (48 KB shared + 16 KB L1, or 16 KB shared + 48 KB L1).
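On Fermi the split is chosen per kernel from host code via the CUDA runtime call cudaFuncSetCacheConfig. A minimal sketch (the kernel body and launch parameters are placeholders):

```cuda
__global__ void kernel(float *data) { /* ... */ }

// Request the 16 KB shared + 48 KB L1 configuration for this kernel;
// cudaFuncCachePreferShared requests the opposite 48 KB + 16 KB split.
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferL1);
kernel<<<blocks, threads>>>(data);
```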
1.4 Compute Units
Fermi replaced the original streaming processors (SPs) with CUDA cores that support fused multiply‑add (FMA), which rounds once after the combined multiply and add rather than twice. Each core sustains one FP32 FMA (two floating‑point operations) per cycle and one FP64 operation per cycle, dramatically improving double‑precision performance.
1.5 Scheduler Units
Each SM contains two Warp Schedulers and two Dispatch Units, allowing two warps to be issued concurrently, which boosts parallel execution efficiency.
2. Tesla M2070 Performance
2.1 FP32 Performance
FP32 throughput = core clock × number of CUDA cores × FP32 operations per CUDA core per cycle. With a shader clock of 1.15 GHz and 14 active SMs (448 CUDA cores), the M2070 achieves 1.15 GHz × 448 × 2 ≈ 1,030.4 GFLOPS, about three times the performance of the previous G80‑based GeForce 8800 Ultra.
2.2 FP64 Performance
FP64 throughput = core clock × number of CUDA cores × FP64 operations per CUDA core per cycle. The same configuration yields 1.15 GHz × 448 × 1 ≈ 515.2 GFLOPS of double‑precision throughput.
2.3 Memory Bandwidth
Memory bandwidth = bus width × data rate ÷ 8. G80's 384‑bit GDDR3 at 2.2 Gbps provides 105.6 GB/s, while the M2070's 384‑bit GDDR5 at 3.1 Gbps raises the bandwidth to 148.8 GB/s.
3. CUDA Optimizations
Fermi introduced a unified virtual address space that places host and device memory in a single address space. This is the foundation on which CUDA later built managed memory: with cudaMallocManaged() (available since CUDA 6), developers allocate memory accessible from both CPU and GPU without explicit copies.
float *data;
cudaMallocManaged(&data, N * sizeof(float)); // unified allocation
for (int i = 0; i < N; i++) data[i] = i; // CPU initializes
kernel<<<blocks, threads>>>(data);
cudaDeviceSynchronize(); // wait for the kernel; the data transfer is implicit
4. Conclusion
Double‑Precision Acceleration: FP64 performance is dramatically increased, making GPUs competitive with CPUs for scientific workloads.
ECC Memory: Hardware error correction provides data‑center‑grade reliability.
Unified Memory Addressing: Eliminates manual data transfers between host and device.
L2 Cache: Improves memory access latency and bandwidth.
Refining Core Development Skills
Fei has over 10 years of development experience at Tencent and Sogou. Through this account, he shares his deep insights on performance.
