How Kepler Boosted GPU Performance: Architecture, Specs, and Compute Power
This article examines NVIDIA's Kepler GPU architecture, highlighting its 28 nm process, increased transistor count, expanded CUDA core count, PCIe 3.0 support, enhanced memory hierarchy, new compute units, scheduling improvements like Hyper‑Q, and performance metrics of the Tesla K20X, illustrating the substantial gains over previous generations.
1. Kepler Architecture Improvements
1.1 Process Technology
NVIDIA's first Tesla-architecture GPU, the G80, was built on TSMC's 90 nm process; the Fermi GF100 used 40 nm; and the 2012 Kepler architecture moved to 28 nm, allowing far more transistors and functional units on the chip.
We use the GK110 variant as a representative example.
| Model | G80 (Tesla, 2006) | GT200 (Tesla, 2008) | GF100 (Fermi, 2010) | GK110 (Kepler, 2012) |
| --- | --- | --- | --- | --- |
| Process | TSMC 90 nm | TSMC 55 nm | TSMC 40 nm | TSMC 28 nm |
| Transistor count | ~0.7 B | ~1.4 B | ~3.0 B | ~7.1 B |
| SM count | 16 | 30 | 16 | 15 SMX |
| CUDA cores | 128 (8/SM) | 240 (8/SM) | 512 (32/SM) | 2880 (192/SM) |
The 28 nm process raised the transistor count to 7.1 billion while the SM count stayed similar; however, each SM now contains 192 CUDA cores, a 6× increase over Fermi's 32.
1.2 External Interconnect
GPU‑CPU communication goes over PCI Express. Fermi supported PCIe 2.0 (5 GT/s per lane), whereas Kepler upgrades to PCIe 3.0 at 8 GT/s per lane, roughly doubling the usable bandwidth of an x16 link.
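To put a number on the link in practice, a host-to-device copy can be timed with CUDA events. This is a minimal sketch; the 256 MB buffer size and the use of pinned memory are illustrative choices, not details from the article.

```cpp
// pcie_bw.cu -- rough host-to-device bandwidth check (illustrative sizes)
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256u << 20;      // 256 MB test buffer (arbitrary choice)
    void *host = nullptr, *dev = nullptr;
    cudaMallocHost(&host, bytes);         // pinned memory so the DMA engine can stream it
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D bandwidth: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```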
1.3 Memory System
Kepler retains the shared memory, L1, L2, and DRAM hierarchy of Fermi but adds a 48 KB read‑only data cache per SMX. L2 cache capacity doubled from 768 KB to 1536 KB, and six memory controllers connect to DRAM. The 64 KB of on‑chip memory per SMX can be split between shared memory and L1 cache in three ways (a configuration sketch follows the list):
48 KB shared memory + 16 KB L1 cache
16 KB shared memory + 48 KB L1 cache
32 KB shared memory + 32 KB L1 cache (new configuration)
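As a minimal sketch of how a kernel opts into one of these splits, and into the new read‑only path, the CUDA runtime exposes cudaFuncSetCacheConfig and the __ldg() intrinsic (compute capability 3.5+). The kernel and array names below are made up for illustration; they are not code from the article.

```cpp
// cache_config.cu -- choosing a shared-memory/L1 split and using the read-only data cache
#include <cuda_runtime.h>

// __ldg() (or a const __restrict__ pointer) routes loads through the 48 KB read-only
// data cache on Kepler GK110 (requires compute capability 3.5).
__global__ void scale(const float* __restrict__ in, float* out, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * __ldg(&in[i]);
}

void configure() {
    // Per-kernel preference: PreferShared = 48 KB shared + 16 KB L1,
    // PreferL1 = 16 KB shared + 48 KB L1, PreferEqual = the new 32 KB / 32 KB split.
    cudaFuncSetCacheConfig(scale, cudaFuncCachePreferEqual);
}
```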
1.4 Compute Units
SM (streaming multiprocessor) is the core of parallel execution.
Tesla: 8 SP per SM (later called CUDA cores)
Fermi: 32 CUDA cores per SM
Kepler’s SMX contains:
192 single‑precision CUDA cores
64 double‑precision (FP64) units
32 SFU (special function units)
32 load/store units
FP64 units are independent, giving each SM 64 DP units, dramatically improving double‑precision performance.
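A small sketch of querying these per-chip resources at runtime with cudaGetDeviceProperties. Note that the 192‑cores‑per‑SMX multiplier is hard‑coded here from the table above, since the runtime API does not report cores per SM directly.

```cpp
// smx_query.cu -- print SMX count and derive the CUDA core count for a Kepler (CC 3.x) part
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);

    // 192 single-precision cores per SMX is specific to Kepler (compute capability 3.x).
    const int coresPerSM = 192;
    printf("%s: CC %d.%d, %d SMX, %d CUDA cores, %d KB L2\n",
           p.name, p.major, p.minor,
           p.multiProcessorCount,
           p.multiProcessorCount * coresPerSM,
           p.l2CacheSize / 1024);
    return 0;
}
```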
1.5 Scheduling Capability
Kepler adds more warp schedulers and instruction dispatch units: each SMX has four warp schedulers and eight instruction dispatch units, so four warps can be selected per cycle, each issuing up to two independent instructions.
It also introduces Hyper‑Q, providing 32 hardware work queues so multiple CPU cores can feed the GPU concurrently, reducing idle time and increasing utilization.
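A minimal sketch of the usage pattern Hyper‑Q is meant to help: independent kernels launched into separate CUDA streams, which the hardware can map onto its 32 work queues instead of serializing them on a single queue as on Fermi. The dummy kernel and the choice of eight streams are purely illustrative.

```cpp
// hyperq_streams.cu -- independent kernels on separate streams can fill Hyper-Q's hardware queues
#include <cuda_runtime.h>

__global__ void busywork(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int kStreams = 8, n = 1 << 20;
    cudaStream_t streams[kStreams];
    float* buf[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        // Each launch goes to its own stream, so the kernels are free to run concurrently.
        busywork<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```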
2. Tesla K20X Performance
2.1 FP32 Performance
FP32 peak performance is calculated as:
FP32 Performance = Core Frequency × CUDA Core Count × Operations per Clock

The K20X runs at 732 MHz with 14 active SMXs, totaling 2688 CUDA cores. Each core performs two FP32 operations per clock (FMA).

FP32 Performance = 0.732 GHz × 2688 × 2 = 3.935 TFLOPS

2.2 FP64 Performance
Kepler’s SMX has a 1:3 ratio of FP64 to FP32 units (64 DP units per SM). With 14 SMs, the K20X provides 896 DP units.
FP64 Performance = 0.732 GHz × 896 × 2 = 1.312 TFLOPS

2.3 Memory Bandwidth
Kepler still uses GDDR5 but raises the memory clock from 783 MHz (Fermi) to 1300 MHz, increasing effective data rate from 3.1 Gbps to 5.2 Gbps.
Memory Bandwidth = 384‑bit × 5.2 Gbps / 8 = 249.6 GB/s
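The three peak figures above follow directly from the quoted clock, unit counts, and bus width; the short host-side program below simply reproduces that arithmetic from the numbers already stated in this section.

```cpp
// k20x_peaks.cpp -- reproduce the K20X peak figures from the quoted inputs
#include <cstdio>

int main() {
    const double clockGHz     = 0.732;      // core clock
    const int    fp32Cores    = 14 * 192;   // 2688 CUDA cores across 14 active SMXs
    const int    fp64Units    = 14 * 64;    // 896 FP64 units
    const double busBits      = 384.0;      // memory bus width
    const double dataRateGbps = 5.2;        // effective GDDR5 data rate

    printf("FP32 peak : %.3f TFLOPS\n", clockGHz * fp32Cores * 2 / 1000.0);  // ~3.935
    printf("FP64 peak : %.3f TFLOPS\n", clockGHz * fp64Units * 2 / 1000.0);  // ~1.312
    printf("Bandwidth : %.1f GB/s\n",   busBits * dataRateGbps / 8.0);       // 249.6
    return 0;
}
```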
Conclusion

Process technology moved from 40 nm (Fermi) to 28 nm, raising transistor count to 7.1 billion.
PCIe 3.0 support increases x16 bandwidth from 5 GT/s to 8 GT/s.
L2 cache doubled to 1536 KB; a 48 KB read‑only cache was added.
SMX now holds 192 CUDA cores, 64 DP units, and more cache, boosting compute and memory throughput.
Scheduling improvements (more warp schedulers, Hyper‑Q) reduce CPU‑GPU bottlenecks.
FP32 performance rose from 1.03 TFLOPS (Fermi) to 3.935 TFLOPS; memory bandwidth grew from 148.8 GB/s to 249.6 GB/s.
Next, the Maxwell architecture will be explored.
Refining Core Development Skills
Fei has over 10 years of development experience at Tencent and Sogou. Through this account, he shares his deep insights on performance.