
How Kepler Boosted GPU Performance: Architecture, Specs, and Compute Power

This article examines NVIDIA's Kepler GPU architecture, highlighting its 28 nm process, increased transistor count, expanded CUDA core count, PCIe 3.0 support, enhanced memory hierarchy, new compute units, scheduling improvements like Hyper‑Q, and performance metrics of the Tesla K20X, illustrating the substantial gains over previous generations.


1. Kepler Architecture Improvements

1.1 Process Technology

NVIDIA's first Tesla-architecture GPU, the G80, was fabricated on TSMC's 90 nm process; the Fermi GF100 moved to 40 nm; and the 2012 Kepler architecture moved to 28 nm, allowing far more transistors and functional units on a single die.

We use the GK110 variant as a representative example.

| Model | Process | Transistor count | SM count | CUDA cores |
| --- | --- | --- | --- | --- |
| G80 (Tesla, 2006) | TSMC 90 nm | ~0.7 B | 16 | 128 (8/SM) |
| GT200 (Tesla, 2008) | TSMC 55 nm | ~1.4 B | 30 | 240 (8/SM) |
| GF100 (Fermi, 2010) | TSMC 40 nm | ~3.0 B | 16 | 512 (32/SM) |
| GK110 (Kepler, 2012) | TSMC 28 nm | ~7.1 B | 15 (SMX) | 2880 (192/SMX) |

The 28 nm process raised the transistor count to 7.1 billion while the SM count stayed similar; however, each SMX now contains 192 CUDA cores, a 6× increase over Fermi's 32 per SM.
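The per-SM figures above can be sanity-checked with a few lines of arithmetic (a quick sketch, using only the numbers quoted in the table):

```python
# Sanity-check CUDA cores per SM for each generation listed above.
specs = {
    "G80 (Tesla)":    {"sms": 16, "cores": 128},
    "GT200 (Tesla)":  {"sms": 30, "cores": 240},
    "GF100 (Fermi)":  {"sms": 16, "cores": 512},
    "GK110 (Kepler)": {"sms": 15, "cores": 2880},
}

per_sm = {name: s["cores"] // s["sms"] for name, s in specs.items()}
print(per_sm)
# {'G80 (Tesla)': 8, 'GT200 (Tesla)': 8, 'GF100 (Fermi)': 32, 'GK110 (Kepler)': 192}

# Kepler's SMX has 6x the CUDA cores of a Fermi SM.
print(per_sm["GK110 (Kepler)"] // per_sm["GF100 (Fermi)"])  # 6
```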

1.2 External Interconnect

GPU‑CPU communication runs over PCI Express. Fermi supported PCIe 2.0 (5 GT/s per lane with 8b/10b encoding), whereas Kepler upgraded to PCIe 3.0, which signals at 8 GT/s per lane with the more efficient 128b/130b encoding, nearly doubling usable x16 bandwidth.
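The practical effect of the link upgrade is easy to work out from the raw transfer rate and the line-encoding overhead (a rough sketch; PCIe 2.0 uses 8b/10b encoding and PCIe 3.0 uses 128b/130b):

```python
# Usable one-direction x16 bandwidth:
# transfer rate (GT/s) x encoding efficiency x lanes / 8 bits per byte.
def pcie_x16_gb_s(gt_per_s, payload_bits, total_bits, lanes=16):
    return gt_per_s * (payload_bits / total_bits) * lanes / 8

pcie2 = pcie_x16_gb_s(5, 8, 10)     # PCIe 2.0, 8b/10b encoding
pcie3 = pcie_x16_gb_s(8, 128, 130)  # PCIe 3.0, 128b/130b encoding
print(f"PCIe 2.0 x16: {pcie2:.2f} GB/s")  # 8.00 GB/s
print(f"PCIe 3.0 x16: {pcie3:.2f} GB/s")  # 15.75 GB/s
```

Note that PCIe 3.0 nearly doubles usable bandwidth even though the raw rate rises only from 5 GT/s to 8 GT/s, because the lighter 128b/130b encoding wastes far less of the link.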

1.3 Memory System

Kepler retains Fermi's hierarchy of shared memory, L1 cache, L2 cache, and DRAM, and adds a 48 KB read‑only data cache per SMX. L2 capacity doubled from 768 KB to 1536 KB, and six memory controllers connect the chip to DRAM.

As on Fermi, each SMX's 64 KB of on‑chip storage can be split between shared memory and L1 cache:

48 KB shared memory + 16 KB L1 cache

16 KB shared memory + 48 KB L1 cache

32 KB shared memory + 32 KB L1 cache (new on Kepler)

1.4 Compute Units

The SM (streaming multiprocessor) is the basic unit of parallel execution.

Tesla: 8 SP per SM (later called CUDA cores)

Fermi: 32 CUDA cores per SM

Kepler’s SMX contains:

192 single‑precision CUDA cores

64 double‑precision DP units

32 SFU (special function units)

32 load/store units

Unlike Fermi, which ran double‑precision work on its regular CUDA cores, Kepler gives each SMX 64 dedicated FP64 units, dramatically improving double‑precision throughput.

1.5 Scheduling Capability

Kepler adds more warp schedulers and instruction dispatch units: each SMX has four warp schedulers and eight dispatch units, so four warps can each issue two independent instructions per cycle.

It also introduces Hyper‑Q, providing 32 hardware work queues so multiple CPU cores can feed the GPU concurrently, reducing idle time and increasing utilization.

2. Tesla K20X Performance

2.1 FP32 Performance

FP32 peak performance is calculated as:

FP32 Performance = Core Frequency × CUDA Core Count × Operations per Clock

The K20X runs at 732 MHz with 14 active SMs, totaling 2688 CUDA cores. Each core performs two FP32 operations per clock (FMA).

FP32 Performance = 0.732 GHz × 2688 × 2 = 3.935 TFLOPS
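Plugging the K20X numbers into the formula (a quick arithmetic sketch, using the figures quoted above):

```python
# Peak FP32 throughput: clock (GHz) x CUDA cores x 2 ops per clock (FMA).
clock_ghz = 0.732
cuda_cores = 14 * 192        # 14 active SMX x 192 cores each = 2688
fp32_gflops = clock_ghz * cuda_cores * 2
print(f"{fp32_gflops / 1000:.3f} TFLOPS")  # 3.935 TFLOPS
```

The factor of 2 comes from fused multiply‑add (FMA), which counts as two floating‑point operations per clock.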

2.2 FP64 Performance

Kepler’s SMX has a 1:3 ratio of FP64 to FP32 units (64 DP units per SM). With 14 SMs, the K20X provides 896 DP units.

FP64 Performance = 0.732 GHz × 896 × 2 = 1.312 TFLOPS
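The same formula applies to double precision, substituting the DP unit count (a quick sketch from the figures above):

```python
# Peak FP64 throughput: clock (GHz) x DP units x 2 ops per clock (FMA).
clock_ghz = 0.732
dp_units = 14 * 64           # 14 active SMX x 64 DP units each = 896
fp64_gflops = clock_ghz * dp_units * 2
print(f"{fp64_gflops / 1000:.3f} TFLOPS")  # 1.312 TFLOPS
```

Note that 896 is exactly one third of 2688, so peak FP64 is one third of peak FP32, matching the 1:3 unit ratio.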

2.3 Memory Bandwidth

Kepler still uses GDDR5 but raises the memory clock from 783 MHz (Fermi) to 1300 MHz, increasing effective data rate from 3.1 Gbps to 5.2 Gbps.

Memory Bandwidth = 384‑bit × 5.2 Gbps / 8 = 249.6 GB/s

Conclusion

Process technology moved from 40 nm (Fermi) to 28 nm, raising transistor count to 7.1 billion.

PCIe 3.0 support increases x16 bandwidth from 5 GT/s to 8 GT/s.

L2 cache doubled to 1536 KB; a 48 KB read‑only cache was added.

SMX now holds 192 CUDA cores, 64 DP units, and more cache, boosting compute and memory throughput.

Scheduling improvements (more warp schedulers, Hyper‑Q) reduce CPU‑GPU bottlenecks.

FP32 performance rose from 1.03 TFLOPS (Fermi) to 3.935 TFLOPS; memory bandwidth grew from 148.8 GB/s to 249.6 GB/s.

Next, the Maxwell architecture will be explored.

Written by Refining Core Development Skills

Fei has over 10 years of development experience at Tencent and Sogou. Through this account, he shares his deep insights on performance.
