Fundamentals of GPU Architecture and Programming
The article explains GPU fundamentals—from the end of Dennard scaling and why GPUs excel in parallel throughput, through CUDA programming basics like the SAXPY kernel and SIMT versus SIMD execution, to the evolution of the SIMT stack, modern scheduling, and a three‑step core architecture design.
Author: leowwlwang
"How much did you pay for your 4090?" "The H100's performance is amazing!" GPU price-performance is a hot topic, but beyond the benchmark numbers the author wants to understand the underlying principles. This article compiles the author's notes on how GPUs operate, their programming model, and their architectural design, to share with readers.
1. Introduction
Why GPU?
GPUs are chosen not only for raw speed but also for their combination of generality and efficiency, making them the preferred platform for high‑performance computing.
In 1974 Dennard et al. formulated what became known as Dennard scaling: as transistor feature size shrinks, power density stays constant. Concretely, supply voltage falls linearly with feature size and current density remains stable, so power per unit area is independent of transistor size, allowing more transistors on a chip without increasing overall power consumption.
From the 1970s to the early 2000s, Dennard scaling drove performance growth. Around 2005-2007, leakage currents (driven in part by quantum tunneling at tiny feature sizes) rose sharply, the scaling law broke down, and the "power wall" appeared.
Specialized hardware such as Google’s TPU emerged, but they target narrow workloads. GPUs, originally designed for graphics, evolved a highly parallel SIMT (Single Instruction, Multiple Threads) architecture that now serves general‑purpose computing, from CUDA‑based deep‑learning training to OpenCL‑accelerated fluid simulation.
GPU’s "Speed"
Why is it fast?
High compute density: a larger share of the die is devoted to stream processors (CUDA cores) and less to control logic, yielding higher throughput per unit area.
Memory latency hiding: by keeping many threads resident per core, the GPU hides global-memory latency; when a warp stalls on a memory access, the scheduler switches to another ready warp.
Specialized memory and compute units: high-bandwidth memory (GDDR6, HBM) and dedicated units such as Tensor Cores accelerate specific workloads.
How fast?
Theoretical FLOPS: typically expressed in TFLOPS, using the common CUDA-core formula.
# CUDA-core method
FLOPS = number of CUDA cores × boost clock × floating-point ops per core per cycle
# Taking the A100 as an example
A100 FP32 throughput = 6912 × 1.41 GHz × 2 (an FMA counts as 2 ops) = 19491.84 GFLOPS ≈ 19.5 TFLOPS

Practical measurement: tools such as GPU-Z (real-time monitoring) and 3DMark (benchmarking) are recommended. A large GEMM (general matrix multiplication) can also be used to approach the theoretical peak.
FLOPS = total operations / execution time; for C = A(M, K) × B(K, N), the operation count is 2 × M × N × K, so FLOPS = 2 × M × N × K / T

2. GPU Programming
This section shows how to write non‑graphics GPU code, focusing on the programming model rather than a tutorial.
Program execution example: SAXPY
SAXPY computes y = a·x + y. First, the CPU version in C:
// SAXPY implementation (CPU)
void saxpy(int n, float a, float *x, float *y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    float a = 2.0f;
    int n;
    float *x; // host vector x
    float *y; // host vector y
    // ... allocate, fill, set n ...
    saxpy(n, a, x, y);
    return 0;
}

Converted to CUDA to run on the GPU:
__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    float a = 2.0f;
    int n;
    float *hx; // host x
    float *hy; // host y
    // ... allocate, fill, set n ...
    int vector_size = n * sizeof(float);
    float *dx; // device x
    float *dy; // device y
    cudaMalloc(&dx, vector_size);
    cudaMalloc(&dy, vector_size);
    cudaMemcpy(dx, hx, vector_size, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, vector_size, cudaMemcpyHostToDevice);
    int t = 256; // threads per block
    int blocks_num = (n + t - 1) / t; // round up so every element is covered
    saxpy<<<blocks_num, t>>>(n, a, dx, dy);
    cudaMemcpy(hy, dy, vector_size, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return 0;
}

The __global__ qualifier marks a kernel that runs on the device; main runs on the host. cudaMalloc and cudaMemcpy manage GPU memory, and cudaFree releases it. Unified memory (cudaMallocManaged) is a newer alternative that lets host and device share a single pointer.
Thread organization
CUDA uses a hierarchy: thread → thread block → grid. A warp (32 threads) is the hardware execution unit. The article shows a 3‑D launch example:
// host-side launch code
void launch_kernel_3d() {
    int dimX = 64, dimY = 32, dimZ = 16;
    dim3 blockSize(8, 4, 4); // 128 threads per block
    dim3 gridSize((dimX + blockSize.x - 1) / blockSize.x,
                  (dimY + blockSize.y - 1) / blockSize.y,
                  (dimZ + blockSize.z - 1) / blockSize.z);
    kernel_3d<<<gridSize, blockSize>>>(d_data, dimX, dimY, dimZ);
}

__global__ void kernel_3d(float* data, int dimX, int dimY, int dimZ) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < dimX && y < dimY && z < dimZ) {
        int idx = x + y * dimX + z * dimX * dimY; // linearize 3-D coordinates
        data[idx] *= 2.0f;
    }
}

SIMT vs SIMD
SIMT (Single Instruction, Multiple Threads) abstracts SIMD (Single Instruction, Multiple Data) by assigning each thread its own registers and program counter, while hardware still executes groups of 32 threads (warps) in lock‑step.
Branch divergence occurs when threads in the same warp take different paths. The GPU serializes the divergent paths, using a mask to keep inactive threads idle.
SIMT Stack (pre‑Volta)
Early GPUs used a hardware stack to track branch convergence points. The stack stores:
Convergence PC (where the paths re‑join)
Next‑PC for each path
Active mask (which threads are active for the path)
When a divergence is encountered, the convergence point is pushed, then each path is pushed. Execution proceeds depth‑first, popping when a path finishes.
Problems with the stack
Fixed depth limits (4‑8 levels) cannot handle deep control flow.
Each warp needs its own stack, consuming registers.
Frequent push/pop adds latency.
LIFO order can cause load imbalance or deadlock.
Independent Thread Scheduling (Volta and later)
From Volta onward each thread has its own PC and active flag, allowing divergent threads to progress simultaneously while still being scheduled in warp granularity.
Stackless Branch Reconvergence
Volta replaces the stack with convergence barriers. Threads register their participation (ADD) and later wait (WAIT) at the reconvergence point. This uses only a 32‑bit mask per warp, eliminating depth limits and reducing hardware cost.
3. Three‑step GPU core design
Step 1 – Minimal usable system
Components: Fetch, Decode, SIMT Stack, Issue, ALU, MEM. This executes one instruction at a time.
Step 2 – Dynamic instruction scheduling
Introduce I‑Cache, I‑Buffer, and a ScoreBoard to track RAW/WAR/WAW hazards. The ScoreBoard records which registers are pending writes; an instruction can be issued only when its source registers are free.
// Simplified scoreboard operation (conceptual)
if (((srcRegs | dstReg) & ScoreBoard) == 0) { // no pending write on any
                                              // source (RAW) or the destination (WAW)
    issueInstruction();
    ScoreBoard |= dstReg;  // mark destination as pending
}
// on write-back
ScoreBoard &= ~dstReg;     // clear the pending bit

Step 3 – High‑throughput data supply
Register files are banked. Each bank is single‑ported; multiple banks allow parallel accesses. Bank conflicts are mitigated by:
Interleaved register allocation
Dynamic bank allocation
Compiler‑driven register assignment
Hybrid banking schemes
When conflicts still occur, the Operand Collector buffers pending reads and schedules them in later cycles, ensuring the ALU receives all operands without stalling the pipeline.
4. Summary
The article traced GPU evolution from the end of Dennard scaling to modern high‑performance GPUs, explained why GPUs are fast, introduced CUDA programming with a SAXPY example, compared SIMT and SIMD, detailed the historical SIMT stack and its modern replacements, and finally built a conceptual three‑step GPU core architecture covering instruction fetch, dependency tracking, and banked register access.
These fundamentals provide a solid foundation for readers to explore deeper GPU topics such as ray tracing, AI inference, and upcoming architectural innovations.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.