Why GPUs Really Matter: From Architecture Basics to CUDA Programming
This article explains why GPUs have become the preferred platform for high‑performance computing. It covers Dennard scaling and its end, the sources of the GPU's speed advantage, theoretical FLOPS calculations, a CUDA programming example (SAXPY), the SIMT execution model, the SM instruction pipeline, and modern techniques for handling branch divergence and register bank conflicts.
Introduction
Discussion of GPU price/performance, Dennard scaling, the end of scaling, and why GPUs became the preferred platform for high‑performance computing.
Why GPUs?
GPUs combine generality and efficiency, offering massive parallelism (SIMT), high memory bandwidth, and specialized units such as Tensor cores.
Theoretical and Measured Performance
Theoretical peak FLOPS = CUDA core count × clock frequency × operations per core per cycle (2 for a fused multiply‑add). Example: the NVIDIA A100, with 6912 FP32 CUDA cores at a ~1.41 GHz boost clock, reaches 6912 × 1.41 GHz × 2 ≈ 19.5 TFLOPS (FP32).
Dennard Scaling
Dennard scaling: as transistor feature size shrinks by a factor k, supply voltage and current both scale down by k, so power per transistor falls by k²; since transistor density rises by k², power density stays constant. In simple terms: smaller transistors use proportionally less power, so clock frequencies could rise without the chip running hotter. When this scaling broke down (leakage current prevents voltage from dropping further), the industry turned to parallelism — and to GPUs.
GPU Programming Basics
Shows a CPU implementation of SAXPY and its CUDA counterpart, illustrating host‑device memory allocation, kernel launch syntax, and thread indexing.
// CPU implementation of SAXPY: y = a*x + y
void saxpy(int n, float a, float *x, float *y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    float a = 2.0f;
    // allocate and initialize vectors x and y (length n) ...
    saxpy(n, a, x, y);
    return 0;
}

// CUDA kernel: one thread computes one element
__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may have more threads than elements
        y[i] = a * x[i] + y[i];
    }
}
int main() {
    float a = 2.0f;
    // allocate and initialize host vectors hx, hy (length n) ...
    float *dx, *dy;  // device vectors
    size_t vector_size = n * sizeof(float);
    cudaMalloc(&dx, vector_size);
    cudaMalloc(&dy, vector_size);
    cudaMemcpy(dx, hx, vector_size, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, vector_size, cudaMemcpyHostToDevice);
    int t = 256;                       // threads per block
    int blocks_num = (n + t - 1) / t;  // round up so every element is covered
    saxpy<<<blocks_num, t>>>(n, a, dx, dy);
    cudaMemcpy(hy, dy, vector_size, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
Thread Hierarchy
Explains the organization of threads in CUDA.
Thread – basic execution unit.
Thread block – group of threads sharing shared memory.
Warp – hardware unit of 32 threads that execute the same instruction.
Grid – collection of thread blocks covering the whole computation.
The corresponding AMD/OpenCL terminology is also listed: NDRange ≈ grid, Work Group ≈ thread block, Wavefront ≈ warp (64 threads wide on AMD GCN hardware), Work Item ≈ thread.
SIMT vs SIMD
SIMT is a thread‑level abstraction of SIMD; each warp executes the same instruction on different data. Branch divergence causes some threads to become idle while others execute a different path. Pre‑Volta GPUs used a SIMT stack to record branch convergence points, while Volta and later use stackless reconvergence with convergence barriers.
Instruction Pipeline
Describes the front‑end (fetch, decode, issue) and back‑end (ALU, memory) of a streaming multiprocessor (SM). The scoreboard tracks register write hazards (RAW, WAR, WAW) using a small bitmap per warp; a dependency bit vector determines whether an instruction can be issued. The operand collector gathers required source registers, resolves bank conflicts, and queues accesses when necessary.
Advanced Topics
Independent thread scheduling gives each thread its own program counter, allowing more flexible execution while still scheduling at warp granularity. Convergence barriers (ADD and WAIT) replace the SIMT stack, enabling deep conditional nesting without hardware stack limits.
Register File and Banking
Registers are organized into multiple single‑port banks. Bank conflicts are mitigated by interleaved register allocation, dynamic bank assignment, compiler‑driven register placement, and hybrid banking strategies.
The article concludes with a three‑step approach to understanding GPU architecture: (1) build a minimal functional system, (2) add dynamic instruction scheduling with a scoreboard and operand collector, and (3) improve data supply efficiency using banked register files and modern divergence handling.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.