
Why GPUs Really Matter: From Architecture Basics to CUDA Programming

This article explains why GPUs have become the preferred platform for high‑performance computing, covering Dennard scaling, GPU speed advantages, theoretical FLOPS calculations, CUDA programming examples like SAXPY, the SIMT execution model, instruction pipelines, and modern techniques for handling branch divergence and register bank conflicts.

Tencent Cloud Developer

Introduction

A discussion of GPU price/performance, Dennard scaling and its end, and why GPUs became the preferred platform for high-performance computing.

Why GPUs?

GPUs combine generality and efficiency, offering massive parallelism (SIMT), high memory bandwidth, and specialized units such as Tensor cores.

Theoretical and Measured Performance

Peak FLOPS = CUDA core count × clock frequency × operations per core per cycle (two for a fused multiply-add). Example: the NVIDIA A100 achieves ~19.5 TFLOPS of peak FP32 throughput.
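
As a sanity check against the A100's published specifications (6912 FP32 CUDA cores, ~1.41 GHz boost clock, 2 floating-point operations per core per cycle thanks to the fused multiply-add):

6912 cores × 1.41 × 10⁹ Hz × 2 FLOPs/cycle ≈ 19.5 × 10¹² FLOPS ≈ 19.5 TFLOPS (FP32)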

Dennard Scaling

Dennard scaling: as transistor feature size shrinks, power density stays roughly constant, because supply voltage and current scale down with the linear dimensions, so each transistor draws proportionally less power. In simple terms: smaller transistors use less power, which is what let clock frequencies climb for decades. That scaling broke down in the mid-2000s, when voltages could no longer be lowered and leakage current grew, pushing the industry toward parallel architectures such as GPUs.
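
A back-of-the-envelope version of the argument (the classic constant-field scaling analysis, with scale factor κ): dynamic power per transistor is roughly P ≈ C · V² · f. Shrinking dimensions by 1/κ cuts capacitance C and voltage V by 1/κ each while raising frequency f by κ, so power per transistor falls by 1/κ², exactly offsetting the 1/κ² shrink in its area; power per unit area is unchanged.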

GPU Programming Basics

Below are a CPU implementation of SAXPY (y ← a·x + y) and its CUDA counterpart, illustrating host-device memory allocation, kernel launch syntax, and thread indexing.

// SAXPY: single-precision y = a*x + y, computed serially on the CPU
#include <stdlib.h>

void saxpy(int n, float a, float *x, float *y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    int n = 1 << 20;  // number of elements (example value)
    float a = 2.0f;
    float *x = (float *)malloc(n * sizeof(float));
    float *y = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }  // example initialization
    saxpy(n, a, x, y);
    free(x);
    free(y);
    return 0;
}

// CUDA version of SAXPY: one thread per element
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard: the grid may overshoot n
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    int n = 1 << 20;  // number of elements (example value)
    float a = 2.0f;
    size_t vector_size = n * sizeof(float);
    float *hx = (float *)malloc(vector_size);  // host vectors
    float *hy = (float *)malloc(vector_size);
    for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }  // example initialization
    float *dx, *dy;                            // device vectors
    cudaMalloc(&dx, vector_size);
    cudaMalloc(&dy, vector_size);
    cudaMemcpy(dx, hx, vector_size, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, vector_size, cudaMemcpyHostToDevice);
    int t = 256;                        // threads per block
    int blocks_num = (n + t - 1) / t;   // ceiling division covers all n elements
    saxpy<<<blocks_num, t>>>(n, a, dx, dy);
    cudaMemcpy(hy, dy, vector_size, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    free(hx);
    free(hy);
    return 0;
}
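
A sketch like this omits error checking; production code should inspect the status returned by each CUDA runtime call. Assuming the file is saved as saxpy.cu, it can be built with nvcc saxpy.cu -o saxpy.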

Thread Hierarchy

CUDA organizes threads into a hierarchy:

Thread – basic execution unit.

Thread block – group of threads sharing shared memory.

Warp – hardware unit of 32 threads that execute the same instruction.

Grid – collection of thread blocks covering the whole computation.

AMD/OpenCL terminology maps onto the same hierarchy: NDRange (grid), Work Group (thread block), Wavefront (warp), and Work Item (thread). The sketch below shows how these levels look from inside a kernel.
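
A minimal sketch (the kernel name and launch configuration are illustrative; warpSize, blockIdx, blockDim, and threadIdx are standard CUDA built-ins):

#include <cstdio>

// Illustrative kernel: each thread reports where it sits in the hierarchy.
__global__ void whereami() {
    int lane = threadIdx.x % warpSize;                 // lane within the warp (0..31)
    int warp = threadIdx.x / warpSize;                 // warp index within the block
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;  // thread index within the grid
    if (lane == 0)                                     // one line of output per warp
        printf("block %d, warp %d, global thread %d\n", (int)blockIdx.x, warp, gtid);
}

int main() {
    whereami<<<2, 64>>>();    // grid of 2 blocks, each with 64 threads (2 warps)
    cudaDeviceSynchronize();  // wait for the kernel and flush device printf
    return 0;
}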

SIMT vs SIMD

SIMT is a thread‑level abstraction of SIMD; each warp executes the same instruction on different data. Branch divergence causes some threads to become idle while others execute a different path. Pre‑Volta GPUs used a SIMT stack to record branch convergence points, while Volta and later use stackless reconvergence with convergence barriers.
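
A minimal sketch of a divergent branch (the kernel and the condition are illustrative):

// Illustrative kernel with a data-dependent branch. Lanes of the same warp
// that disagree on the condition execute in two serialized passes, each with
// the other side's lanes masked off, then reconverge.
__global__ void divergent(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (in[i] > 0.0f) {
            out[i] = in[i] * 2.0f;  // taken by some lanes of the warp
        } else {
            out[i] = 0.0f;          // taken by the remaining lanes
        }
    }
}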

Instruction Pipeline

Describes the front end (fetch, decode, issue) and back end (ALU, memory) of a streaming multiprocessor (SM). The scoreboard tracks register hazards (RAW, WAR, WAW) using a small bitmap per warp; a dependency bit vector determines whether an instruction can be issued. The operand collector gathers the required source registers, resolves register-bank conflicts, and queues accesses when necessary.
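
A toy model of the scoreboard idea (a sketch of the concept, not NVIDIA's actual hardware design; all names are illustrative): each warp keeps a bitmap of registers with writes in flight, and an instruction may issue only when none of its operands collide with a pending write.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

// Toy per-warp scoreboard: bit r of pending_writes is set while register r
// has a write in flight (up to 64 registers in this sketch).
typedef struct { uint64_t pending_writes; } Scoreboard;

static void sb_reserve(Scoreboard *sb, int dst) { sb->pending_writes |= 1ull << dst; }
static void sb_release(Scoreboard *sb, int dst) { sb->pending_writes &= ~(1ull << dst); }

// Issue check: no source may collide with a pending write (RAW), and the
// destination may not either (WAW).
static bool sb_can_issue(const Scoreboard *sb, const int *srcs, int nsrc, int dst) {
    uint64_t mask = 1ull << dst;
    for (int i = 0; i < nsrc; i++) mask |= 1ull << srcs[i];
    return (sb->pending_writes & mask) == 0;
}

int main(void) {
    Scoreboard sb = {0};
    int srcs[] = {3, 4};
    sb_reserve(&sb, 3);                                    // e.g. a load writing R3 is in flight
    printf("issue? %d\n", sb_can_issue(&sb, srcs, 2, 5));  // 0: RAW hazard on R3
    sb_release(&sb, 3);                                    // the load writes back
    printf("issue? %d\n", sb_can_issue(&sb, srcs, 2, 5));  // 1: hazard cleared
    return 0;
}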

Advanced Topics

Independent thread scheduling gives each thread its own program counter, allowing more flexible execution while still scheduling at warp granularity. Convergence barriers (ADD and WAIT) replace the SIMT stack, enabling deep conditional nesting without hardware stack limits.
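
One concrete consequence (a sketch; the spin lock and kernel below are illustrative, and the code must target compute capability 7.0 or later): with per-thread program counters, one lane of a warp can hold a lock while its sibling lanes spin on it without deadlocking the warp, a pattern the pre-Volta SIMT stack could not execute safely.

#include <cstdio>

__device__ int lock = 0;  // 0 = free, 1 = held

__global__ void increment(int *counter) {
    // Pre-Volta, a lane spinning here while a sibling lane held the lock
    // could deadlock the warp; with independent thread scheduling every
    // lane keeps making forward progress.
    while (atomicCAS(&lock, 0, 1) != 0) { /* spin */ }
    (*counter)++;             // critical section: one thread at a time
    __threadfence();          // publish the update before releasing
    atomicExch(&lock, 0);     // release the lock
}

int main() {
    int *counter;
    cudaMallocManaged(&counter, sizeof(int));  // unified memory for easy readback
    *counter = 0;
    increment<<<1, 32>>>(counter);             // one warp contending for the lock
    cudaDeviceSynchronize();
    printf("counter = %d\n", *counter);        // expected: 32
    cudaFree(counter);
    return 0;
}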

Register File and Banking

Registers are organized into multiple single‑port banks. Bank conflicts are mitigated by interleaved register allocation, dynamic bank assignment, compiler‑driven register placement, and hybrid banking strategies.
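
As a sketch of the interleaving idea (the 4-bank modulo mapping below is a common textbook layout, not a statement about any specific GPU): if register Ri lives in bank i mod 4, a three-source instruction whose operands all fall in the same residue class must read them serially.

#include <stdio.h>

// Illustrative 4-bank interleaved mapping: register Ri lives in bank i % 4.
static int bank_of(int reg) { return reg % 4; }

int main(void) {
    // FFMA Rd, Ra, Rb, Rc reads three source registers in the same cycle.
    // Sources spread across banks can be fetched in parallel; sources that
    // share a bank force the operand collector to serialize the reads.
    int spread[3]    = {5, 6, 7};  // banks 1, 2, 3 -> conflict-free
    int clustered[3] = {1, 5, 9};  // banks 1, 1, 1 -> reads serialized
    for (int i = 0; i < 3; i++)
        printf("R%d -> bank %d, R%d -> bank %d\n",
               spread[i], bank_of(spread[i]), clustered[i], bank_of(clustered[i]));
    return 0;
}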

[Figure: GPU architecture diagram]
[Figure: SIMT pipeline diagram]

The article concludes with a three-step approach to understanding GPU architecture: (1) build a minimal functional system, (2) add dynamic instruction scheduling with a scoreboard and operand collector, and (3) improve data-supply efficiency using banked register files and modern divergence handling.

Parallel Computing · GPU Performance · SIMT · CUDA Programming · GPU Architecture
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
