Fundamentals of GPU Architecture and Programming
The article explains GPU fundamentals—from the end of Dennard scaling and why GPUs excel in parallel throughput, through CUDA programming basics like the SAXPY kernel and SIMT versus SIMD execution, to the evolution of the SIMT stack, modern scheduling, and a three‑step core architecture design.
Author: leowwlwang
"How much did you pay for your 4090?" "The H100's performance is amazing!" GPU price-performance is a hot topic, but beyond the benchmark numbers the author wants to understand the underlying principles. This article compiles the author's notes on how GPUs operate, their programming model, and their architectural design, to share with readers.
1. Introduction
Why GPU?
GPUs are chosen not only for raw speed but also for their combination of generality and efficiency, making them the preferred platform for high‑performance computing.
In 1974 Dennard et al. formulated what became known as Dennard scaling: as transistor feature size shrinks, power density stays constant. Concretely, supply voltage falls linearly with feature size and current density remains stable, so power per unit area is independent of transistor size, allowing more transistors on a chip without increasing overall power consumption.
From the 1970s to the early 2000s, Dennard scaling drove performance growth. Around 2005-2007, leakage currents (driven in part by quantum tunneling at tiny feature sizes) rose sharply, the scaling law broke down, and the "power wall" appeared.
Specialized hardware such as Google’s TPU emerged, but they target narrow workloads. GPUs, originally designed for graphics, evolved a highly parallel SIMT (Single Instruction, Multiple Threads) architecture that now serves general‑purpose computing, from CUDA‑based deep‑learning training to OpenCL‑accelerated fluid simulation.
GPU’s "Speed"
Why is it fast?
High compute density: a larger share of the die is devoted to stream processors (CUDA cores) and less to control logic, yielding higher throughput per unit area.
Memory latency hiding: by keeping many threads resident per core, the GPU hides global-memory latency; when a warp stalls on a memory access, the scheduler switches to another ready warp.
Specialized memory and compute units: high-bandwidth memory (GDDR6, HBM) and dedicated units such as Tensor Cores accelerate specific workloads.
How fast?
Theoretical FLOPS: typically expressed in TFLOPS, using the common CUDA-core formula.
# CUDA-core method
FLOPS = number of CUDA cores × boost clock × floating-point ops per core per cycle
# Taking the A100 as an example
A100 FP32 throughput = 6912 × 1.41 GHz × 2 (an FMA counts as 2 ops) = 19491.84 GFLOPS ≈ 19.5 TFLOPS

Practical measurement: tools such as GPU-Z (real-time monitoring) and 3DMark (benchmarking) are recommended. A large GEMM (general matrix multiplication) can also be used to approach the theoretical peak.
FLOPS = total operations / execution time; for C = A(M, K) × B(K, N), the operation count is 2 × M × N × K, so FLOPS = 2 × M × N × K / T

2. GPU Programming
This section shows how to write non‑graphics GPU code, focusing on the programming model rather than a tutorial.
Program execution example: SAXPY
SAXPY computes y = a·x + y. First, the CPU version in C:
// SAXPY implementation (CPU)
void saxpy(int n, float a, float *x, float *y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    float a = 2.0f;
    int n;
    float *x; // host vector x
    float *y; // host vector y
    // ... allocate, fill, set n ...
    saxpy(n, a, x, y);
    return 0;
}

Converted to CUDA to run on the GPU:
__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    float a = 2.0f;
    int n;
    float *hx; // host x
    float *hy; // host y
    // ... allocate, fill, set n ...
    int vector_size = n * sizeof(float);
    float *dx; // device x
    float *dy; // device y
    cudaMalloc(&dx, vector_size);
    cudaMalloc(&dy, vector_size);
    cudaMemcpy(dx, hx, vector_size, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, vector_size, cudaMemcpyHostToDevice);
    int t = 256; // threads per block
    int blocks_num = (n + t - 1) / t; // round up so every element is covered
    saxpy<<<blocks_num, t>>>(n, a, dx, dy);
    cudaMemcpy(hy, dy, vector_size, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return 0;
}

The __global__ qualifier marks a kernel that runs on the device; main runs on the host. cudaMalloc and cudaMemcpy manage GPU memory, and cudaFree releases it. Unified memory (cudaMallocManaged) is a newer alternative that lets host and device share a single pointer.
Thread organization
CUDA uses a hierarchy: thread → thread block → grid. A warp (32 threads) is the hardware execution unit. The article shows a 3‑D launch example:
// host-side launch code
void launch_kernel_3d() {
    int dimX = 64, dimY = 32, dimZ = 16;
    dim3 blockSize(8, 4, 4); // 128 threads per block
    dim3 gridSize((dimX + blockSize.x - 1) / blockSize.x,
                  (dimY + blockSize.y - 1) / blockSize.y,
                  (dimZ + blockSize.z - 1) / blockSize.z);
    kernel_3d<<<gridSize, blockSize>>>(d_data, dimX, dimY, dimZ);
}

__global__ void kernel_3d(float* data, int dimX, int dimY, int dimZ) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < dimX && y < dimY && z < dimZ) {
        int idx = x + y * dimX + z * dimX * dimY; // linearize 3-D coordinates
        data[idx] *= 2.0f;
    }
}

SIMT vs SIMD
SIMT (Single Instruction, Multiple Threads) abstracts SIMD (Single Instruction, Multiple Data) by assigning each thread its own registers and program counter, while hardware still executes groups of 32 threads (warps) in lock‑step.
Branch divergence occurs when threads in the same warp take different paths. The GPU serializes the divergent paths, using a mask to keep inactive threads idle.
SIMT Stack (pre‑Volta)
Early GPUs used a hardware stack to track branch convergence points. The stack stores:
Convergence PC (where the paths re‑join)
Next‑PC for each path
Active mask (which threads are active for the path)
When a divergence is encountered, the convergence point is pushed, then each path is pushed. Execution proceeds depth‑first, popping when a path finishes.
Problems with the stack
Fixed depth limits (4‑8 levels) cannot handle deep control flow.
Each warp needs its own stack, consuming registers.
Frequent push/pop adds latency.
LIFO order can cause load imbalance or deadlock.
Independent Thread Scheduling (Volta and later)
From Volta onward each thread has its own PC and active flag, allowing divergent threads to progress simultaneously while still being scheduled in warp granularity.
Stackless Branch Reconvergence
Volta replaces the stack with convergence barriers. Threads register their participation (ADD) and later wait (WAIT) at the reconvergence point. This uses only a 32‑bit mask per warp, eliminating depth limits and reducing hardware cost.
3. Three‑step GPU core design
Step 1 – Minimal usable system
Components: Fetch, Decode, SIMT Stack, Issue, ALU, MEM. This executes one instruction at a time.
Step 2 – Dynamic instruction scheduling
Introduce I‑Cache, I‑Buffer, and a ScoreBoard to track RAW/WAR/WAW hazards. The ScoreBoard records which registers are pending writes; an instruction can be issued only when its source registers are free.
// Simplified scoreboard operation (conceptual)
if (((srcRegs | dstReg) & ScoreBoard) == 0) { // no pending write on any
                                              // source (RAW) or the destination (WAW)
    issueInstruction();
    ScoreBoard |= dstReg;  // mark destination as pending
}
// on write-back
ScoreBoard &= ~dstReg;     // clear the pending bit

Step 3 – High‑throughput data supply
Register files are banked. Each bank is single‑ported; multiple banks allow parallel accesses. Bank conflicts are mitigated by:
Interleaved register allocation
Dynamic bank allocation
Compiler‑driven register assignment
Hybrid banking schemes
When conflicts still occur, the Operand Collector buffers pending reads and schedules them in later cycles, ensuring the ALU receives all operands without stalling the pipeline.
4. Summary
The article traced GPU evolution from the end of Dennard scaling to modern high‑performance GPUs, explained why GPUs are fast, introduced CUDA programming with a SAXPY example, compared SIMT and SIMD, detailed the historical SIMT stack and its modern replacements, and finally built a conceptual three‑step GPU core architecture covering instruction fetch, dependency tracking, and banked register access.
These fundamentals provide a solid foundation for readers to explore deeper GPU topics such as ray tracing, AI inference, and upcoming architectural innovations.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.