
CUDA Optimization Basics: Understanding GPU Architecture and Warp Scheduling

This article explains the fundamentals of CUDA performance tuning. It covers GPU architectures from Kepler to Volta, the role of the SMX, warp schedulers, registers, and the memory hierarchy, and gives practical guidance on launch configuration, latency hiding, and thread-block sizing to maximize throughput.


Core CUDA Optimization Topics

Three primary factors affect CUDA performance: kernel launch configuration (launching enough threads), global-memory throughput, and shared-memory utilization. The concepts are illustrated with CUDA C++ but apply to other CUDA-enabled languages as well.

Kepler SMX Architecture

SMX (Streaming Multiprocessor X): Kepler's enhanced SM that executes warps in parallel; more SMX units increase overall performance.

192 SP units: single-precision / integer cores per SMX.

64 DP units: double-precision cores for scientific workloads.

LD/ST units: handle memory load and store operations.

64 K registers per SMX (65,536 × 32-bit = 256 KB) provide fast private storage for threads.

4 warp schedulers with dual‑issue capability; each can dispatch up to two non‑conflicting instructions per cycle.

Example GPUs: K20 (13 SMX, 5 GB), K20X (14 SMX, 6 GB), K40 (15 SMX, 12 GB).
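
These per-SM resources can be read at run time through the CUDA runtime API. The short host program below is a sketch (device 0 is assumed) that prints the fields most relevant to this discussion via cudaGetDeviceProperties.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // assume device 0
    printf("SM count:            %d\n",  prop.multiProcessorCount);
    printf("Registers per SM:    %d x 32-bit\n", prop.regsPerMultiprocessor);
    printf("Shared memory / SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Max threads per SM:  %d\n",  prop.maxThreadsPerMultiProcessor);
    printf("Warp size:           %d\n",  prop.warpSize);
    return 0;
}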

Instruction Cache, Register File and Execution Units

Instruction cache stores pending kernel instructions.

Register file (65,536 × 32-bit registers) offers low-latency private storage for each thread.

Execution units:

Core (SP) – single‑precision / integer arithmetic.

DP unit – double‑precision arithmetic.

SFU – special functions (trigonometry, square‑root).

LD/ST unit – memory load/store.

Shared memory / L1 cache (64 KB) can be partitioned (e.g., 48 KB shared + 16 KB L1) for fast intra‑block communication and global‑memory acceleration.

Read‑only data cache (48 KB) speeds up constant and texture memory accesses, useful for immutable data such as neural‑network weights.

Texture units process texture‑mapping operations, converting data from texture memory before passing it to the SPs.
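
Both the shared-memory/L1 split and the read-only data cache can be influenced from code. The sketch below is illustrative (the scale kernel and its parameters are assumptions): cudaFuncSetCacheConfig requests the larger shared-memory partition, and a const __restrict__ pointer together with __ldg routes loads through the read-only cache on compute capability 3.5 and later.

// Device side: mark the input read-only so loads can go through the read-only data cache.
__global__ void scale(const float* __restrict__ in, float* out, float s, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = __ldg(&in[idx]) * s;    // load via the read-only data cache
}

// Host side: prefer a 48 KB shared-memory / 16 KB L1 split for this kernel.
void configureCache() {
    cudaFuncSetCacheConfig(scale, cudaFuncCachePreferShared);
}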

Maxwell / Pascal / Volta Enhancements

Pascal compute capability 6.1 INT8: hardware acceleration for 8-bit integer inference, reducing compute and memory cost for AI workloads.

FP16 at 2× SP rate: half-precision operations execute twice as fast as single-precision on the same SP cores.

Volta Tensor Cores: dedicated 4×4 matrix multiply-accumulate units that take FP16 inputs and accumulate in mixed precision, built for deep-learning acceleration.

Separate INT32 units in Volta: integer and floating-point calculations can run concurrently, improving resource utilization.
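
A rough sketch of how these features appear in device code, assuming compute capability 6.0+ for the half2 arithmetic and 6.1+ for __dp4a; the kernel names and the packed data layout are illustrative choices, not taken from the article.

#include <cuda_fp16.h>

// FP16: one half2 instruction performs two half-precision multiplies.
__global__ void mulHalf2(const __half2* a, const __half2* b, __half2* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        c[idx] = __hmul2(a[idx], b[idx]);
}

// INT8: __dp4a multiplies four packed 8-bit values from each operand and adds an accumulator.
__global__ void dot8(const int* a, const int* b, int* acc, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        acc[idx] = __dp4a(a[idx], b[idx], acc[idx]);
}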

Execution Model and Latency Hiding

Threads are grouped into warps of 32 threads. A warp scheduler selects a ready warp each clock cycle; if a warp stalls because an operand is not ready (e.g., waiting for a global‑memory load), the scheduler switches to another warp, keeping the SM busy.

Load instructions are asynchronous and non‑blocking; subsequent independent instructions can issue immediately.

Dual‑issue warp schedulers can dispatch two non‑conflicting instructions (e.g., one arithmetic and one memory operation) to different execution units in the same cycle.

Typical global-memory latency is several hundred clock cycles, while arithmetic-pipeline latency is on the order of tens of cycles, so memory stalls are the main latency the scheduler has to hide.

Example: Hiding Global‑Memory Load Latency

__global__ void vecMul(const float* a, const float* b, float* c) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;   // global element index
    c[idx] = a[idx] * b[idx];                          // one multiply per thread
}

The compiled SASS (GPU assembly) reduces to the following simplified sequence:

I0: LD R0, a[idx];
I1: LD R1, b[idx];
I2: MPY R2, R0, R1;

The first two loads issue asynchronously; while the data is being fetched, the warp scheduler can switch to another warp. Once the loads complete, the multiply instruction executes, demonstrating latency hiding.

Compilation Flow

CUDA C code: the developer-written parallel kernel, e.g., c[idx] = a[idx] * b[idx];

PTX intermediate: a portable virtual GPU ISA, e.g., mul.f32 %r2, %r0, %r1;

SASS machine code: hardware-specific instructions executed by the GPU, e.g., MPY R2, R0, R1.

The PTX for a kernel can be inspected with nvcc --ptx, and the final SASS with cuobjdump --dump-sass on the compiled binary.

Strategies to Hide Latency

Launch enough warps: a typical SM can host 32–64 resident warps (1,024–2,048 threads), depending on architecture. More warps increase the chance of finding a ready warp each cycle.

Utilize L1/L2 caches to reduce global‑memory access time.

Distribute independent instructions across different warp schedulers so they can issue simultaneously.

Hiding Arithmetic Latency

When two instructions are independent, the scheduler can issue them back‑to‑back, overlapping their pipelines. A timeline comparison shows that dependent streams leave execution units idle, whereas independent streams keep both pipelines busy.
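
One simple way to create such independent instruction streams is to let each thread process two elements, so its loads and multiplies form two chains with no dependences between them. A minimal sketch (the two-elements-per-thread mapping is an assumption, not from the article):

__global__ void vecMul2(const float* a, const float* b, float* c, int n) {
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i + 1 < n) {
        float a0 = a[i],     b0 = b[i];        // stream 0
        float a1 = a[i + 1], b1 = b[i + 1];    // stream 1, independent of stream 0
        c[i]     = a0 * b0;                    // can issue while stream 1's loads are in flight
        c[i + 1] = a1 * b1;
    }
}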

Maximizing Global Memory Throughput

Access pattern: coalesced, sequential accesses yield high throughput; random or scattered accesses degrade performance.

Data size: wider vector types (e.g., float2) let each thread move twice as many bytes per load instruction as float, halving the number of memory requests for the same data and improving bandwidth utilization.

Bus saturation: achieve high occupancy so many memory requests are in flight, keeping the memory controller busy.

Thread‑level and warp‑level concurrency both contribute to saturating the bus.

An experiment on a 64 M‑element array performing load‑increment‑store shows that increasing the number of active threads raises measured throughput, and that using wider vector types reduces the number of transactions required.
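
A sketch of that load-increment-store kernel in scalar and float2 form (the kernel names and the assumption that the element count is even are mine); the float2 version issues half as many load and store instructions for the same amount of data.

__global__ void incFloat(float* data, int n) {          // one 4-byte load + store per thread
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] += 1.0f;
}

__global__ void incFloat2(float2* data, int n2) {       // n2 = n / 2; one 8-byte load + store per thread
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n2) {
        float2 v = data[idx];
        v.x += 1.0f;
        v.y += 1.0f;
        data[idx] = v;
    }
}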

Thread‑Block Configuration Guidelines

Block size should be a multiple of the warp size (32) to avoid idle lanes.

Each SM can concurrently run at least 16 blocks (up to 32 on newer architectures); more blocks improve resource utilization.

Typical block sizes are 128–256 threads, adjusted for memory‑intensive versus compute‑intensive kernels.

Target 512–2048 active threads per SM (ideally 2048) to hide latency effectively.
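
Putting these guidelines together, a typical host-side launch derives the grid size from the problem size and a block size that is a multiple of 32. The sketch below reuses the vecMul kernel from the earlier example; the 256-thread block size and the assumption that n is a multiple of the block size (the kernel has no bounds check) are illustrative choices.

void launchVecMul(const float* d_a, const float* d_b, float* d_c, int n) {
    const int blockSize = 256;                               // multiple of the 32-thread warp size
    const int gridSize  = (n + blockSize - 1) / blockSize;   // enough blocks to cover every element
    vecMul<<<gridSize, blockSize>>>(d_a, d_b, d_c);
}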

Occupancy and Its Limits

Occupancy measures the ratio of active warps to the maximum possible on an SM. It is limited by per‑thread register usage, shared‑memory consumption, and block size. NVIDIA provides an occupancy‑calculator spreadsheet to evaluate and improve this metric.
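
Beyond the spreadsheet, the runtime API can report the same figure programmatically. A minimal sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor for the vecMul kernel above (the 256-thread block size and zero dynamic shared memory are assumptions):

#include <cstdio>

void reportOccupancy() {
    const int blockSize = 256;
    int blocksPerSM = 0;
    // Resident blocks of vecMul per SM at this block size, given its register/shared-memory use.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, vecMul, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (float)(blocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);
}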

Performance‑Driven Optimization Workflow

Use profiling tools such as Nsight Compute to identify whether compute or memory subsystems are saturated, then apply targeted optimizations based on the observed bottlenecks.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Performance Optimization, CUDA, GPU architecture, memory latency, warp scheduling, register usage, thread configuration