Game Development · 27 min read

GPU Architecture and Rendering Pipeline Overview

This article surveys modern GPU architecture (GPCs, SMs, the memory hierarchy, unified shaders, SIMT execution, and warp scheduling), compares the IMR, TBR, and TBDR rendering pipelines, and closes with practical optimization tips for developers.


Modern GPUs consist of multiple Graphics Processing Clusters (GPCs), each containing several Streaming Multiprocessors (SMs) that house compute cores, registers, caches, and specialized units such as the PolyMorph Engine and Raster Engine.

The GPU memory hierarchy runs from fast to slow: registers → shared memory/L1 cache → L2 cache → texture/constant cache → global DRAM, with access latencies ranging from roughly 1 cycle for registers to 400‑600 cycles for global memory.
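To see why those latency numbers matter, the back-of-envelope calculation below (a Python sketch; the 400‑600 cycle figures come from the hierarchy above, while the 4 cycles of independent work per warp is our own illustrative assumption) estimates how many resident warps an SM would need to keep busy while one warp waits on a global-memory load.

```python
# Rough occupancy estimate: how many warps must be resident so that
# independent arithmetic from other warps covers a global-memory stall?
# Numbers are illustrative, not tied to any specific GPU.

GLOBAL_MEM_LATENCY = 400   # cycles for a global DRAM access (lower bound)
WORK_PER_WARP = 4          # assumed cycles of independent work per warp

def warps_to_hide_latency(mem_latency, work_per_warp):
    """Warps needed so their combined independent work covers the stall."""
    # Ceiling division: each extra warp contributes work_per_warp cycles.
    return -(-mem_latency // work_per_warp)

print(warps_to_hide_latency(GLOBAL_MEM_LATENCY, WORK_PER_WARP))  # 100
print(warps_to_hide_latency(600, WORK_PER_WARP))                 # 150
```

The takeaway is qualitative rather than exact: hundreds of cycles of memory latency demand many concurrent warps, which is why register pressure (which limits resident warps) shows up later in the optimization tips.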

The unified shader architecture replaces separate vertex and pixel shaders with a single programmable core that can execute any shader stage, improving utilization and power efficiency.

Execution follows the Single Instruction, Multiple Threads (SIMT) model: one instruction is issued to a warp (typically 32 threads) whose threads execute it in lock‑step on different data, and the Warp Scheduler decides which resident warp issues next.

When a warp encounters a costly operation (e.g., memory load), the scheduler switches to another ready warp to hide latency.
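This latency-hiding behavior can be modeled with a toy scheduler (an illustrative Python sketch, not real hardware: the round-robin policy, single issue slot per cycle, and 4-cycle stall are all our assumptions).

```python
# Toy round-robin warp scheduler: each cycle, issue from the first warp
# that is not stalled on a memory load; a warp that issues a load then
# stalls for mem_latency cycles before it is ready again.

def simulate(num_warps, cycles, mem_latency=4):
    stalled_until = [0] * num_warps   # cycle at which each warp is ready
    issued = 0
    for cycle in range(cycles):
        for w in range(num_warps):
            if stalled_until[w] <= cycle:
                issued += 1
                stalled_until[w] = cycle + mem_latency  # warp stalls on its load
                break                                   # one issue slot per cycle
    return issued

# With enough warps, some warp is always ready and every cycle issues;
# with a single warp, the pipeline idles through each stall.
print(simulate(num_warps=4, cycles=16))  # 16 issues: latency fully hidden
print(simulate(num_warps=1, cycles=16))  # 4 issues: pipeline idles 12 cycles
```

With four warps and a four-cycle stall, a warp is always ready, so utilization is 100%; with one warp, three of every four cycles are wasted.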

Rendering pipelines differ by platform:

Immediate Mode Rendering (IMR): each draw call writes directly to the framebuffer, leading to high bandwidth usage.

Tile‑Based Rendering (TBR): geometry is rasterized into on‑chip tiles, reducing memory traffic and power consumption.

Tile‑Based Deferred Rendering (TBDR): combines tiling with deferred shading to further lower overdraw.
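The bandwidth gap between these pipelines can be sketched with a toy overdraw model (illustrative Python under our own simplifications: one tile, fully opaque layers, and only framebuffer writes counted). IMR sends every shaded fragment to external memory, while TBDR resolves visibility in on-chip tile memory and flushes each pixel to external memory once.

```python
# Toy model: count external-memory writes for one tile of pixels when
# several opaque layers cover every pixel (overdraw).

def imr_writes(pixels, layers):
    # Immediate-mode: every fragment of every layer hits the framebuffer.
    return pixels * layers

def tbdr_writes(pixels, layers):
    # Tile-based deferred: visibility is resolved in on-chip tile memory,
    # so only the final color of each pixel reaches external memory.
    return pixels

tile = 16 * 16          # a 16x16 on-chip tile
overdraw = 4            # four opaque layers cover each pixel
print(imr_writes(tile, overdraw))   # 1024 external writes
print(tbdr_writes(tile, overdraw))  # 256 external writes
```

In this simplified model, external traffic scales with overdraw on IMR but stays constant on TBDR, which is one reason tiling architectures dominate power-constrained mobile GPUs.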

Optimization tips include minimizing register usage, exploiting SIMD and co‑issue capabilities, avoiding heavily divergent branching (which forces masking), batching to reduce draw calls, enabling early‑Z rejection, and managing tile clears and discards efficiently.
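The masking cost behind the branching tip can be illustrated on the CPU (a Python sketch; the per-lane select mirrors what SIMT hardware does, but the lane values and toy condition are our own). When lanes of a warp disagree on a branch, the hardware executes both sides and uses a mask to select each lane's result, so a divergent branch pays for both paths.

```python
# SIMT-style branch masking: all lanes run BOTH sides of the branch;
# a per-lane mask then selects which result each lane keeps.

def simt_branch(values):
    mask = [v > 0 for v in values]       # per-lane branch condition
    then_side = [v * 2 for v in values]  # executed by ALL lanes
    else_side = [-v for v in values]     # also executed by ALL lanes
    # Per-lane select: total work is both paths, not just the taken one.
    return [t if m else e for m, t, e in zip(mask, then_side, else_side)]

print(simt_branch([3, -1, 0, 5]))  # [6, 1, 0, 10]
```

If every lane takes the same path, real hardware can skip the untaken side entirely, which is why keeping branch conditions coherent across a warp is cheaper than truly divergent control flow.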

Example shader code demonstrating vector addition and SIMD usage:

float4 c = a + b; // SIMD addition

Co‑issue example merging two scalar instructions into one vector instruction:

// before co‑issue
ADD r1, a, b
ADD r2, c, d
// after co‑issue: both independent sums retire in one vector slot (pseudocode)
ADD_VEC (r1, r2), (a, c), (b, d)

Understanding these hardware details helps developers write performant graphics code for both desktop (IMR) and mobile (TBR/TBDR) platforms.

Tags: performance, graphics, Optimization, Rendering, GPU, Shader
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
