GPU Architecture and Rendering Pipeline Overview
This article provides an overview of modern GPU architecture, covering Graphics Processing Clusters (GPCs), Streaming Multiprocessors (SMs), the memory hierarchy, the unified shader architecture, SIMT execution, and warp scheduling; it then compares the IMR, TBR, and TBDR rendering pipelines and offers practical optimization techniques for developers.
Modern GPUs consist of multiple Graphics Processing Clusters (GPCs), each containing several Streaming Multiprocessors (SMs) that house compute cores, registers, caches, and specialized units such as the PolyMorph Engine and Raster Engine.
Memory hierarchy in a GPU follows a fast‑to‑slow order: registers → shared memory/L1 cache → L2 cache → texture/constant cache → global DRAM, with access latencies ranging from 1 cycle for registers to 400‑600 cycles for global memory.
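The latency figures above suggest a simple back-of-the-envelope calculation: how many resident warps an SM needs so that work from other warps covers one warp's global-memory stall. This is a minimal sketch with an illustrative work-per-warp parameter (the function name and numbers are assumptions, not from any vendor documentation):

```c
#include <assert.h>

/* Rough latency-hiding estimate: how many warps are needed so that,
 * while one warp waits on a global-memory access, the others supply
 * enough independent arithmetic to keep the SM busy.
 * mem_latency_cycles: stall length (e.g., 400-600 cycles for DRAM).
 * work_cycles_per_warp: independent work each other warp can issue. */
static int warps_to_hide(int mem_latency_cycles, int work_cycles_per_warp)
{
    /* Ceiling division: round up so the latency is fully covered. */
    return (mem_latency_cycles + work_cycles_per_warp - 1)
           / work_cycles_per_warp;
}
```

For example, with a 400-cycle stall and 20 cycles of independent work per warp, about 20 resident warps are needed, which is why occupancy matters for memory-bound kernels.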
The unified shader architecture replaces separate vertex and pixel shaders with a single programmable core that can execute any shader stage, improving utilization and power efficiency.
Execution follows the Single Instruction, Multiple Threads (SIMT) model: one instruction is issued to an entire warp (typically 32 threads), whose threads execute it in lock-step on different data, while the warp scheduler selects among resident warps each cycle.
When a warp encounters a costly operation (e.g., memory load), the scheduler switches to another ready warp to hide latency.
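This latency-hiding behavior can be sketched as a toy scheduler model. The sketch below is a hypothetical simulation, not real hardware behavior: every instruction is treated as a load that stalls its warp for a fixed latency, and each cycle the scheduler issues from the first ready warp (real schedulers use more sophisticated policies such as greedy-then-oldest):

```c
#include <assert.h>

#define NUM_WARPS    4   /* resident warps in this toy model */
#define LOAD_LATENCY 8   /* cycles a warp stalls after issuing */

typedef struct { int stall; } warp_t;

/* Return index of the first warp with no outstanding stall, or -1
 * if every warp is waiting (a bubble cycle: the SM sits idle). */
static int pick_ready_warp(const warp_t warps[], int n)
{
    for (int i = 0; i < n; i++)
        if (warps[i].stall == 0)
            return i;
    return -1;
}

/* Simulate `cycles` scheduler cycles; return instructions issued. */
static int run_cycles(int cycles)
{
    warp_t warps[NUM_WARPS] = {{0}};
    int issued = 0;
    for (int c = 0; c < cycles; c++) {
        int w = pick_ready_warp(warps, NUM_WARPS);
        if (w >= 0) {
            warps[w].stall = LOAD_LATENCY;  /* every instr. stalls here */
            issued++;
        }
        for (int i = 0; i < NUM_WARPS; i++)  /* time passes for waiters */
            if (warps[i].stall > 0 && i != w)
                warps[i].stall--;
    }
    return issued;
}
```

With 4 warps and an 8-cycle stall, only 4 instructions issue per 9 cycles; the model shows why an SM needs enough resident warps relative to memory latency to avoid bubble cycles.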
Rendering pipelines differ by platform:
Immediate Mode Rendering (IMR): each draw call is shaded and written directly to the framebuffer in DRAM, leading to high bandwidth usage.
Tile-Based Rendering (TBR): geometry is binned and rasterized into small on-chip tiles, reducing memory traffic and power consumption.
Tile-Based Deferred Rendering (TBDR): combines tiling with deferred shading (hidden-surface removal before pixel shading) to further lower overdraw.
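The bandwidth difference between these pipelines can be made concrete with a back-of-the-envelope model of color-write traffic. This is an illustrative sketch only (it ignores compression, depth traffic, and tile-resolve overhead; the function names are made up for this example):

```c
#include <assert.h>

/* Approximate DRAM traffic for color writes, in bytes.
 * IMR: every shaded fragment is written to the framebuffer in DRAM,
 * so overdrawn pixels pay the external-memory cost repeatedly. */
static long imr_color_traffic(long pixels, int overdraw, int bytes_per_pixel)
{
    return pixels * overdraw * (long)bytes_per_pixel;
}

/* TBR: overdraw is absorbed by the on-chip tile buffer; each pixel is
 * resolved to DRAM roughly once per frame. */
static long tbr_color_traffic(long pixels, int bytes_per_pixel)
{
    return pixels * (long)bytes_per_pixel;
}
```

At 1920x1080 with 3x overdraw and 4 bytes per pixel, the IMR model writes about 24.9 MB per frame versus 8.3 MB for TBR, which is why tiling dominates on bandwidth- and power-constrained mobile GPUs.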
Optimization tips include minimizing register usage (to raise occupancy), leveraging SIMD and co-issue capabilities, avoiding excessive branching (divergent branches execute under lane masks, so both paths cost time), reducing draw calls, using early-Z rejection, and managing tile clears/discards efficiently.
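The cost of branching under SIMT can be sketched in plain C. In this hypothetical model, a divergent if/else is executed by running both paths over the whole warp, with a per-lane predicate mask deciding which lanes commit results, so masked-off lanes still consume execution cycles:

```c
#include <assert.h>

#define WARP_SIZE 32

/* Emulate a divergent branch on one warp:
 *   if (in[lane] is even) out = in * 2;  else out = in + 1;
 * Both paths run over all lanes; the mask selects who commits. */
static void simt_branch(const int in[WARP_SIZE], int out[WARP_SIZE])
{
    int mask[WARP_SIZE];
    for (int lane = 0; lane < WARP_SIZE; lane++)
        mask[lane] = (in[lane] % 2 == 0);     /* per-lane predicate */

    for (int lane = 0; lane < WARP_SIZE; lane++)   /* "then" path */
        if (mask[lane])
            out[lane] = in[lane] * 2;

    for (int lane = 0; lane < WARP_SIZE; lane++)   /* "else" path */
        if (!mask[lane])
            out[lane] = in[lane] + 1;
}
```

Because both loops always execute, a warp whose lanes split across the branch pays for both sides; keeping all 32 lanes on the same path avoids that penalty.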
Example shader code demonstrating vector addition and SIMD usage:
float4 c = a + b; // SIMD addition

Co-issue example merging two scalar instructions into one vector instruction:
// before co‑issue
ADD r1, a, b
ADD r2, c, d
// after co‑issue
ADD_VEC r1, a, b, c, d

Understanding these hardware details helps developers write performant graphics code for both desktop (IMR) and mobile (TBR/TBDR) platforms.
Architects' Tech Alliance