
Unveiling GPU Architecture: From Compute Units to Rendering Pipelines

This article provides a comprehensive technical overview of modern GPU architecture, covering memory hierarchy, compute units, shader execution, rendering pipelines, and performance‑optimisation techniques such as unified shaders, SIMT, warp scheduling, and tile‑based rendering strategies.

Architects' Tech Alliance

Oct 16, 2024

GPU Architecture Overview

Modern NVIDIA GPUs consist of a large global memory (DRAM) and many compute units called Streaming Multiprocessors (SMs). The SMs are grouped into Graphics Processing Clusters (GPCs), each containing several SMs and a raster engine. The number of GPCs and SMs varies by architecture and by chip (e.g., the Maxwell GM204 has 4 GPCs and the Turing TU102 has 6).

SM Internal Structure (Fermi Example)

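PolyMorph Engine : Fixed-function geometry front end that handles vertex fetch, tessellation, stream output, viewport transform, and attribute setup (the five stages listed next).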

Vertex Fetch : Retrieves vertex attributes using triangle indices.

Tessellator : Performs DX11 surface subdivision.

Stream Output : Implements DX10 stream‑output.

Viewport Transform : Clips and maps vertices to screen space.

Attribute Setup : Interpolates vertex attributes for pixel processing.

Core (Stream Processor) : 32 arithmetic units per SM, scheduled by a Warp Scheduler and fed by Dispatch Units.

Warp Scheduler : Selects a ready warp (a group of 32 threads) and issues its next instruction; the threads of a warp execute in lock-step.

Instruction Cache : Holds fetched shader instructions.

Special Function Unit (SFU) : Executes math functions such as pow, sin, cos.

Load/Store (LD/ST) : Accesses shared memory or global memory.

L1 Cache / Shared Memory : Fast on‑chip storage, sometimes shared with texture cache.

Uniform Cache : Holds constant data.

Texture Unit & Texture Cache : Performs texture sampling.

Interconnect Network : Routes data between GPCs.

GPU Memory Hierarchy

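From fastest to slowest: Registers → Shared Memory / L1 Cache → L2 Cache → Global Memory (DRAM). Texture and constant data also live in DRAM and are read through small dedicated on-chip caches, so a miss in those caches costs about as much as a global-memory access. Access latency ranges from roughly 1 cycle (registers) to 400-600 cycles (global memory), depending on the architecture.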
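To see why the upper levels of the hierarchy matter, the arithmetic can be sketched as an average access latency. This is a simplified model under stated assumptions: the hit rates and the two cache latencies are hypothetical, each access is charged only the latency of the level that serves it, and the function name `expected_latency` is made up for illustration.

```python
def expected_latency(levels):
    """Average access latency for a lookup chain.

    levels: list of (hit_rate, latency_cycles) in lookup order; the final level
    must have hit_rate 1.0 so every access is eventually served.
    """
    remaining = 1.0   # fraction of accesses that reach the current level
    total = 0.0
    for hit_rate, latency in levels:
        total += remaining * hit_rate * latency
        remaining *= 1.0 - hit_rate
    return total


# Hypothetical hit rates and cache latencies; only the 400-cycle DRAM figure echoes the text.
mostly_cached = [(0.5, 32), (0.5, 64), (1.0, 400)]
always_dram = [(0.0, 32), (0.0, 64), (1.0, 400)]
print(expected_latency(mostly_cached), expected_latency(always_dram))
```

Even with modest hit rates, the average cost in this toy model falls to about a third of the all-DRAM case.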

Hardware‑Centric Rendering Pipeline

The CPU builds draw data (vertex buffers, render state) and submits command buffers to the GPU driver. Commands are placed in a ring buffer that the GPU front‑end consumes. When the ring buffer fills, the CPU stalls until the GPU processes commands.
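The back-pressure described above can be sketched with a fixed-capacity queue. This is a toy model, not a driver API: the class name `CommandRing`, its capacity, and the command strings are all hypothetical, and the CPU "wait" is modelled by letting the GPU consume one command.

```python
from collections import deque


class CommandRing:
    """Toy model of a fixed-size command ring buffer (hypothetical, not a real driver API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = deque()
        self.cpu_stalls = 0  # how many times the CPU had to wait for free space

    def submit(self, command):
        """CPU side: enqueue a command, waiting for the GPU if the ring is full."""
        while len(self.slots) >= self.capacity:
            self.cpu_stalls += 1
            self.consume()  # in this model the CPU waits for exactly one GPU step
        self.slots.append(command)

    def consume(self):
        """GPU front end: take the oldest command, or None if the ring is empty."""
        return self.slots.popleft() if self.slots else None


ring = CommandRing(capacity=4)
for i in range(6):
    ring.submit(f"draw_{i}")

print(ring.cpu_stalls)   # number of stalls seen by the CPU
print(list(ring.slots))  # commands still waiting for the GPU
```

With a capacity of 4 and six submissions, the last two submissions each have to wait for the GPU to free a slot.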

The Primitive Distributor assigns primitives to GPCs. Vertex processing then uses the PolyMorph Engine’s Vertex Fetch to load vertex data into SM registers, where the vertex shader runs.

Shaders (vertex, geometry, pixel) are executed on SM cores. Each shader invocation maps to a thread; 32 threads form a warp that executes the same instruction on different data. The Instruction Dispatch Unit feeds instructions from the Instruction Cache to each core.

Unified Shader Architecture

Early GPUs had separate vertex and pixel shader units. Modern GPUs use a unified shader model where any core can execute any shader type, improving utilization and power efficiency.
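The utilization argument can be illustrated with a small calculation. The sketch below assumes every shader job takes one cycle on one unit, which real hardware does not guarantee; the function names and the workload numbers are hypothetical.

```python
import math


def cycles_fixed(vertex_jobs, pixel_jobs, vertex_units, pixel_units):
    """Dedicated units: each pool only runs its own shader type, so the slower pool sets the time."""
    return max(math.ceil(vertex_jobs / vertex_units),
               math.ceil(pixel_jobs / pixel_units))


def cycles_unified(vertex_jobs, pixel_jobs, total_units):
    """Unified units: any unit runs any job, so all work shares one pool."""
    return math.ceil((vertex_jobs + pixel_jobs) / total_units)


# Pixel-heavy workload on 8 units (4 + 4 when split).
fixed = cycles_fixed(vertex_jobs=2, pixel_jobs=14, vertex_units=4, pixel_units=4)
unified = cycles_unified(vertex_jobs=2, pixel_jobs=14, total_units=8)
print(fixed, unified)
```

In the split design the vertex units sit mostly idle while the pixel units are oversubscribed; the unified pool finishes the same work in half the cycles under these assumptions.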

SIMT (Single Instruction Multiple Threads)

All threads in a warp execute the same instruction on different data. For example, the instruction “add r25 and r26, store in r27” is issued once and performed simultaneously by all cores assigned to the warp, each operating on its own thread's registers.
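The add example can be modelled in a few lines. This is only a sequential sketch of what the hardware does in parallel; the helper `simt_add` and the per-lane input values are made up for illustration.

```python
WARP_SIZE = 32


def simt_add(register_files, dst, src_a, src_b):
    """Apply one 'add' instruction to every lane; each lane reads and writes its own registers."""
    for regs in register_files:  # hardware does this in parallel; the loop is only a model
        regs[dst] = regs[src_a] + regs[src_b]


# One private register file per lane, with lane-specific inputs in r25 and r26.
lanes = [{"r25": lane, "r26": 10 * lane, "r27": 0} for lane in range(WARP_SIZE)]
simt_add(lanes, dst="r27", src_a="r25", src_b="r26")

print(lanes[0]["r27"], lanes[3]["r27"], lanes[31]["r27"])
```

One instruction is decoded, but each lane produces a different result because the operands come from its private registers.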

Warp Scheduling and Masking

If a thread’s condition is false, it is masked out: the warp still advances, but the masked thread does no work. Divergent branches or loops with varying iteration counts waste cycles.
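The cost of masking can be estimated with a simple model of an if/else. It assumes each side of the branch costs one pass over the warp whenever at least one thread takes it; the function names are hypothetical and real schedulers are more nuanced.

```python
def branch_passes(conditions):
    """Return how many passes a warp needs for an if/else, given each thread's condition.

    Each side of the branch costs one pass if at least one thread takes it.
    """
    taken = any(conditions)            # some thread runs the 'if' side
    not_taken = not all(conditions)    # some thread runs the 'else' side
    return int(taken) + int(not_taken)


def lane_utilisation(conditions):
    """Fraction of lane-slots doing useful work across the passes issued."""
    passes = branch_passes(conditions)
    return len(conditions) / (passes * len(conditions))


uniform = [True] * 32                        # whole warp agrees
divergent = [i % 2 == 0 for i in range(32)]  # alternate threads disagree

print(branch_passes(uniform), lane_utilisation(uniform))
print(branch_passes(divergent), lane_utilisation(divergent))
```

When the warp agrees, one pass keeps every lane busy; when it diverges, both sides are executed and half of the lane-slots are masked out.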

Rendering Architectures

IMR (Immediate Mode Rendering) : Each draw call writes directly to the framebuffer, resulting in high bandwidth and power consumption.

TBR (Tile‑Based Rendering) : The screen is divided into tiles; intermediate results are kept in on‑chip memory and written to the framebuffer only after the tile is fully processed, reducing bandwidth.

TBDR (Tile-Based Deferred Rendering) : Extends TBR by deferring fragment shading until per-tile hidden-surface removal has identified the visible fragments, further lowering overdraw.
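The bandwidth difference between IMR and TBR can be sketched by counting framebuffer writes to DRAM. The model is deliberately crude: draws are axis-aligned rectangles `(x, y, w, h)`, the target size is a multiple of the tile size, caches are ignored, and both function names are hypothetical.

```python
def imr_dram_writes(draws):
    """IMR model: every covered pixel of every draw is written to the framebuffer in DRAM."""
    return sum(w * h for (_x, _y, w, h) in draws)


def tbr_dram_writes(draws, width, height, tile):
    """TBR model: shade each tile on-chip, then write the finished tile to DRAM once."""
    writes = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            touched = any(x < tx + tile and tx < x + w and
                          y < ty + tile and ty < y + h
                          for (x, y, w, h) in draws)
            if touched:
                writes += tile * tile  # one resolve per tile, regardless of overdraw
    return writes


# Three full-screen layers on a 64x64 target, rectangles given as (x, y, w, h).
draws = [(0, 0, 64, 64)] * 3
print(imr_dram_writes(draws))
print(tbr_dram_writes(draws, width=64, height=64, tile=16))
```

With three layers of overdraw, the IMR model writes every pixel three times while the tiled model writes each pixel once.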

Optimization Techniques

Register Efficiency : Excessive registers reduce the number of active warps, limiting latency hiding. Pack multiple values into a single float4 when possible.

SIMD Utilisation : Combine operations into a single SIMD instruction, e.g. SIMD_ADD c, a, b.

Co‑issue : The hardware can merge compatible instructions to improve throughput.

Scalar Instruction Shader : Allows flexible instruction pairing when vector units are under‑utilised.

Avoid Heavy Branching : Divergent if/else or variable‑length loops cause warp divergence and stalls.

Minimise Expensive SFU Calls : Limit use of pow, sin, cos and similar functions.

Early‑Z Usage : Render opaque geometry first, then alpha‑tested, then alpha‑blended objects to maximise early‑Z culling.

Tile Management : Clear or discard render targets promptly, avoid frequent render‑target switches, and keep draw‑call counts moderate (hundreds rather than thousands) on mobile GPUs.
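The register-efficiency item in the list above can be made concrete with a rough occupancy estimate. The sketch assumes registers are the only limiting resource and ignores allocation granularity; the register-file size, the warp cap, and the function name `max_resident_warps` are illustrative assumptions rather than figures from this article.

```python
def max_resident_warps(regs_per_thread, regs_per_sm, warp_size=32, hw_warp_limit=48):
    """Warps that fit on one SM when the register file is the only limiting resource."""
    regs_per_warp = regs_per_thread * warp_size
    return min(hw_warp_limit, regs_per_sm // regs_per_warp)


# Illustrative numbers only: a 32768-entry register file and a 48-warp hardware cap.
light = max_resident_warps(regs_per_thread=20, regs_per_sm=32768)
heavy = max_resident_warps(regs_per_thread=63, regs_per_sm=32768)
print(light, heavy)
```

Under these assumptions the register-heavy shader leaves the scheduler a third as many warps to switch between while waiting on memory.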

Register and SIMD Details

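On vector-oriented GPUs, ALUs operate on 4-component vectors (SIMD); scalar architectures instead process one component per thread per instruction. Example:

float4 c = a + b; // on a vec4 ALU, one SIMD instruction adds all four components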

Efficient code packs scale and bias into a single float4 to use one register instead of two:

#define TRANSFORM_TEX(tex, name) (tex.xy * name##_ST.xy + name##_ST.zw)
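The arithmetic performed by the macro above can be mirrored outside a shader. This is a plain-Python sketch of the same scale-and-offset computation; the function name `transform_tex` and the sample values are hypothetical.

```python
def transform_tex(uv, st):
    """Scale and offset a UV pair using one packed 4-tuple: st = (scale_x, scale_y, offset_x, offset_y)."""
    u, v = uv
    scale_x, scale_y, offset_x, offset_y = st
    return (u * scale_x + offset_x, v * scale_y + offset_y)


main_tex_st = (2.0, 4.0, 0.5, 0.25)  # hypothetical tiling (2, 4) and offset (0.5, 0.25)
print(transform_tex((0.5, 0.25), main_tex_st))
```

The `xy` half of the packed value carries the scale and the `zw` half carries the offset, so a single four-component value replaces two two-component ones.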

Co-issue merges independent instructions to keep the ALU busy. When co-issue is not possible (e.g., a variable written by one instruction is also an operand of the instruction it would be paired with), performance may drop.

Control Flow on GPUs

GPUs execute instructions in lock‑step across a warp. Threads that do not satisfy a branch condition are masked out; they still occupy a cycle but perform no work. Excessive masking reduces effective throughput.

Early‑Z and Overdraw

Early-Z culls fragments before pixel shading. It is most effective when rendering order follows Opaque → AlphaTest → AlphaBlend. Operations that disable early-Z (alpha test or clip/discard, writing depth from the pixel shader, alpha blending, a disabled depth test) increase overdraw.
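The effect of draw order on early-Z can be sketched for a single pixel. The model assumes opaque surfaces, a "less than" depth test, and smaller depth meaning closer to the camera; the function name `shaded_fragments` and the depth values are hypothetical.

```python
def shaded_fragments(depths_in_draw_order):
    """Count fragment-shader runs for one pixel when early-Z rejects hidden fragments.

    Smaller depth means closer to the camera; the depth test is 'less than'.
    """
    nearest = float("inf")
    shaded = 0
    for depth in depths_in_draw_order:
        if depth < nearest:   # early-Z passes: the fragment is visible so far
            nearest = depth
            shaded += 1       # only passing fragments reach the pixel shader
    return shaded


layers = [0.2, 0.5, 0.9]                               # three opaque surfaces over one pixel
print(shaded_fragments(sorted(layers)))                # front to back
print(shaded_fragments(sorted(layers, reverse=True)))  # back to front
```

Drawn front to back, only the nearest surface is shaded; drawn back to front, every surface passes the test and is shaded, which is the overdraw that sorting opaque geometry tries to avoid.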

IMR Pseudocode

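for (draw in renderPass) {
    for (primitive in draw) {
        // Vertex shading runs once per vertex of the primitive.
        for (vertex in primitive) {
            execute_vertex_shader(vertex);
        }
        // Culled primitives never reach rasterization; move on to the next one.
        if (primitive is culled) continue;
        for (fragment in primitive) {
            execute_fragment_shader(fragment);
        }
    }
}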

Mobile Rendering Optimisations

Clear or discard tile data when no longer needed.

Avoid frequent RenderTexture switches to reduce tile‑to‑framebuffer copies.

Keep draw‑call and vertex counts moderate to prevent FrameData overflow.

Summary

This article outlines the fundamental building blocks of modern GPUs, their memory hierarchy, and how they drive the rendering pipeline. It also highlights key performance considerations such as register pressure, SIMD utilisation, warp scheduling, branching, and early-Z culling across different rendering architectures (IMR, TBR, TBDR). Understanding these concepts enables developers to write shaders and rendering code that better matches the hardware’s strengths.
