Stream Multiprocessor (SM) Architecture and Execution Pipeline in GPUs
This article provides a comprehensive overview of GPU stream multiprocessors, detailing their micro‑architecture, instruction fetch‑decode‑execute pipeline, SIMT/SIMD organization, warp scheduling, scoreboard mechanisms, and techniques for handling thread divergence and deadlock in GPGPU designs.
Authors: Dr. Chen Wei – expert in compute‑in‑memory/GPU architecture and AI; Dr. Geng Yunchuan – senior SoC and AI accelerator designer.
3.1 Overall Micro‑Architecture
The Stream Multiprocessor (SM) is the core building block of a GPU, analogous to a small CPU that executes multiple thread blocks in parallel and supports instruction‑level parallelism (multiple issue). An SM consists of a SIMT front‑end and a SIMD back‑end, with a six‑stage pipeline: fetch, decode, issue, operand transfer, execution, and write‑back.
Key modules inside an SM include:
Instruction Fetch (I‑Fetch): Sends instruction requests to the instruction cache and updates the program counter.
Instruction Cache (I‑Cache): Supplies cached instructions to the decoder; on a miss, the request is held in a miss‑status holding register (MSHR) until the fill returns.
Decode Unit: Decodes instructions, forwards source/destination register info to the scoreboard and SIMT stack.
SIMT Stack: Manages control‑flow information for divergent branches.
Scoreboard: Tracks pending register writes to avoid hazards.
Instruction Buffer (I‑Buffer): Holds decoded instructions for each warp; each entry has valid and ready bits.
Back‑End Execution Units: Include CUDA cores (ALU), special‑function units, load/store units, and Tensor cores.
Shared Memory: Provides low‑latency storage for data shared among threads in a block.
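The six‑stage pipeline described above can be sketched as a simple simulation. This is an illustrative, non‑cycle‑accurate model: the stage names come from the text, but the ideal one‑instruction‑per‑cycle fetch and the instruction strings are assumptions for the example.

```python
# Minimal sketch of the six-stage SM pipeline: fetch, decode, issue,
# operand transfer, execute, write-back. Not cycle-accurate; assumes an
# ideal I-cache and no stalls.
STAGES = ["fetch", "decode", "issue", "operand_transfer", "execute", "writeback"]

def run_pipeline(instructions):
    """Advance each instruction one stage per cycle until all retire."""
    in_flight = {}          # instruction -> current stage index
    retired = []
    pending = list(instructions)
    cycle = 0
    while pending or in_flight:
        # advance instructions already in the pipeline (oldest first)
        for inst in list(in_flight):
            in_flight[inst] += 1
            if in_flight[inst] == len(STAGES):
                retired.append(inst)        # write-back complete
                del in_flight[inst]
        # fetch a new instruction each cycle (ideal, no I-cache miss)
        if pending:
            in_flight[pending.pop(0)] = 0
        cycle += 1
    return retired, cycle

retired, cycles = run_pipeline(["LDG R1", "FADD R2", "STG R2"])
```

With three instructions and six stages, the pipeline drains in nine cycles, showing how overlap keeps the back end busy once the pipeline fills.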
3.2 Fetch and Decode
Fetch retrieves the next instruction pointed to by the program counter (PC) and stores it in the instruction register. Decode translates the fetched instruction into control signals for the execution units. In GPUs, after decoding, a warp scheduler assigns instructions to appropriate execution pipelines.
Fetch‑decode flow:
Instruction cache reads aligned bytes and places them in registers.
If the cache hits, the instruction proceeds to decode; on a miss, a miss request is generated and the fetch unit moves on to the next warp.
Decoded instructions are buffered in the I‑Buffer, awaiting issue.
Each warp has at least two I‑Buffer entries; each entry carries a valid bit (the entry holds a decoded instruction that has not yet issued) and a ready bit (the instruction is ready to issue).
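The fetch‑decode flow above can be sketched in a few lines. The two‑entry I‑Buffer and the hit/miss behavior follow the text; the dictionary layout, field names, and sample PCs are assumptions made for illustration.

```python
# Sketch of fetch/decode: on an I-cache hit the instruction is decoded
# into a free I-Buffer entry; on a miss the request is recorded (as an
# MSHR would) and the fetcher moves on to the next warp.
class IBufferEntry:
    def __init__(self):
        self.valid = False   # holds a decoded, not-yet-issued instruction
        self.ready = False   # operands available, eligible for issue
        self.inst = None

def fetch_and_decode(warps, icache):
    """Visit each warp once; return the (warp id, pc) pairs that missed."""
    misses = []
    for wid, warp in warps.items():
        pc = warp["pc"]
        if pc in icache:                       # I-cache hit
            for entry in warp["ibuffer"]:
                if not entry.valid:            # free I-Buffer slot
                    entry.valid = True
                    entry.inst = icache[pc]
                    warp["pc"] += 1
                    break
        else:                                  # miss: hold request, next warp
            misses.append((wid, pc))
    return misses

warps = {0: {"pc": 0, "ibuffer": [IBufferEntry(), IBufferEntry()]},
         1: {"pc": 100, "ibuffer": [IBufferEntry(), IBufferEntry()]}}
icache = {0: "FADD R1, R2, R3"}                # warp 1's pc 100 will miss

misses = fetch_and_decode(warps, icache)
```

Warp 0 hits and fills an I‑Buffer entry; warp 1 misses and is skipped, which is exactly why fetch can keep other warps supplied while a miss is outstanding.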
3.3 Issue
Issue moves ready instructions from the I‑Buffer into the execution pipelines. An issue controller selects a warp each cycle and can issue multiple instructions from the same warp if they satisfy:
Warp is not in a barrier‑wait state.
Instruction is marked valid in the I‑Buffer.
Scoreboard permits the operation.
Operand‑access stage of the pipeline is ready.
Memory‑related instructions are sent to the load/store pipeline, while compute instructions go to the stream processor (SP) units.
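The four issue conditions and the routing rule above can be condensed into a predicate. The condition list and the load/store‑versus‑SP split come from the text; the field names, register‑set encoding, and opcode names are illustrative assumptions.

```python
# Sketch of the issue checks: no barrier wait, valid I-Buffer entry,
# scoreboard clearance, and a free operand-access stage.
def can_issue(warp, entry, scoreboard_busy, operand_stage_free):
    return (not warp["at_barrier"]                         # 1. not at a barrier
            and entry["valid"]                             # 2. valid in I-Buffer
            and not (scoreboard_busy & entry["src_regs"])  # 3a. no pending write to a source
            and not (scoreboard_busy & entry["dst_regs"])  # 3b. no pending write to the destination
            and operand_stage_free)                        # 4. operand stage ready

def route(entry):
    """Memory instructions go to the load/store pipe, compute to SP units."""
    return "ldst" if entry["opcode"] in {"LDG", "STG", "LDS"} else "sp"

warp = {"at_barrier": False}
entry = {"valid": True, "opcode": "FADD",
         "src_regs": {2, 3}, "dst_regs": {1}}
ok = can_issue(warp, entry, scoreboard_busy={7}, operand_stage_free=True)
```

Here register R7 has a pending write, but since the FADD touches only R1–R3 it issues and is routed to the SP pipeline.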
3.3.1 SIMT Stack
The SIMT stack handles branch divergence: each divergent branch pushes a new entry containing the target PC, the reconvergence PC, and the active thread mask, and the entry is popped when the warp reaches the reconvergence point, reducing the performance penalty of divergence.
3.3.2 Warp Scheduling and Scoreboard
Warp scheduling aims to hide memory latency by switching to ready warps while others wait for memory accesses. The scoreboard enforces data‑dependency ordering: each register has a flag indicating whether it is being written; subsequent instructions must wait until the flag is cleared, preventing RAW and WAW hazards.
Two scoreboard designs are used:
In‑Order Scoreboard: Suitable for single‑warp scenarios; each register flag is cleared when the write completes.
Dynamic (Out‑of‑Order) Scoreboard: Scales to multiple warps by creating entries for pending writes and allowing concurrent checks, mitigating entry‑overflow and contention issues.
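The scoreboard rule above can be sketched as a per‑warp pending‑write set. The RAW/WAW blocking and clear‑on‑write‑back behavior follow the text; the class shape, per‑warp keying (a simplification of the dynamic design), and register names are illustrative assumptions.

```python
# Sketch of a scoreboard: a register with a pending write blocks later
# readers (RAW) and writers (WAW) until write-back clears its flag.
class Scoreboard:
    def __init__(self):
        self.pending = {}                      # warp id -> set of regs being written

    def can_issue(self, wid, srcs, dsts):
        busy = self.pending.get(wid, set())
        # RAW: a source register awaits a write; WAW: the destination does too
        return not (busy & set(srcs)) and not (busy & set(dsts))

    def reserve(self, wid, dsts):              # called at issue
        self.pending.setdefault(wid, set()).update(dsts)

    def release(self, wid, dsts):              # called at write-back
        self.pending[wid] -= set(dsts)

sb = Scoreboard()
sb.reserve(0, ["R1"])                                       # LDG into R1 in flight
blocked = not sb.can_issue(0, srcs=["R1"], dsts=["R2"])     # RAW hazard on R1
sb.release(0, ["R1"])                                       # load writes back
ok = sb.can_issue(0, srcs=["R1"], dsts=["R2"])              # now safe to issue
```

The dependent instruction stalls only while R1's flag is set; independent warps would key into separate pending sets and continue issuing, which is how scheduling hides the load latency.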
Overall, the GPU front‑end lacks out‑of‑order execution, allowing smaller core sizes and higher density, while the back‑end executes instructions in parallel across specialized units.
Architects' Tech Alliance