Stream Multiprocessor (SM) Architecture and Execution Pipeline in GPUs
This article provides a comprehensive overview of GPU stream multiprocessors, detailing their micro‑architecture, instruction fetch‑decode‑execute pipeline, SIMT/SIMD organization, warp scheduling, scoreboard mechanisms, and techniques for handling thread divergence and deadlock in GPGPU designs.
Authors: Dr. Chen Wei – expert in compute‑in‑memory/GPU architecture and AI; Dr. Geng Yunchuan – senior SoC and AI accelerator designer.
3.1 Overall Micro‑Architecture
The Stream Multiprocessor (SM) is the core building block of a GPU, analogous to a small CPU that executes multiple thread blocks in parallel and supports instruction‑level parallelism (multiple issue). An SM consists of a SIMT front‑end and a SIMD back‑end, with a six‑stage pipeline: fetch, decode, issue, operand transfer, execution, and write‑back.
Key modules inside an SM include:
Instruction Fetch (I‑Fetch): Sends instruction requests to the instruction cache and updates the program counter.
Instruction Cache (I‑Cache): Supplies cached instructions to the decoder; on a miss, the request is held in a miss‑status holding register (MSHR) until the fill returns.
Decode Unit: Decodes instructions, forwards source/destination register info to the scoreboard and SIMT stack.
SIMT Stack: Manages control‑flow information for divergent branches.
Scoreboard: Tracks pending register writes to avoid hazards.
Instruction Buffer (I‑Buffer): Holds decoded instructions for each warp; each entry has valid and ready bits.
Back‑End Execution Units: Include CUDA cores (ALU), special‑function units, load/store units, and Tensor cores.
Shared Memory: Provides low‑latency storage for data shared among threads in a block.
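The six‑stage pipeline described above can be sketched as a simple simulation. This is an illustrative, non‑cycle‑accurate model: the stage names come from the text, but the ideal one‑instruction‑per‑cycle fetch and the instruction strings are assumptions for the example.

```python
# Minimal sketch of the six-stage SM pipeline: fetch, decode, issue,
# operand transfer, execute, write-back. Not cycle-accurate; assumes an
# ideal I-cache and no stalls.
STAGES = ["fetch", "decode", "issue", "operand_transfer", "execute", "writeback"]

def run_pipeline(instructions):
    """Advance each instruction one stage per cycle until all retire."""
    in_flight = {}          # instruction -> current stage index
    retired = []
    pending = list(instructions)
    cycle = 0
    while pending or in_flight:
        # advance instructions already in the pipeline (oldest first)
        for inst in list(in_flight):
            in_flight[inst] += 1
            if in_flight[inst] == len(STAGES):
                retired.append(inst)        # write-back complete
                del in_flight[inst]
        # fetch a new instruction each cycle (ideal, no I-cache miss)
        if pending:
            in_flight[pending.pop(0)] = 0
        cycle += 1
    return retired, cycle

retired, cycles = run_pipeline(["LDG R1", "FADD R2", "STG R2"])
```

With three instructions and six stages, the pipeline drains in nine cycles, showing how overlap keeps the back end busy once the pipeline fills.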
3.2 Fetch and Decode
Fetch retrieves the next instruction pointed to by the program counter (PC) and stores it in the instruction register. Decode translates the fetched instruction into control signals for the execution units. In GPUs, after decoding, a warp scheduler assigns instructions to appropriate execution pipelines.
Fetch‑decode flow:
Instruction cache reads aligned bytes and places them in registers.
If the cache hits, the instruction proceeds to decode; on a miss, a miss request is generated and the fetch unit moves on to the next warp.
Decoded instructions are buffered in the I‑Buffer, awaiting issue.
Each warp has at least two I‑Buffer entries; each entry carries a valid bit (the entry holds a decoded instruction that has not yet issued) and a ready bit (the instruction is ready to issue).
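The fetch‑decode flow above can be sketched in a few lines. The two‑entry I‑Buffer and the hit/miss behavior follow the text; the dictionary layout, field names, and sample PCs are assumptions made for illustration.

```python
# Sketch of fetch/decode: on an I-cache hit the instruction is decoded
# into a free I-Buffer entry; on a miss the request is recorded (as an
# MSHR would) and the fetcher moves on to the next warp.
class IBufferEntry:
    def __init__(self):
        self.valid = False   # holds a decoded, not-yet-issued instruction
        self.ready = False   # operands available, eligible for issue
        self.inst = None

def fetch_and_decode(warps, icache):
    """Visit each warp once; return the (warp id, pc) pairs that missed."""
    misses = []
    for wid, warp in warps.items():
        pc = warp["pc"]
        if pc in icache:                       # I-cache hit
            for entry in warp["ibuffer"]:
                if not entry.valid:            # free I-Buffer slot
                    entry.valid = True
                    entry.inst = icache[pc]
                    warp["pc"] += 1
                    break
        else:                                  # miss: hold request, next warp
            misses.append((wid, pc))
    return misses

warps = {0: {"pc": 0, "ibuffer": [IBufferEntry(), IBufferEntry()]},
         1: {"pc": 100, "ibuffer": [IBufferEntry(), IBufferEntry()]}}
icache = {0: "FADD R1, R2, R3"}                # warp 1's pc 100 will miss

misses = fetch_and_decode(warps, icache)
```

Warp 0 hits and fills an I‑Buffer entry; warp 1 misses and is skipped, which is exactly why fetch can keep other warps supplied while a miss is outstanding.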
3.3 Issue
Issue moves ready instructions from the I‑Buffer into the execution pipelines. An issue controller selects a warp each cycle and can issue multiple instructions from the same warp if they satisfy:
Warp is not in a barrier‑wait state.
Instruction is marked valid in the I‑Buffer.
Scoreboard permits the operation.
Operand‑access stage of the pipeline is ready.
Memory‑related instructions are sent to the load/store pipeline, while compute instructions go to the stream processor (SP) units.
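The four issue conditions and the routing rule above can be condensed into a predicate. The condition list and the load/store‑versus‑SP split come from the text; the field names, register‑set encoding, and opcode names are illustrative assumptions.

```python
# Sketch of the issue checks: no barrier wait, valid I-Buffer entry,
# scoreboard clearance, and a free operand-access stage.
def can_issue(warp, entry, scoreboard_busy, operand_stage_free):
    return (not warp["at_barrier"]                         # 1. not at a barrier
            and entry["valid"]                             # 2. valid in I-Buffer
            and not (scoreboard_busy & entry["src_regs"])  # 3a. no pending write to a source
            and not (scoreboard_busy & entry["dst_regs"])  # 3b. no pending write to the destination
            and operand_stage_free)                        # 4. operand stage ready

def route(entry):
    """Memory instructions go to the load/store pipe, compute to SP units."""
    return "ldst" if entry["opcode"] in {"LDG", "STG", "LDS"} else "sp"

warp = {"at_barrier": False}
entry = {"valid": True, "opcode": "FADD",
         "src_regs": {2, 3}, "dst_regs": {1}}
ok = can_issue(warp, entry, scoreboard_busy={7}, operand_stage_free=True)
```

Here register R7 has a pending write, but since the FADD touches only R1–R3 it issues and is routed to the SP pipeline.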
3.3.1 SIMT Stack
The SIMT stack handles branch divergence: each divergent branch pushes a new entry containing the target PC, the reconvergence PC, and the active thread mask, and the entry is popped when the warp reaches the reconvergence point, reducing the performance penalty of divergence.
3.3.2 Warp Scheduling and Scoreboard
Warp scheduling aims to hide memory latency by switching to ready warps while others wait for memory accesses. The scoreboard enforces data‑dependency ordering: each register has a flag indicating whether it is being written; subsequent instructions must wait until the flag is cleared, preventing RAW and WAW hazards.
Two scoreboard designs are used:
In‑Order Scoreboard: Suitable for single‑warp scenarios; each register flag is cleared when the write completes.
Dynamic (Out‑of‑Order) Scoreboard: Scales to multiple warps by creating entries for pending writes and allowing concurrent checks, mitigating entry‑overflow and contention issues.
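The scoreboard rule above can be sketched as a per‑warp pending‑write set. The RAW/WAW blocking and clear‑on‑write‑back behavior follow the text; the class shape, per‑warp keying (a simplification of the dynamic design), and register names are illustrative assumptions.

```python
# Sketch of a scoreboard: a register with a pending write blocks later
# readers (RAW) and writers (WAW) until write-back clears its flag.
class Scoreboard:
    def __init__(self):
        self.pending = {}                      # warp id -> set of regs being written

    def can_issue(self, wid, srcs, dsts):
        busy = self.pending.get(wid, set())
        # RAW: a source register awaits a write; WAW: the destination does too
        return not (busy & set(srcs)) and not (busy & set(dsts))

    def reserve(self, wid, dsts):              # called at issue
        self.pending.setdefault(wid, set()).update(dsts)

    def release(self, wid, dsts):              # called at write-back
        self.pending[wid] -= set(dsts)

sb = Scoreboard()
sb.reserve(0, ["R1"])                                       # LDG into R1 in flight
blocked = not sb.can_issue(0, srcs=["R1"], dsts=["R2"])     # RAW hazard on R1
sb.release(0, ["R1"])                                       # load writes back
ok = sb.can_issue(0, srcs=["R1"], dsts=["R2"])              # now safe to issue
```

The dependent instruction stalls only while R1's flag is set; independent warps would key into separate pending sets and continue issuing, which is how scheduling hides the load latency.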
Overall, the GPU front‑end lacks out‑of‑order execution, allowing smaller core sizes and higher density, while the back‑end executes instructions in parallel across specialized units.
Architects' Tech Alliance