
Inside NVIDIA’s Streaming Multiprocessor: How GPUs Execute Parallel Workloads

This article provides a detailed technical overview of the Streaming Multiprocessor (SM) in modern NVIDIA GPUs, covering its microarchitecture, the instruction fetch and decode pipeline, warp scheduling, SIMT stack handling, scoreboard mechanisms, and the strategies used to hide memory latency and maximize parallel execution efficiency.


Overall Microarchitecture

The Streaming Multiprocessor (SM) is the core building block of a GPU. A kernel grid is distributed across the GPU's SMs: each SM hosts multiple thread blocks, and each thread block is divided into warps (NVIDIA) or wavefronts (AMD) of threads. An SM functions like a small CPU core, supporting instruction-level parallelism (ILP) through multi-issue execution, but it generally does not perform out-of-order execution. Each warp executes one instruction across all of its threads in SIMD fashion.
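As a concrete illustration, the short CUDA kernel below (a minimal sketch of our own; the fixed warp size of 32 is the value NVIDIA hardware has used to date) recovers a thread's warp and lane indices from its position within the block:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each block is carved into warps of 32 consecutive threads, so a
    // thread's warp and lane indices follow directly from threadIdx.x.
    __global__ void whoAmI() {
        int warpId = threadIdx.x / 32;  // which warp within the block
        int laneId = threadIdx.x % 32;  // position within that warp
        if (laneId == 0) {
            printf("block %d, warp %d begins at thread %d\n",
                   blockIdx.x, warpId, threadIdx.x);
        }
    }

    int main() {
        whoAmI<<<2, 128>>>();  // a grid of 2 blocks, 4 warps per block
        cudaDeviceSynchronize();
        return 0;
    }

Launching with 128 threads per block yields four warps per block; the SM schedules those warps, not individual threads.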

[Figure: SM position in GPU architecture]

The SM pipeline is divided into six stages: fetch, decode, issue, operand transfer, execution, and write-back. Conceptually it splits into a SIMT front-end (fetch, decode, and issue, including the SIMT stack and scoreboard) and a SIMD back-end (operand transfer, execution, and write-back).

[Figure: SM pipeline stages]

Fetch and Decode

Instruction fetch (I-Fetch) reads the next instruction from memory at the address in the program counter (PC) and places it in the instruction register. The fetch unit sends requests to the instruction cache (I-Cache). On a hit, the instruction is forwarded to the decode unit; on a miss, the request is recorded in a miss-status holding register (MSHR) and retried once the line has been filled.

Decode translates the fetched instruction into control signals for the execution units and forwards operand and destination-register information to the SIMT stack and the scoreboard.

[Figure: Fetch-decode structure]

Instruction Cache (I-Cache): Stores fetched instruction blocks; on a miss, the request is recorded in the MSHR.

Instruction Buffer (I-Buffer): Holds decoded instructions for each warp. Each entry tracks a valid bit and a ready bit.

SIMT Stack: Manages control-flow information for divergent branches.

Scoreboard: Tracks pending register writes to avoid hazards.
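To make the interplay of these structures concrete, here is a minimal host-side sketch of one warp's fetch step. Every name, field, and size is illustrative, not an actual NVIDIA interface; the I-Cache probes are stand-in function pointers and the instruction size is assumed fixed:

    #include <cstdint>
    #include <deque>

    // Illustrative stand-ins for the structures listed above.
    struct IBufferEntry {
        uint64_t inst = 0;   // decoded instruction bits
        bool valid = false;  // slot holds a decoded instruction
        bool ready = false;  // scoreboard reports no outstanding hazards
    };

    struct WarpState {
        uint32_t pc = 0;      // next fetch address for this warp
        IBufferEntry ibuf[2]; // per-warp I-Buffer slots (count illustrative)
    };

    struct Mshr {
        uint32_t missAddr; // address of the outstanding I-Cache miss
        int warpId;        // warp waiting on this line
    };

    // One fetch attempt: on a hit, decode into a free I-Buffer slot;
    // on a miss, park the request in an MSHR and retry after the fill.
    bool fetchStep(WarpState& w, int warpId,
                   bool (*icacheHit)(uint32_t),      // stand-in I-Cache probe
                   uint64_t (*icacheRead)(uint32_t), // stand-in I-Cache read
                   std::deque<Mshr>& mshrs) {
        for (IBufferEntry& e : w.ibuf) {
            if (e.valid) continue;            // need a free slot
            if (!icacheHit(w.pc)) {           // miss: record it and bail
                mshrs.push_back({w.pc, warpId});
                return false;
            }
            e.inst  = icacheRead(w.pc);       // hit: hand off to decode
            e.valid = true;
            e.ready = false;                  // scoreboard flips this later
            w.pc   += 16;                     // fixed-size encoding assumed
            return true;
        }
        return false;                         // I-Buffer full this cycle
    }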

Issue

After decoding, the warp scheduler selects a ready warp and issues one or more of its instructions per cycle to the appropriate execution units. An instruction can be issued only if the warp is not stalled at a barrier, the instruction is marked valid in the I-Buffer, the scoreboard reports no outstanding hazards on its registers, and the operand-fetch stage can accept it.

Memory‑related instructions (load/store) are sent to the memory pipeline, while arithmetic/logic instructions go to the CUDA cores (ALUs) or specialized units such as Tensor cores.
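The four issue conditions above can be collapsed into a single predicate. The sketch below is our own shorthand with deliberately simplified types; a 32-bit register bitmask stands in for a full register file:

    #include <cstdint>

    // Simplified stand-ins for the issue-stage state described above.
    struct IBufEntry  { bool valid; uint32_t regsUsed; }; // sources + destination as a bitmask
    struct Warp       { bool atBarrier; };
    struct Scoreboard {
        uint32_t pendingWrites; // bit r set => register r has a write in flight
        bool clears(const IBufEntry& e) const {
            // Block on any overlap with a pending write (RAW/WAW guard).
            return (e.regsUsed & pendingWrites) == 0;
        }
    };
    struct OperandStage { bool freeSlot; };

    // A warp's oldest buffered instruction may issue only when all four
    // conditions from the text hold simultaneously.
    bool canIssue(const Warp& w, const IBufEntry& e,
                  const Scoreboard& sb, const OperandStage& op) {
        return !w.atBarrier && e.valid && sb.clears(e) && op.freeSlot;
    }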

SIMT Stack

The SIMT stack handles divergent branches. Each entry records a path's next PC, its active-thread mask, and a reconvergence PC (typically the branch's immediate post-dominator); on divergence, one entry is pushed per branch target. The warp executes one path at a time, and when it reaches the reconvergence PC of the top entry, that entry is popped and the merged paths resume executing together. This mechanism bounds the performance penalty of branch divergence: the paths are serialized, but the warp recovers its full width at the reconvergence point.
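A compact sketch of stack maintenance under divergence, with hypothetical names of our own choosing:

    #include <cstdint>
    #include <vector>

    // Illustrative SIMT stack entry; field names are ours, not NVIDIA's.
    struct SimtEntry {
        uint32_t pc;    // next instruction for this path
        uint32_t rpc;   // reconvergence PC (immediate post-dominator)
        uint32_t mask;  // threads active on this path
    };

    // Divergent branch: retarget the current entry to the reconvergence
    // point, then push one entry per non-empty path (taken path on top).
    void onDivergence(std::vector<SimtEntry>& st, uint32_t rpc,
                      uint32_t takenPc, uint32_t takenMask,
                      uint32_t fallPc,  uint32_t fallMask) {
        st.back().pc = rpc;  // resume here after both paths finish
        if (fallMask)  st.push_back({fallPc,  rpc, fallMask});
        if (takenMask) st.push_back({takenPc, rpc, takenMask});
    }

    // Checked each cycle: pop once the active path reaches its
    // reconvergence PC, restoring the wider thread mask beneath it.
    void onReconvergence(std::vector<SimtEntry>& st, uint32_t nextPc) {
        if (!st.empty() && nextPc == st.back().rpc) st.pop_back();
    }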

Warp Scheduling and Scoreboard

The warp scheduler aims to hide memory latency by switching to warps whose operands are ready while others wait on memory. Different instruction types are dispatched to dedicated units (LD/ST, INT, FP). Given enough resident warps, even a simple round-robin policy can completely mask memory delays, because the scheduler can always find some warp with a ready instruction.
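A loose round-robin pick can be sketched in a few lines; `ready` here is a hypothetical per-warp flag meaning the warp's next instruction passes all issue checks:

    #include <vector>

    // Pick the next warp to issue, starting just after the last winner so
    // that every ready warp gets a turn (loose round-robin).
    int pickWarp(const std::vector<bool>& ready, int lastIssued) {
        int n = static_cast<int>(ready.size());
        for (int i = 1; i <= n; ++i) {
            int w = (lastIssued + i) % n;
            if (ready[w]) return w;  // first ready warp in rotation order
        }
        return -1;                   // nothing ready: the SM stalls a cycle
    }

When one warp's load is still in flight, the loop simply settles on another warp; with enough resident warps the execution units never go idle.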

Scoreboard entries record the write status of each register. An instruction that reads or writes a register must wait until any pending write to that register completes, preventing read-after-write (RAW) and write-after-write (WAW) hazards. For a single warp, a simple in-order scoreboard suffices; with many warps in flight, a dynamic scoreboard is used, holding several entries per warp, one for each pending write.
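The lifecycle of a dynamic scoreboard entry, reserve at issue and release at write-back, can be sketched as follows (a 32-bit mask per warp keeps the example short; real register files are far larger):

    #include <cstdint>

    constexpr int kMaxWarps = 64;  // illustrative resident-warp limit

    struct DynamicScoreboard {
        // One mask per warp: bit r set => register r has a write in flight.
        uint32_t pending[kMaxWarps] = {};

        void reserve(int warp, int reg) { pending[warp] |=  (1u << reg); } // at issue
        void release(int warp, int reg) { pending[warp] &= ~(1u << reg); } // at write-back

        // An instruction stalls if any register it reads or writes overlaps
        // a pending write: RAW and WAW protection in a single test.
        bool blocked(int warp, uint32_t regsUsed) const {
            return (pending[warp] & regsUsed) != 0;
        }
    };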

[Figure: Dynamic scoreboard entry flow]

Written by Architects' Tech Alliance

Sharing project experiences and insights into cutting-edge architectures, with a focus on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, and industry practices and solutions.