How CPUs Execute Programs: From Fetch to Cache and Multithreading Explained

This article explains the core principles of CPU operation, covering instruction fetching, decoding, execution cycles, register types, pipeline and superscalar architectures, multi‑core and hyper‑threading designs, as well as the hierarchy of caches from registers to L3, providing a comprehensive overview of modern processor fundamentals.

ITPUB

Aug 14, 2019

CPU Execution Cycle and ISA

When a program is loaded into memory, the CPU repeatedly performs three steps: fetch the next instruction from memory, decode it to determine the operation type and operands, and execute the operation. This fetch‑decode‑execute loop is the fundamental CPU cycle and continues until the program terminates.

Each processor implements a specific instruction set architecture (ISA). Only binaries compiled for that ISA can run natively on the CPU; an x86 binary, for example, will not run on an ARM processor without emulation or translation. The ISA defines the set of machine instructions that the control unit can interpret.
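The loop described above can be sketched as a toy interpreter. This is a minimal illustration, not a model of any real processor: the four opcodes (`LOAD`, `ADD`, `STORE`, `HALT`) and the `run` function are invented for this example. The rejection of unknown opcodes mirrors the way a CPU can only interpret instructions defined by its ISA.

```python
# Toy machine: each instruction is an (opcode, operand) pair.
# The opcodes are hypothetical and do not belong to any real ISA.

def run(program, memory):
    pc = 0    # program counter: index of the next instruction
    acc = 0   # accumulator: holds the intermediate result
    while True:
        opcode, operand = program[pc]   # fetch
        pc += 1
        # decode the opcode, then execute the matching operation
        if opcode == "LOAD":
            acc = memory[operand]
        elif opcode == "ADD":
            acc += memory[operand]
        elif opcode == "STORE":
            memory[operand] = acc
        elif opcode == "HALT":
            return memory
        else:
            # An instruction outside the "ISA" cannot be interpreted.
            raise ValueError(f"unknown opcode: {opcode}")


program = [("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", None)]
result = run(program, [2, 3, 0])   # memory cell 2 receives 2 + 3
```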

Core Registers and ALU

To avoid the latency of main‑memory accesses, CPUs provide a small set of registers. General‑purpose registers hold frequently used variables and temporary data, while special‑purpose registers track the state of execution. Typical special‑purpose registers include:

MAR (Memory Address Register): holds the address of the memory location to be accessed.

MDR (Memory Data Register): holds data read from or to be written to memory.

AC (Accumulator): stores intermediate results of arithmetic or logical operations.

PC (Program Counter): contains the address of the next instruction to fetch.

CIR (Current Instruction Register): holds the instruction currently being executed.

PSW (Program Status Word): contains condition‑code bits set by comparison and arithmetic instructions, along with control bits such as the privilege level and execution mode (kernel vs. user).

SP (Stack Pointer): points to the top of the current call stack, enabling function‑call frame management.
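How several of these registers cooperate during a single instruction fetch can be sketched as follows. The `Registers` class and `fetch` function are hypothetical names chosen for this illustration, and the one‑word‑per‑instruction memory is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Registers:
    pc: int = 0          # Program Counter
    mar: int = 0         # Memory Address Register
    mdr: object = None   # Memory Data Register
    cir: object = None   # Current Instruction Register
    ac: int = 0          # Accumulator


def fetch(regs, memory):
    regs.mar = regs.pc            # address of the next instruction goes to MAR
    regs.mdr = memory[regs.mar]   # memory returns the addressed word into MDR
    regs.cir = regs.mdr           # the word is copied to CIR for decoding
    regs.pc += 1                  # PC now points at the following instruction
    return regs


regs = fetch(Registers(), ["LOAD 5", "ADD 6"])
```

After the call, `regs.cir` holds the first instruction and `regs.pc` has advanced to 1, ready for the next fetch.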

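The Arithmetic Logic Unit (ALU) implements basic operations (ADD, SUB, NOT, AND, OR). Multiplication and division require more work: conceptually they decompose into sequences of shifts and additions or subtractions. Even on CPUs with dedicated multiplier and divider circuits, they generally have higher latency than addition or logical operations.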
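As an illustration of how multiplication can be composed from simpler operations, the sketch below multiplies two non‑negative integers using only shifts, AND, and ADD. It shows the idea only; real hardware multipliers are organized differently, and the `shift_add_multiply` name is made up for this example.

```python
def shift_add_multiply(a, b):
    """Multiply two non-negative integers using only shift, AND, and add.

    Returns (product, steps), where steps counts loop iterations,
    i.e. one per bit of the multiplier b.
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative integers only")
    product = 0
    steps = 0
    while b:
        if b & 1:          # lowest bit of b set: add the shifted multiplicand
            product += a
        a <<= 1            # next bit of b is worth twice as much
        b >>= 1            # move on to the next bit of b
        steps += 1
    return product, steps


product, steps = shift_add_multiply(13, 11)   # 11 is 1011 in binary: 4 steps
```

A single addition takes one ALU operation, whereas this multiplication needs one iteration per bit of the multiplier, which is the intuition behind the higher cost.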

Context Switch

During a process switch, the operating system saves the register contents of the outgoing process onto its kernel stack (or into a per‑process control block) and restores the saved registers of the incoming process. Because the full execution state is preserved, each process later resumes exactly where it left off.
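The save‑and‑restore step can be sketched with a dictionary standing in for the CPU's register file. Everything here is hypothetical (the `Process` class, the `context_switch` function, and the choice of four registers); a real kernel does this in architecture‑specific low‑level code.

```python
class Process:
    """Stand-in for a process control block holding saved registers."""

    def __init__(self, name):
        self.name = name
        self.saved = {"pc": 0, "sp": 0, "ac": 0, "psw": "user"}


def context_switch(cpu, outgoing, incoming):
    outgoing.saved = dict(cpu)   # save a copy of the live registers
    cpu.clear()
    cpu.update(incoming.saved)   # restore the incoming process's registers


cpu = {"pc": 120, "sp": 4000, "ac": 7, "psw": "user"}   # process A is running
proc_a, proc_b = Process("A"), Process("B")

context_switch(cpu, proc_a, proc_b)   # B starts from its saved (initial) state
cpu["pc"] = 8                         # B makes some progress
context_switch(cpu, proc_b, proc_a)   # A resumes exactly where it stopped
```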

Pipeline, Superscalar Execution, and Hyper‑Threading

Modern CPUs break the fetch‑decode‑execute sequence into separate hardware units, so the three stages can work on different instructions at the same time. A simple three‑stage pipeline, observed in a single clock cycle, works as follows:

Stage 1 (Fetch)   → reads instruction n+2
Stage 2 (Decode)  → decodes instruction n+1
Stage 3 (Execute) → executes instruction n
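The stage occupancy above can be generated cycle by cycle with a short sketch. It assumes an idealized pipeline in which every stage takes exactly one cycle and nothing ever stalls; `pipeline_schedule` is a name invented for this example.

```python
STAGES = ["Fetch", "Decode", "Execute"]


def pipeline_schedule(n_instructions):
    """Return one row per cycle listing which instruction index occupies
    each stage (None means the stage is empty in that cycle)."""
    total_cycles = n_instructions + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = []
        for stage_index in range(len(STAGES)):
            instr = cycle - stage_index   # instruction that entered this many cycles ago
            row.append(instr if 0 <= instr < n_instructions else None)
        schedule.append(row)
    return schedule


schedule = pipeline_schedule(4)
for cycle, row in enumerate(schedule):
    print(cycle, dict(zip(STAGES, row)))
```

Under these assumptions, four instructions finish in 6 cycles instead of the 12 that strictly sequential execution of three one‑cycle stages would take.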

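More advanced designs are superscalar: multiple fetch/decode units operate in parallel and feed a shared pool of execution units, so several instructions can start in the same cycle. This increases instruction‑level parallelism, provided the instructions do not depend on each other's results.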
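The effect of issue width, and the way data dependencies limit it, can be illustrated with a simplified model. It assumes in‑order issue, single‑cycle execution, and results that become available in the cycle after issue; the `issue_cycles` function is hypothetical and ignores many details of real hardware.

```python
def issue_cycles(deps, width):
    """Count cycles needed to issue all instructions in program order.

    deps[i] is the set of earlier instruction indices whose results
    instruction i needs. Up to `width` instructions issue per cycle,
    and an instruction may issue only after all of its dependencies
    were issued in an earlier cycle.
    """
    issued_in = {}   # instruction index -> cycle in which it issued
    cycle = 0
    i = 0
    while i < len(deps):
        cycle += 1
        issued = 0
        while (i < len(deps) and issued < width
               and all(issued_in[d] < cycle for d in deps[i])):
            issued_in[i] = cycle
            i += 1
            issued += 1
    return cycle


independent = [set(), set(), {0, 1}, set()]   # only instruction 2 has dependencies
chain = [set(), {0}, {1}, {2}]                # each instruction needs the previous one

scalar_cycles = issue_cycles(independent, width=1)
wide_cycles = issue_cycles(independent, width=2)
chain_cycles = issue_cycles(chain, width=2)
```

With a width of 2, the mostly independent sequence takes half as many cycles as with a width of 1, while the fully dependent chain gains nothing from the extra width.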

Hyper‑threading (Intel's name for simultaneous multithreading) presents multiple logical CPUs per physical core. Each logical CPU has its own architectural state (registers and program counter), but the threads share the core's execution resources (ALUs, caches, pipelines). The core does not gain extra execution units; instead, when one thread stalls, for example while waiting on memory, or leaves units idle, instructions from the other thread can use them. This improves overall throughput, though two hardware threads on one core deliver less than two full cores would.

Multicore Architecture

Physical CPUs are installed in motherboard sockets; each CPU may contain several cores. The operating system treats each core as an independent logical CPU, and each hardware thread (hyper‑thread) as an additional logical processor. Effective utilization requires the scheduler to distribute threads across cores rather than concentrating many threads on a single core.
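A program can ask the operating system how many logical CPUs it sees. The sketch below uses Python's standard `os` module; `os.sched_getaffinity` exists only on some platforms (such as Linux), so its use is guarded, and the variable names are chosen for this example.

```python
import os

# Number of logical CPUs reported by the OS (cores times hardware threads).
# os.cpu_count() may return None if the count cannot be determined.
logical_cpus = os.cpu_count()

# On platforms that support it, the set of CPUs this process may run on
# can be smaller than the system-wide count.
if hasattr(os, "sched_getaffinity"):
    usable_cpus = len(os.sched_getaffinity(0))
else:
    usable_cpus = logical_cpus

print("logical CPUs:", logical_cpus, "usable by this process:", usable_cpus)
```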

Privilege Modes and System Calls

Most CPUs support at least two execution modes controlled by a bit in the PSW:

Kernel (privileged) mode: can execute all instructions, access all hardware resources, and modify control registers.

User mode: restricted to a subset of instructions; I/O and memory‑protection operations are prohibited.

When a user‑mode program needs a privileged operation, it issues a system call. The program executes a trap instruction, which switches the CPU to kernel mode and transfers control to the operating system. The kernel performs the requested operation and then returns to user mode via a return‑from‑trap instruction.
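From a user program's point of view, a system call looks like an ordinary function call. In the sketch below, the functions from Python's standard `os` module are thin wrappers around the corresponding operating‑system calls; the trap into kernel mode happens inside them and is not visible in the code.

```python
import os

# Each call below asks the kernel to act on the program's behalf.
pid = os.getpid()                      # ask the kernel for this process's ID

read_fd, write_fd = os.pipe()          # kernel creates a pipe, returns two descriptors
written = os.write(write_fd, b"hello") # kernel copies the bytes into the pipe
data = os.read(read_fd, 5)             # kernel copies them back out

os.close(read_fd)                      # release the kernel resources
os.close(write_fd)
```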

Cache Hierarchy

Beyond registers, CPUs provide a multi‑level cache hierarchy to bridge the speed gap between the core and main memory:

L1 cache (per‑core): usually split into an L1 instruction cache (L1‑icache) for instructions and an L1 data cache (L1‑dcache) for frequently accessed data. Smallest and fastest (tens of kilobytes, roughly 1 ns or a few clock cycles of latency).

L2 cache: larger (hundreds of kilobytes to a few megabytes) and somewhat slower than L1. May be private to a core or shared among a few cores. Stores data likely to be reused soon.

L3 cache: typically shared across all cores on the die. Larger (several megabytes or more) but slower than L2.

This hierarchy reduces average memory access latency and improves overall throughput.
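The effect on average latency can be estimated with a small calculation. The latencies and hit rates below are assumed round numbers chosen for illustration, not measurements of any particular CPU, and `average_access_time` is a hypothetical helper.

```python
def average_access_time(levels, memory_latency_ns):
    """Expected latency of one memory access through a cache hierarchy.

    levels: list of (lookup_latency_ns, hit_rate) pairs, from L1 outward.
    An access that misses at one level pays that level's lookup latency
    and then proceeds to the next level, ending at main memory.
    """
    expected = 0.0
    reach_probability = 1.0   # fraction of accesses that get this far
    for latency_ns, hit_rate in levels:
        expected += reach_probability * latency_ns
        reach_probability *= 1.0 - hit_rate
    return expected + reach_probability * memory_latency_ns


# Assumed values: L1 = 1 ns / 95% hits, L2 = 4 ns / 80%, L3 = 20 ns / 50%.
levels = [(1.0, 0.95), (4.0, 0.80), (20.0, 0.50)]
with_caches = average_access_time(levels, memory_latency_ns=100.0)
without_caches = average_access_time([], memory_latency_ns=100.0)
```

With these assumed numbers, the average access costs about 1.9 ns instead of 100 ns, because only 0.5% of accesses go all the way to main memory.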
