How CPUs Execute Programs: From Fetch‑Decode‑Execute to Multicore & Cache
This article explains the core principles of CPU operation: the fetch‑decode‑execute cycle, instruction sets and registers, pipelined and superscalar designs, multicore and hyper‑threading behavior, and the cache hierarchy from registers through L1‑L3. It also covers privilege modes and system calls, and shows how each mechanism affects program execution.
CPU Execution Cycle
When a program is loaded into memory the CPU repeatedly performs three operations: fetch the next instruction from memory, decode the instruction to determine its opcode and operands, and execute the operation. This fetch‑decode‑execute loop continues until the program terminates or a trap transfers control to the operating system.
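The loop described above can be sketched as a toy interpreter. This is a minimal illustration, not a model of any real processor: the opcodes LOAD, ADD, STORE and HALT, the single accumulator, and the list used as memory are all assumptions made for the example.

```python
# Toy machine: each instruction is an (opcode, operand) pair.
# The opcodes below are invented for this sketch, not taken from a real ISA.

def run(program, memory):
    pc = 0    # program counter: index of the next instruction
    acc = 0   # a single accumulator register
    while True:
        instruction = program[pc]       # fetch
        pc += 1                         # advance to the following instruction
        opcode, operand = instruction   # decode into opcode and operand
        if opcode == "LOAD":            # execute
            acc = memory[operand]
        elif opcode == "ADD":
            acc += memory[operand]
        elif opcode == "STORE":
            memory[operand] = acc
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode: {opcode}")


program = [("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", None)]
print(run(program, [2, 3, 0]))  # [2, 3, 5]
```

The program adds the values in memory cells 0 and 1 and stores the sum in cell 2, then halts.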
Instruction Set Architecture (ISA) and Registers
Each processor implements a specific ISA (e.g., x86, ARM). An ISA defines the instructions the hardware can execute, their binary encoding, and the registers visible to software. Compilers and libraries build higher‑level abstractions on top of it, but the ISA itself is fixed by the silicon.
Control and status registers
PC (Program Counter) : Holds the address of the next instruction to fetch. After each fetch the PC is incremented or updated by a branch.
Stack Pointer (SP) : Points to the top of the current call stack, which stores return addresses, function parameters, local variables and saved registers.
PSW (Program Status Word) : Contains control bits such as the current privilege level, interrupt enable flags, and condition codes.
Internal registers used during instruction fetch and execution
MAR (Memory Address Register) : Holds the physical address of the memory location to be accessed.
MDR (Memory Data Register) : Holds the data read from memory or the data to be written to memory.
AC (Accumulator) : Temporary storage for arithmetic and logic results inside the ALU.
CIR (Current Instruction Register) : Holds the instruction that has just been fetched while it is decoded and executed.
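To show how these registers cooperate, the fetch step can be written as a sequence of register transfers. The sketch below is a simplified textbook model; the Registers class, the fetch function, and the use of a Python list as memory are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Registers:
    pc: int = 0         # address of the next instruction
    mar: int = 0        # address currently being accessed
    mdr: object = None  # value travelling to or from memory
    cir: object = None  # instruction being decoded and executed
    ac: int = 0         # accumulator for ALU results


def fetch(regs, memory):
    regs.mar = regs.pc           # MAR <- PC
    regs.mdr = memory[regs.mar]  # MDR <- memory[MAR]
    regs.cir = regs.mdr          # CIR <- MDR
    regs.pc += 1                 # PC  <- PC + 1
    return regs.cir


memory = ["LOAD 7", "ADD 8"]     # hypothetical instruction words
regs = Registers()
print(fetch(regs, memory), regs.pc, regs.mar)  # LOAD 7 1 0
```

After one fetch the CIR holds the first instruction, the MAR still holds the address it came from, and the PC already points at the next instruction.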
Arithmetic Logic Unit (ALU) and Control Unit (CU)
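The ALU performs integer arithmetic (addition, subtraction, multiplication, division) and logical operations (AND, OR, NOT, XOR). Addition and logical operations typically complete in a single cycle. Multiplication usually relies on a dedicated hardware multiplier and takes a few cycles, while division is generally computed iteratively and takes considerably longer; simpler cores may instead implement both as multi‑step sequences. Either way, they are slower than addition or logical ops.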
The Control Unit orchestrates data movement: it fetches operands from registers or memory, routes them to the ALU, and writes the result back to the appropriate destination. The CU also generates control signals that select which ALU operation to perform based on the decoded opcode.
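The idea of a control signal selecting the ALU operation can be sketched as a table lookup. This is a minimal model under stated assumptions: the operation names, the 8-bit default width, and the single zero flag are chosen for the example and do not describe a particular processor.

```python
import operator

# Hypothetical control signals mapped to the operation the ALU performs.
ALU_OPS = {
    "ADD": operator.add,
    "SUB": operator.sub,
    "AND": operator.and_,
    "OR": operator.or_,
    "XOR": operator.xor,
}


def alu(op, a, b, width=8):
    """Apply the selected operation and return (result, zero_flag)."""
    mask = (1 << width) - 1            # keep only the low `width` bits
    result = ALU_OPS[op](a, b) & mask  # wrap around like fixed-width hardware
    return result, result == 0


print(alu("ADD", 200, 100))  # (44, False): 300 wraps around modulo 256
print(alu("SUB", 5, 5))      # (0, True): the zero flag is set
```

In this sketch the control unit's job reduces to choosing the key passed as op; in hardware the same role is played by control lines derived from the decoded opcode.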
Example of a simple arithmetic operation:
c = a + b
Pipeline and Superscalar Architecture
Modern CPUs split the fetch‑decode‑execute cycle into separate pipeline stages so that multiple instructions can be in different stages simultaneously. A classic three‑stage pipeline consists of:
Fetch unit – reads instruction n+2 from memory.
Decode unit – decodes instruction n+1.
Execute unit – executes instruction n.
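This overlapping does not shorten the latency of an individual instruction, which still passes through every stage. It increases throughput: once the pipeline is full, one instruction can complete every cycle.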
Superscalar processors contain several parallel pipelines. For example, two independent fetch‑decode‑execute pipelines can each fetch, decode, and execute different instructions in the same clock cycle, increasing instruction‑level parallelism.
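The benefit of pipelining and of multiple pipelines can be estimated with a small cycle count. The sketch below is idealized: it assumes every stage takes one cycle, all instructions are independent, and there are no stalls, branch mispredictions, or cache misses. The function names and the width parameter are introduced here for illustration.

```python
import math


def unpipelined_cycles(n, stages=3):
    # Each instruction passes through every stage before the next one starts.
    return n * stages


def pipelined_cycles(n, stages=3, width=1):
    # `stages` cycles to fill the pipeline, then `width` instructions
    # complete per cycle (width > 1 models an ideal superscalar core).
    if n == 0:
        return 0
    return stages + math.ceil(n / width) - 1


n = 100
print(unpipelined_cycles(n))         # 300
print(pipelined_cycles(n))           # 102
print(pipelined_cycles(n, width=2))  # 52
```

Under these assumptions the three-stage pipeline approaches a 3x throughput gain for long instruction sequences, and a second pipeline roughly doubles it again. Real code falls short of both figures because of dependencies between instructions.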
In the source article the basic loop is expressed as:
fetch -> decode -> execute
Multicore, Hyper‑Threading and Context Switching
A physical CPU socket may contain a single chip with multiple cores. Without hyper‑threading, each core appears to the operating system as one independent logical processor. With hyper‑threading (simultaneous multithreading), each core exposes two or more logical processors that share the core’s execution resources.
When the OS schedules a different thread on the same core, it performs a context switch: the current thread’s register state (including PC, SP, PSW, and all general‑purpose registers) is saved to the thread’s kernel stack, and the saved state of the next thread is restored.
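The save-and-restore step can be sketched as follows. This is a simplified model, not how any particular kernel implements it: the CpuState and Thread classes, the four general-purpose registers, and the context_switch function are assumptions for the example, and the saved state is kept on the thread object rather than on a kernel stack.

```python
from dataclasses import dataclass, field


@dataclass
class CpuState:
    pc: int = 0
    sp: int = 0
    psw: int = 0
    gpr: list = field(default_factory=lambda: [0, 0, 0, 0])


@dataclass
class Thread:
    name: str
    saved: CpuState = field(default_factory=CpuState)


def context_switch(cpu, current, nxt):
    # Save the running thread's register state.
    current.saved = CpuState(cpu.pc, cpu.sp, cpu.psw, list(cpu.gpr))
    # Restore the state the next thread had when it last ran.
    cpu.pc, cpu.sp, cpu.psw = nxt.saved.pc, nxt.saved.sp, nxt.saved.psw
    cpu.gpr = list(nxt.saved.gpr)


cpu = CpuState(pc=100, sp=0x7000, gpr=[1, 2, 3, 4])
a = Thread("A")
b = Thread("B", CpuState(pc=500, sp=0x9000, gpr=[9, 9, 9, 9]))

context_switch(cpu, a, b)
print(cpu.pc, a.saved.pc)  # 500 100
```

Switching back with context_switch(cpu, b, a) restores thread A exactly where it left off, which is why a thread never notices that it was paused.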
Hyper‑threading improves pipeline utilization by allowing two instruction streams to occupy otherwise idle execution slots, but it does not double the raw computational throughput because the threads compete for the same functional units.
Cache Hierarchy
Beyond registers, CPUs provide a multi‑level cache hierarchy to bridge the speed gap between the core and main memory.
Registers : <1 ns latency, <1 KB total per core.
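L1 cache : Private to each core, split into L1‑icache (instructions) and L1‑dcache (data). Typical size 32 KB–64 KB, latency about 4 to 5 cycles (roughly 1 ns).
L2 cache : Either private per core or shared among a few cores. Size 256 KB–1 MB or more, latency on the order of 10 to 15 cycles (a few nanoseconds).
L3 cache : Usually shared by all cores on the die. Size 2 MB–32 MB, latency on the order of 30 to 50 cycles (roughly 10 ns or more). Exact figures vary by design and clock speed.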
These caches store recently accessed data and instructions, reducing the average memory access time.
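The effect on average memory access time can be estimated with a short calculation. The latencies and hit rates below are hypothetical round numbers in cycles, chosen only to make the arithmetic easy to follow; they are not measurements of any processor, and the function name is introduced for this sketch.

```python
def average_access_time(levels, memory_latency):
    """levels: list of (latency, hit_rate) pairs ordered from L1 outward."""
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for latency, hit_rate in levels:
        total += reach * latency   # every access reaching a level pays its latency
        reach *= 1.0 - hit_rate    # only the misses continue to the next level
    return total + reach * memory_latency


# Hypothetical values: (latency in cycles, hit rate) for L1, L2, L3.
levels = [(4, 0.95), (12, 0.80), (40, 0.50)]
print(f"{average_access_time(levels, 200):.2f}")  # 6.00
print(f"{average_access_time([], 200):.2f}")      # 200.00
```

With these assumed numbers, the hierarchy brings the average cost of a memory access from 200 cycles down to about 6, because most accesses are satisfied by L1.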
Operating Modes and System Calls
In the simplest model, a CPU runs at one of two privilege levels, selected by a bit in the PSW (some ISAs define additional levels):
Kernel mode : Full access to all instructions and hardware resources.
User mode : Restricted to a subset of instructions; direct I/O and privileged operations are prohibited.
User‑mode code requests privileged services via system calls. The CPU executes a trap instruction, switches to kernel mode, runs the kernel routine, and then returns to user mode with a return‑from‑trap instruction.
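From a program's point of view, a system call usually looks like an ordinary library function. The sketch below assumes CPython, where os.pipe, os.write, os.read and os.close are thin wrappers over the corresponding operating system calls; the trap into kernel mode and the return to user mode happen inside each call and are not visible in the code.

```python
import os

# Each os.* call below asks the kernel to act on the program's behalf.
read_fd, write_fd = os.pipe()            # create a kernel-managed pipe
written = os.write(write_fd, b"hello")   # kernel copies the bytes into the pipe
os.close(write_fd)
data = os.read(read_fd, 16)              # kernel copies the bytes back out
os.close(read_fd)

print(written, data)  # 5 b'hello'
```

The user-mode code never touches the pipe buffer directly; it only sees file descriptors and the results the kernel returns.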
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)