Why CPUs Reorder Instructions and How Memory Barriers Preserve Correctness

The article explains how compilers and modern CPUs reorder instructions at compile time and at run time, the role of store buffers and invalidate queues, why memory barriers are needed for visibility across cores, and how the ordering guarantees of x86 compare with those of ARM/Power architectures.

Liangxu Linux

May 19, 2023

Introduction

Modern high-level languages expose multithreading to the programmer, while the CPUs underneath are built from multiple cores and complex cache hierarchies. In Java, the JVM's JIT compiler may reorder the machine instructions it generates to exploit these features, aiming for maximum performance.

What Is Instruction Reordering?

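Instruction reordering can occur at two stages: first when the program is compiled (in Java, when bytecode is compiled to machine code), and again when the CPU executes it. Compilers and CPUs may both change the order of independent instructions to keep pipelines busy and improve parallelism, provided the result seen by a single thread is unchanged.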
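A minimal Java sketch of why this matters once a second thread is involved. The class and field names (MessagePassing, data, ready) are hypothetical and chosen only for illustration. The two stores in the writer are independent, so with two plain fields the JIT compiler or the CPU would be free to swap them; declaring the flag volatile forbids moving the data store below it.

```java
final class MessagePassing {
    private int data = 0;                   // plain field: an ordinary store
    private volatile boolean ready = false; // volatile: earlier stores may not be moved below this write

    // Runs one writer and one reader thread and returns the value the reader saw.
    static int run() {
        MessagePassing box = new MessagePassing();
        int[] seen = new int[1];

        Thread reader = new Thread(() -> {
            while (!box.ready) {
                Thread.onSpinWait();        // busy-wait until the writer publishes
            }
            seen[0] = box.data;             // ordered after the volatile read of ready
        });
        Thread writer = new Thread(() -> {
            box.data = 42;                  // (1) ordinary store
            box.ready = true;               // (2) volatile store: (1) cannot be reordered after it
        });

        reader.start();
        writer.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(run());          // prints 42
    }
}
```

If ready were a plain field, the Java memory model would no longer rule out the reader observing ready == true together with data == 0, and the spin loop might never terminate at all.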

Single‑Core vs Multi‑Core

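On a single core, reordering is invisible to the running program: the core reads its own pending writes back from its store buffer, so every load sees the latest value that core stored. On a multi-core system this no longer holds. A write on core 0, written W0(x,1), may still reside only in that core's store buffer, so a read on core 1 can return the old value, written R1(x,0). This can happen on both x86 and ARM/Power because the store buffer holds the write before it reaches cache or memory. ARM/Power may also have an invalidate queue (called an invalid queue in some texts) that delays the processing of cache-line invalidations, which widens the window in which another core sees stale data.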

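For x86 CPUs without an invalidate queue, an update becomes visible to other cores once it leaves the store buffer and reaches the cache. For CPUs with an invalidate queue, another core may keep reading its stale cached copy until it has processed the queued invalidation, so visibility is not guaranteed at that point.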

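To guarantee ordering and visibility across cores, programmers insert memory barriers such as the x86 mfence instruction, which makes the core's earlier stores globally visible before any later load or store executes.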
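The store-buffer effect is usually demonstrated with the "store buffering" litmus test, sketched here in Java under some assumptions. The class name StoreBufferingLitmus and its fields are hypothetical. Java has no mfence statement, so the sketch relies on volatile fields instead: the Java memory model requires volatile accesses to appear in a single total order, and a JVM on x86 typically enforces this with a full barrier (mfence or a lock-prefixed instruction) after each volatile store.

```java
final class StoreBufferingLitmus {
    private volatile int x = 0;
    private volatile int y = 0;
    private int r1;   // value of y observed by thread A
    private int r2;   // value of x observed by thread B

    // Runs the litmus test repeatedly and counts runs in which both threads read 0.
    static int countBothZero(int iterations) {
        int bothZero = 0;
        for (int i = 0; i < iterations; i++) {
            StoreBufferingLitmus t = new StoreBufferingLitmus();
            Thread a = new Thread(() -> { t.x = 1; t.r1 = t.y; }); // store x, then load y
            Thread b = new Thread(() -> { t.y = 1; t.r2 = t.x; }); // store y, then load x
            a.start();
            b.start();
            join(a);
            join(b);
            if (t.r1 == 0 && t.r2 == 0) {
                bothZero++;   // would mean each load overtook the other thread's store
            }
        }
        return bothZero;
    }

    private static void join(Thread thread) {
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("both zero: " + countBothZero(1000)); // prints "both zero: 0"
    }
}
```

With volatile fields the count is always 0. With plain fields the outcome r1 == 0 and r2 == 0 becomes legal, because each store can still be sitting in its core's store buffer when the other core loads. A loop this coarse will rarely catch it, though; a dedicated harness such as OpenJDK's jcstress is the usual tool for observing it.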

Out‑of‑Order Execution vs Sequential Commit

CPUs use pipelining to execute instructions in parallel, but instructions that depend on each other (through data, control, or address dependencies) must still respect their order. The CPU may:

Reorder independent instructions.

Perform branch prediction for control‑dependent code.

Prefetch memory reads.

Despite internal reordering, x86 CPUs provide Sequential Consistency on a single core: the externally visible results of instructions follow program order, even though internal execution may be out of order. Lamport's definition of sequential consistency [1], stated for multiprocessors, reads:

"... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." – Lamport

Thus, the CPU commits results to architectural registers in program order, preserving the illusion of sequential execution.
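The difference between finishing out of order and committing in order can be shown with a toy model. The sketch below is not how any particular CPU is built. It assumes that all instructions are independent, all start at cycle 0, and instruction i takes latencies[i] cycles; the class ReorderBufferSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class ReorderBufferSketch {
    // Order in which instructions finish executing (shortest latency first).
    static List<Integer> completionOrder(int[] latencies) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < latencies.length; i++) {
            order.add(i);
        }
        // List.sort is stable, so instructions with equal latency keep program order.
        order.sort((a, b) -> Integer.compare(latencies[a], latencies[b]));
        return order;
    }

    // Order in which results are committed: the oldest unretired instruction
    // retires only once it has finished, and younger ones wait behind it.
    static List<Integer> retireOrder(int[] latencies) {
        List<Integer> retired = new ArrayList<>();
        int head = 0; // index of the oldest instruction not yet retired
        for (int cycle = 1; head < latencies.length; cycle++) {
            while (head < latencies.length && latencies[head] <= cycle) {
                retired.add(head);
                head++;
            }
        }
        return retired;
    }

    public static void main(String[] args) {
        int[] latencies = {3, 1, 2};
        System.out.println("complete: " + completionOrder(latencies)); // complete: [1, 2, 0]
        System.out.println("retire:   " + retireOrder(latencies));     // retire:   [0, 1, 2]
    }
}
```

Instructions 1 and 2 finish before instruction 0, but none of them retires until instruction 0 is done, so the architectural state only ever changes in program order.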

Store Buffer & Invalidate Queue

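On x86, the store buffer behaves as a FIFO queue, so writes leave the buffer in program order. On ARM/Power, the store buffer is not FIFO, so a later write can reach the cache before an earlier one and other cores may observe the two writes out of order. A store barrier restores the order: stores issued before the barrier must drain from the buffer before stores issued after it.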
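The effect of drain order can be made concrete with a small simulation. This is a toy model rather than real hardware behaviour: it assumes a buffer holding exactly two stores in program order (data = 1, then flag = 1) and records every memory state another core could observe while the buffer drains. The class StoreBufferSketch and its method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class StoreBufferSketch {
    // Returns the memory states another core could observe while the buffer
    // drains its two stores (slot 0: data = 1, slot 1: flag = 1) in drainOrder.
    static Set<String> observableStates(int[] drainOrder) {
        int[] memory = {0, 0}; // memory[0] = data, memory[1] = flag
        Set<String> states = new LinkedHashSet<>();
        states.add(snapshot(memory));
        for (int slot : drainOrder) {
            memory[slot] = 1;  // one buffered store reaches the cache
            states.add(snapshot(memory));
        }
        return states;
    }

    private static String snapshot(int[] memory) {
        return "(data=" + memory[0] + ", flag=" + memory[1] + ")";
    }

    public static void main(String[] args) {
        System.out.println("FIFO:     " + observableStates(new int[]{0, 1}));
        System.out.println("non-FIFO: " + observableStates(new int[]{1, 0}));
    }
}
```

Only the non-FIFO drain order exposes the state (data=0, flag=1), where the flag is already set but the data it guards is not yet written. A store barrier between the two stores is what rules this state out on ARM/Power.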

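Some CPUs also have an invalidate queue that buffers incoming cache-line invalidation messages, so the core can acknowledge them immediately and apply them later. Until a queued invalidation is applied, the core can still read the stale line from its own cache. A load barrier makes the core process this queue before later loads execute, ensuring that stale data is not read.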
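The stale-read window can be sketched with another toy model. It reduces a core to a private cache plus a queue of pending invalidations and ignores everything else about cache coherence. All names (InvalidateQueueSketch, receiveInvalidate, loadBarrier) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class InvalidateQueueSketch {
    private final Map<String, Integer> memory;                        // shared memory
    private final Map<String, Integer> cache = new HashMap<>();       // this core's private cache
    private final Deque<String> invalidateQueue = new ArrayDeque<>(); // invalidations not yet applied

    InvalidateQueueSketch(Map<String, Integer> memory) {
        this.memory = memory;
    }

    // Another core wrote this address: the message is only queued, not applied.
    void receiveInvalidate(String address) {
        invalidateQueue.add(address);
    }

    // A cache hit returns the cached (possibly stale) value; a miss fetches from memory.
    int load(String address) {
        return cache.computeIfAbsent(address, memory::get);
    }

    // Load barrier: apply every queued invalidation before later loads run.
    void loadBarrier() {
        while (!invalidateQueue.isEmpty()) {
            cache.remove(invalidateQueue.poll());
        }
    }

    // Returns the three values read: before the remote write, after it, and after the barrier.
    static int[] demo() {
        Map<String, Integer> memory = new HashMap<>();
        memory.put("x", 0);
        InvalidateQueueSketch core = new InvalidateQueueSketch(memory);

        int first = core.load("x");  // 0, and x is now cached
        memory.put("x", 1);          // another core's write reaches memory
        core.receiveInvalidate("x"); // invalidation arrives but stays queued
        int stale = core.load("x");  // still 0: served from the stale cache line
        core.loadBarrier();          // queue processed, cache line dropped
        int fresh = core.load("x");  // 1: refetched from memory
        return new int[]{first, stale, fresh};
    }
}
```

In this model demo() returns {0, 0, 1}: the second load still sees the old value even though the new one is already in memory, and only the load after the barrier picks it up.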

x86 vs ARM/Power

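x86 guarantees Sequential Consistency on a single core and follows the x86-TSO model [2] on multiple cores: the only reordering other cores can observe is a load completing before an earlier store to a different address, because that store is still in the store buffer. An mfence between the store and the load closes this gap by making the store-buffer contents visible to other cores first. ARM/Power provide weaker consistency models [3], in which independent loads and stores may be observed out of order in any combination. Ordering must be requested with explicit barriers or carried by dependencies that the architecture respects, and not every dependency qualifies (a control dependency between two loads, for example, does not order them), which makes concurrent programming more error-prone.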
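The comparison can be summarised as a small lookup, sketched here in Java. It covers only independent accesses to different addresses with no barriers in between, and the names (ReorderingTable, Reordering, allowed) are hypothetical. STORE_LOAD means an earlier store followed by a later load, and so on.

```java
import java.util.EnumSet;
import java.util.Set;

final class ReorderingTable {
    // Each constant names a pair of accesses in program order: earlier_later.
    enum Reordering { LOAD_LOAD, LOAD_STORE, STORE_STORE, STORE_LOAD }

    // Reorderings that other cores may observe under the given memory model.
    static Set<Reordering> allowed(String model) {
        switch (model) {
            case "x86-TSO":
                // Only a later load may overtake an earlier, still-buffered store.
                return EnumSet.of(Reordering.STORE_LOAD);
            case "ARM":
            case "Power":
                // Without barriers or dependencies, all four may be observed.
                return EnumSet.allOf(Reordering.class);
            default:
                throw new IllegalArgumentException("unknown model: " + model);
        }
    }

    public static void main(String[] args) {
        System.out.println("x86-TSO: " + allowed("x86-TSO")); // x86-TSO: [STORE_LOAD]
        System.out.println("ARM:     " + allowed("ARM"));
    }
}
```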

Conclusion

Understanding CPU-level reordering, store buffers, invalidate queues, and memory-barrier instructions is essential for writing correct multithreaded programs, especially on architectures with relaxed memory models.

References

[1] Lamport, Leslie. "How to make a multiprocessor computer that correctly executes multiprocess programs." IEEE Transactions on Computers 9 (1979): 690‑691.

[2] Sewell, Peter, et al. "x86‑TSO: a rigorous and usable programmer's model for x86 multiprocessors." Communications of the ACM 53.7 (2010): 89‑97.

[3] Maranget, Luc, Susmit Sarkar, and Peter Sewell. "A tutorial introduction to the ARM and POWER relaxed memory models." Draft available from http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf (2012).
