Mastering SMP Multi-Core Out-of-Order Execution to Grasp Linux Concurrency
This article deeply dissects the hardware origins of SMP multi‑core out‑of‑order execution, explains four classic memory‑reordering scenarios, and shows how Linux kernel memory barriers constrain the chaos, enabling developers to reliably reason about and fix complex multi‑core concurrency bugs.
SMP Multi‑Core Architecture
Symmetric Multi‑Processing (SMP) is a widely used multi‑processor design where each core has equal status and shares a common memory subsystem and bus, much like a team of equally capable workers sharing a warehouse and a busy transport lane. Modern CPUs such as 4‑core Cortex‑A72 or 8‑core x86 chips embody this architecture.
Because all cores access the same memory, cache‑coherency becomes critical. The MESI protocol acts as a strict warehouse manager: when a core modifies a cache line, it invalidates the copies in other cores so that every core always sees a consistent view of memory.
Multi‑Core and Concurrency Programming
In a single‑core world, a program executes instructions in the order written, and concurrency is simulated by time‑slicing. With multiple cores, each core can run its own thread simultaneously, delivering true parallelism and dramatically increasing throughput. However, this parallelism introduces new challenges such as cache‑coherency violations, where one core may read stale data that another core has already updated.
For example, a simple web crawler on a single core fetches pages sequentially, while on a multi‑core system multiple threads can fetch pages in parallel, shortening total execution time but also exposing the above consistency problems.
Why Out‑of‑Order Execution Happens
Compiler Reordering
During compilation, the compiler may reorder independent instructions to improve performance, provided the single‑threaded program’s final result remains unchanged. For instance, given:
int a = 1;
int b = 2;
int c = a + b;the compiler can swap the first two assignments because they have no data dependency. A more complex example shows the compiler moving a computation ahead of unrelated operations when it detects no dependencies.
The compiler must never break single‑threaded semantics, but in a multi‑threaded context the reordered instructions can cause data races and undefined behavior.
CPU Hardware Reordering
Modern CPUs contain multiple execution units (ALU, FPU, etc.) that can process independent instructions in parallel. When an instruction stalls (e.g., waiting for a memory load), the scheduler looks ahead for ready instructions and executes them out of program order. Consider the following instruction stream:
1. LOAD R1, [A] ; long latency load
2. ADD R2, R1, 5 ; depends on R1
3. MUL R3, R4, R5; independentIn order execution, the CPU would wait for the LOAD to finish before issuing the MUL, leaving the multiplier idle. In out‑of‑order execution, the CPU can issue the MUL immediately, improving resource utilization.
Additional sources of out‑of‑order behavior include superscalar pipelines, branch prediction, and cache‑miss handling. Correct predictions keep the pipeline full; mispredictions require rollback, which also appears as out‑of‑order execution.
Cache Mechanisms (Store Buffer & Invalidate Queue)
Each core has a Store Buffer that temporarily holds write operations before they become visible to other cores. This improves throughput but can delay visibility, leading to scenarios where a consumer thread reads an old value because the producer’s write is still in the Store Buffer.
The Invalidate Queue processes cache‑invalidation messages. If a core reads a variable before the invalidation message is handled, it may see stale data.
Illustrative code demonstrates the problem:
#include <atomic>
#include <thread>
#include <iostream>
std::atomic<bool> flag(false);
int data = 0;
void producer() {
data = 42; // write 1
flag.store(true); // write 2
}
void consumer() {
while (!flag.load()); // spin until flag true
std::cout << "data: " << data << std::endl;
}
int main() {
std::thread t1(producer);
std::thread t2(consumer);
t1.join();
t2.join();
return 0;
}Because the Store Buffer may hold the write to flag, the consumer can spin forever or read data before the write to data becomes visible, producing inconsistent results.
Memory Barriers: The Core Solution
What Is a Memory Barrier?
A memory barrier is a special instruction that prevents the CPU and compiler from reordering memory operations across the barrier, guaranteeing that all operations before the barrier are globally visible before any operation after the barrier proceeds. This restores a predictable order for shared‑memory accesses in multi‑threaded code.
Types of Barriers
Full memory barrier – forces all prior reads and writes to complete before any subsequent reads or writes.
Read memory barrier – only guarantees that prior reads are completed before later reads.
Write memory barrier – only guarantees that prior writes are completed before later writes.
In the Linux kernel, architecture‑specific macros hide the underlying instructions. For x86, the header arch/x86/include/asm/barrier.h defines:
#define mb() asm volatile("mfence" ::: "memory")
#define rmb() asm volatile("lfence" ::: "memory")
#define wmb() asm volatile("sfence" ::: "memory")Linux Barrier Examples
smp_rmb (read barrier) ensures that a flag read completes before the subsequent data read:
#include <linux/atomic.h>
#include <linux/kernel.h>
atomic_t flag = ATOMIC_INIT(0);
int data = 0;
void reader(void) {
while (!atomic_read(&flag))
cpu_relax();
smp_rmb();
printk(KERN_INFO "data = %d
", data);
}smp_wmb (write barrier) guarantees that data is written before the flag is set:
#include <linux/atomic.h>
#include <linux/kernel.h>
atomic_t flag = ATOMIC_INIT(0);
int data = 0;
void writer(void) {
data = 42;
smp_wmb();
atomic_set(&flag, 1);
}smp_mb (full barrier) orders two atomic updates:
#include <linux/atomic.h>
#include <linux/kernel.h>
atomic_t x = ATOMIC_INIT(0);
atomic_t y = ATOMIC_INIT(0);
void update_variables(void) {
atomic_set(&x, 1);
smp_mb();
atomic_set(&y, 2);
}Barrier Implementations on Different Architectures
x86 has a relatively strong memory model; most reordering is already prohibited, so explicit barriers are rarely needed except for Store‑Load cases. MFENCE, LFENCE, and SFENCE provide full, read, and write barriers respectively, and any instruction with the lock prefix acts as a full barrier.
ARM64 uses a weak memory model, allowing many reorderings. Developers must frequently insert barriers such as DMB (Data Memory Barrier), DSB (Data Synchronization Barrier), and ISB (Instruction Synchronization Barrier) to enforce ordering. Example for ARM64:
volatile int shared_data;
// Thread 1
shared_data = 42;
asm volatile ("dmb ish" ::: "memory");
// Thread 2
asm volatile ("dmb ish" ::: "memory");
int value = shared_data;Practical Kernel Use Cases
Scheduler Wake‑Up Path
When a task is woken, the kernel must make the state change visible to all CPUs before the task is placed on a run‑queue:
void wake_up_task(struct task_struct *task) {
task->state = TASK_RUNNING; // set state
smp_wmb(); // ensure visibility
enqueue_task(task); // add to run‑queue
}
struct task_struct *pick_next_task(void) {
struct task_struct *task = dequeue_task();
if (task) {
smp_rmb(); // ensure we see latest state
if (task->state == TASK_RUNNING)
return task;
}
return NULL;
}Without the write barrier, another CPU might not see the updated state and could skip scheduling the task.
Ring Buffer with Barriers
A lock‑free ring buffer uses barriers to protect the write and read pointers:
#define BUFFER_SIZE 1024
typedef struct {
char buffer[BUFFER_SIZE];
volatile int write_index;
volatile int read_index;
} RingBuffer;
void write_to_ring_buffer(RingBuffer *rb, const char *data, int len) {
for (int i = 0; i < len; i++) {
rb->buffer[rb->write_index] = data[i];
smp_wmb(); // ensure data written before pointer update
rb->write_index = (rb->write_index + 1) % BUFFER_SIZE;
}
}
void read_from_ring_buffer(RingBuffer *rb, char *data, int len) {
for (int i = 0; i < len; i++) {
smp_rmb(); // ensure we see latest write pointer
data[i] = rb->buffer[rb->read_index];
rb->read_index = (rb->read_index + 1) % BUFFER_SIZE;
}
}The write barrier prevents the pointer from advancing before the data is stored; the read barrier guarantees the consumer sees the most recent write pointer.
Correct Barrier Usage Principles
Pairing
Read and write barriers must be paired across threads. Using a read barrier without a matching write barrier provides no guarantee of visibility, as shown in the simple shared_variable example.
Minimization
Barriers serialize execution and hurt performance, so they should be used only where shared data actually requires ordering. Purely local code (e.g., a function computing a value in a single thread) never needs a barrier.
Architecture Awareness
Because x86 already enforces many orderings, developers can often omit explicit barriers there, while on ARM64 they are essential for correctness.
Documentation
Every barrier insertion should be accompanied by a comment explaining why it is needed and which counterpart it pairs with, making the code maintainable in large projects.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Deepin Linux
Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
