
Why Memory Barriers Are Essential for Multithreaded Programming

Memory barriers act like traffic signals for concurrent threads: they constrain the order in which memory operations are performed and become visible to other cores, so that compiler and CPU reordering cannot break code that shares data across multi-core and multi-processor systems.

Deepin Linux

Sep 6, 2025


In multithreaded programming, memory access resembles busy traffic: many threads read and write shared memory like vehicles on a road. Without memory barriers, which play the role of traffic lights, the accesses of one thread may become visible to other threads in an unexpected order, causing inconsistent data and bugs that are hard to reproduce.

Memory barriers provide clear rules and ordering for memory accesses, ensuring that operations occur in the intended sequence, much like vehicles obeying traffic signals, which is crucial for the correctness and stability of multithreaded programs.

1. The "Out‑of‑Order" Problem of Memory Access

1.1 The CPU's Own Scheduling: Out‑of‑Order Execution

Modern CPUs use out‑of‑order execution to improve efficiency. In strict in‑order execution, instructions run sequentially; if an instruction stalls on a long‑latency operation such as a memory access, the CPU idles. Out‑of‑order execution allows the CPU to execute independent instructions while waiting, maximizing resource utilization.

int a = 5;
int b = 3;

Because the two assignments have no data dependency, their execution order does not affect the final result. Out‑of‑order execution breaks the sequential restriction, letting the CPU schedule other ready instructions while some operands are still pending.

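The CPU contains a scheduler that identifies independent instructions, issues them out of program order to maximize throughput, and employs register renaming to remove false dependencies. Branch prediction keeps the pipeline supplied with instructions, and results are retired in program order, so a single thread never notices the reordering. Other cores that observe the same memory, however, can.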

1.2 Cache‑Induced Data‑Consistency Issues

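Caches bridge the speed gap between CPU and memory. On a typical multi-core design each core has private L1 and L2 caches, and the cores share an L3 cache. When a core reads data, it first checks its own cache. If the line is present (a cache hit) the read is served from there; otherwise (a cache miss) the line is fetched from a lower cache level or from memory and stored in the core's cache.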

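In multi-core systems, several cores can hold copies of the same cache line. If one core updated its copy while another kept reading an old one, the cores would disagree about the data. Hardware cache-coherence protocols such as MESI, typically implemented with bus snooping or a directory, prevent this by invalidating or updating the other copies. Coherence alone does not guarantee ordering, though: store buffers and invalidate queues can delay the moment a write becomes visible, so two writes may be observed in a different order than they were issued.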

1.3 Compiler Optimization Pitfalls

Compilers may reorder instructions for performance, provided single‑threaded semantics are preserved. Consider the following code:

int a = 0;
int b = 0;
// Thread 1
a = 1;
b = 2;
// Thread 2
if (b == 2) {
    assert(a == 1);
}

In a single‑threaded context the order does not matter, so the compiler may emit:

b = 2;
a = 1;

In a multithreaded environment this reordering can cause the assertion to fail because Thread 2 may observe b == 2 before Thread 1’s write to a becomes visible. Memory barriers prevent such reordering.
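In portable user-space C, the same ordering can be requested with C11 atomics instead of kernel barriers. The following is a minimal sketch, assuming POSIX threads are available; the function run_publish_demo and the spin loop in Thread 2 are illustrative additions, not part of the original example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int a;          /* plain data, published through b */
static atomic_int b;   /* flag that publishes a */

static void *thread1(void *arg) {
    (void)arg;
    a = 1;                                               /* data write */
    /* Release store: the write to a cannot be reordered after it. */
    atomic_store_explicit(&b, 2, memory_order_release);
    return NULL;
}

static void *thread2(void *arg) {
    int *saw_a = arg;
    /* Acquire load: spin until the flag written by thread1 is observed. */
    while (atomic_load_explicit(&b, memory_order_acquire) != 2) { }
    *saw_a = a;   /* ordered after the acquire load, so this reads 1 */
    return NULL;
}

/* Hypothetical driver: returns the value of a that thread 2 observed. */
int run_publish_demo(void) {
    pthread_t t1, t2;
    int saw_a = -1;
    a = 0;
    atomic_store(&b, 0);
    pthread_create(&t2, NULL, thread2, &saw_a);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return saw_a;
}
```

Because the store to b is a release and the load of b is an acquire, neither the compiler nor the CPU may let Thread 2 see b == 2 without also seeing a == 1.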

2. Linux Memory Barriers in Detail

2.1 What Is a Memory Barrier?

A memory barrier (or memory fence) is a synchronization primitive that forces all memory reads and writes issued before the barrier to complete before any issued after the barrier can proceed. It prevents both the compiler and the processor from reordering memory operations across the barrier.

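How this is implemented varies by architecture. Typically the barrier instruction makes the core wait until earlier loads have completed and earlier stores have drained from the store buffer before it performs any later memory operation.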

Full memory barrier (mb) guarantees that all earlier reads/writes are committed before any later reads/writes.

Read memory barrier (rmb) only orders reads.

Write memory barrier (wmb) only orders writes.

In the Linux kernel (x86 example) these are defined as:

#define mb()    asm volatile("mfence" ::: "memory")
#define rmb()   asm volatile("lfence" ::: "memory")
#define wmb()   asm volatile("sfence" ::: "memory")

Conceptually, a barrier is like boiling water before brewing tea: the boiling step must finish before brewing can begin, guaranteeing a correct final result.
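The kernel macros above are not available in user space. A rough C11 analog can be sketched with atomic_thread_fence. The names usr_mb, usr_rmb, usr_wmb, publish, and try_consume below are hypothetical, and the mapping is an approximation: C11 fences only order atomic operations, and an acquire or release fence is somewhat stronger than a pure read or write barrier.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space stand-ins; these are not the kernel's definitions. */
static inline void usr_mb(void)  { atomic_thread_fence(memory_order_seq_cst); }
static inline void usr_rmb(void) { atomic_thread_fence(memory_order_acquire); }
static inline void usr_wmb(void) { atomic_thread_fence(memory_order_release); }

static atomic_int payload;
static atomic_int ready;

/* Writer side: the payload store is ordered before the flag store. */
void publish(int value) {
    atomic_store_explicit(&payload, value, memory_order_relaxed);
    usr_wmb();
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader side: returns 1 and fills *out if a value was published, else 0. */
int try_consume(int *out) {
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return 0;
    usr_rmb();   /* the flag load is ordered before the payload load */
    *out = atomic_load_explicit(&payload, memory_order_relaxed);
    return 1;
}
```

The write-side and read-side fences work as a pair: dropping either one removes the guarantee that a reader who sees ready set also sees the payload.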

2.2 Why Do We Need Memory Barriers?

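Modern CPUs employ multiple cache levels, store buffers, and out-of-order pipelines. Without barriers, a write performed by one core may sit in its store buffer for a while, and other cores may observe that core's writes in a different order than the program issued them. A barrier does not flush data to main memory. It makes the core wait until its earlier memory operations have been performed (for example, until the store buffer has drained into the coherent cache) before later ones proceed, so other cores see the operations in the intended order.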

In SMP systems each core has its own cache; when a core writes, other cores must either invalidate their cached copies or obtain the updated line, which is coordinated by the cache-coherence protocol. Memory barriers do not replace that protocol. They control the order in which a core's own loads and stores take part in it.

2.3 Types and Effects of Barriers

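Read Barrier (Load Barrier) ensures that any read after the barrier cannot be reordered before reads preceding it. It belongs on the reading side, between the two loads, and pairs with a write barrier on the writing side.

// Global variables
int a = 0;
int b = 0;
// Thread 1
a = 1;
wmb(); // write barrier on the writer side
b = 2;
// Thread 2
if (b == 2) {
    rmb(); // read barrier: the load of b is ordered before the load of a
    assert(a == 1);
}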

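Write Barrier (Store Barrier) guarantees that writes before the barrier become visible before any later writes. It must be paired with a read barrier on the reading side; otherwise a weakly ordered CPU may still satisfy the read of a before the read of b.

// Thread 1
a = 1;
wmb(); // write barrier: the store to a is visible before the store to b
b = 2;
// Thread 2
if (b == 2) {
    rmb(); // paired read barrier
    assert(a == 1);
}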

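Full Barrier (mb) combines both read and write ordering, preventing any load or store from being reordered across it. It is the strongest and most expensive of the three, and it also orders stores against later loads, which neither rmb() nor wmb() does.

// Thread 1
a = 1;
mb(); // full barrier
b = 2;
// Thread 2
if (b == 2) {
    mb(); // paired barrier on the reader side (rmb() would suffice here)
    assert(a == 1);
}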

3. Core Principles Behind Memory Barriers

3.1 Compiler Optimization and Optimization Barriers

Compilers may reorder independent instructions. The Linux kernel provides barrier() to inhibit such reordering:

#define barrier() __asm__ __volatile__("" ::: "memory")

This empty volatile assembly tells the compiler that memory may change, preventing movement of code across the macro.
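A small user-space sketch shows the two typical effects of the macro above: keeping two stores in source order, and forcing a variable to be re-read on every loop iteration. It assumes GCC or Clang (the inline-assembly syntax is compiler specific); the function names are illustrative. Note that barrier() constrains only the compiler, so on its own it is not enough to order accesses between CPUs.

```c
#include <assert.h>

/* Same definition as the kernel macro; requires GCC or Clang. */
#define barrier() __asm__ __volatile__("" ::: "memory")

static int data;
static int flag;

void publish_in_order(int value) {
    data = value;
    barrier();   /* the compiler may not move the store to flag above this point */
    flag = 1;
}

/* Polls flag up to max_polls times; returns how many polls saw it unset. */
int poll_flag(int max_polls) {
    int polls = 0;
    while (polls < max_polls && !flag) {
        barrier();   /* forces flag to be reloaded from memory each iteration */
        ++polls;
    }
    return polls;
}

int last_published(void) {
    return data;
}
```

Without the barrier in poll_flag, an optimizing compiler would be free to read flag once and reuse the value for the whole loop.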

3.2 CPU Execution Optimizations and Barriers

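Out-of-order CPUs fetch a window of instructions, execute independent ones in parallel, and retire results in program order. Retirement order is not the same as visibility order, however: a retired store can still wait in the store buffer while a later load has already read from the cache. Barriers insert special instructions (e.g., mfence, lfence, sfence) that hold back later memory operations until the earlier ones covered by the barrier have been performed.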
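The classic case that needs a full barrier is the "store buffering" pattern, where each of two threads writes one variable and then reads the other. Below is a minimal litmus-test sketch using C11 fences and POSIX threads; the names cpu0, cpu1, and count_both_zero are hypothetical. With a sequentially consistent fence between the store and the load in each thread, the outcome in which both loads return 0 is forbidden.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int x, y;
static int r1, r2;   /* results, read only after both threads are joined */

static void *cpu0(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    /* Full fence: on x86 this typically compiles to mfence or a locked instruction. */
    atomic_thread_fence(memory_order_seq_cst);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *cpu1(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

/* Runs the litmus test `rounds` times; returns how often both loads saw 0. */
int count_both_zero(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        pthread_t t0, t1;
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        if (r1 == 0 && r2 == 0)
            ++both_zero;
    }
    return both_zero;
}
```

Removing the two fences makes the both-zero outcome legal, and it can occur even on x86, because a store may still be in the store buffer when the following load executes.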

3.3 How a Barrier Works

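// Shared variables
int shared_variable1 = 0;
int shared_variable2 = 0;
// Thread 1
shared_variable1 = 1;   // A
memory_barrier();       // barrier
shared_variable2 = 2;   // B
// Thread 2
if (shared_variable2 == 2) { // C
    memory_barrier();        // paired barrier: C is ordered before D
    assert(shared_variable1 == 1); // D
}

The barrier in Thread 1 ensures that A becomes visible to other processors before B. The barrier in Thread 2 ensures that the read in D is not satisfied before the read in C. Together they guarantee that if Thread 2 sees shared_variable2 == 2, it also sees shared_variable1 == 1. A barrier on only one side is not sufficient on weakly ordered hardware.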

4. Practical Applications of Memory Barriers

4.1 Multithreaded Data Sharing

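#include <pthread.h>
#include <stdatomic.h>

atomic_int shared_variable = 0;

void* writer(void* arg) {
    (void)arg;
    for (int i = 0; i < 1000000; ++i) {
        // Release fence goes before the store it protects.
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&shared_variable, i, memory_order_relaxed);
    }
    return NULL;
}

void* reader(void* arg) {
    (void)arg;
    for (int i = 0; i < 1000000; ++i) {
        int value = atomic_load_explicit(&shared_variable, memory_order_relaxed);
        // Acquire fence goes after the load it protects.
        atomic_thread_fence(memory_order_acquire);
        (void)value;
    }
    return NULL;
}

int main(void) {
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}

The release fence orders everything the writer did before it ahead of the store to shared_variable, and the acquire fence orders the reader's later accesses after its load. When the reader observes a given value, it therefore also observes every write the writer made before storing that value. The fences do not make the reader wait for the writer; they only constrain ordering.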

4.2 Double‑Checked Locking (DCL)

An incorrect DCL implementation can suffer from reordering of the allocation, construction, and pointer store steps, leading other threads to see a partially constructed object. The correct C++11 version uses std::atomic with acquire/release semantics to guarantee ordering and visibility.

#include <mutex>
#include <atomic>

class Singleton {
private:
    static std::atomic<Singleton*> instance;
    static std::mutex mtx;
    Singleton() {}
public:
    static Singleton* getInstance() {
        Singleton* tmp = instance.load(std::memory_order_acquire);
        if (!tmp) {
            std::lock_guard<std::mutex> lock(mtx);
            tmp = instance.load(std::memory_order_relaxed);
            if (!tmp) {
                tmp = new Singleton();
                instance.store(tmp, std::memory_order_release);
            }
        }
        return tmp;
    }
};

std::atomic<Singleton*> Singleton::instance(nullptr);
std::mutex Singleton::mtx;

4.3 Cache Coherence

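// Shared: int x = 0, ready = 0;
// Processor A
x = 1;
mb(); // order the store to x before the store to ready
WRITE_ONCE(ready, 1);
// Processor B
while (!READ_ONCE(ready))
    cpu_relax();
mb(); // order the load of ready before the load of x
assert(x == 1);

Hardware coherence already keeps the caches consistent, and the barriers do not flush anything to main memory. What they add is ordering: A's store to x becomes visible before its store to ready, and B does not read x until it has seen ready set, so the assertion holds. Without the ready flag, nothing would tell B that A has run at all, and the assertion could fail regardless of barriers.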

5. Usage Guidelines and Performance Considerations

5.1 Avoid Overuse

Barriers inhibit compiler and CPU optimizations; unnecessary barriers in single‑threaded code or in sections without data races degrade performance.

5.2 Choose the Appropriate Barrier Type

Use read barriers when only read ordering matters, write barriers for write ordering, and full barriers when both are required. Selecting the minimal sufficient barrier reduces overhead.
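In C11 terms, the weakest sufficient choice is sometimes no ordering at all. The sketch below, which assumes POSIX threads and uses hypothetical names (worker, count_hits), counts events from several threads with memory_order_relaxed. The counter needs atomicity but publishes no other data, so no barrier is required; pthread_join provides the synchronization for the final read.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define WORKERS    4
#define INCREMENTS 100000

static atomic_long hits;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS; ++i) {
        /* Relaxed: atomic increment with no ordering constraints. */
        atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
    }
    return NULL;
}

/* Starts WORKERS threads and returns the total number of increments. */
long count_hits(void) {
    pthread_t t[WORKERS];
    atomic_store(&hits, 0);
    for (int i = 0; i < WORKERS; ++i)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < WORKERS; ++i)
        pthread_join(t[i], NULL);
    return atomic_load(&hits);
}
```

If the counter were used to signal that other data is ready, relaxed ordering would no longer be enough and a release/acquire pair would be needed.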

5.3 Performance Monitoring and Tuning

Tools such as perf can profile the impact of barriers. Identify hot paths with many barriers and consider refactoring or replacing full barriers with lighter‑weight read/write barriers where possible.
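Before reaching for a profiler, a rough micro-benchmark can give a first impression of what a full fence costs on a given machine. The sketch below is illustrative only: ns_per_store is a hypothetical helper, it measures CPU time with the standard clock() function, and the numbers it produces depend heavily on hardware, compiler flags, and system load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <time.h>

static atomic_long sink;

/* Average CPU nanoseconds per relaxed store, optionally followed by a
   full fence. iters must be greater than zero. */
double ns_per_store(int with_fence, long iters) {
    clock_t c0 = clock();
    for (long i = 0; i < iters; ++i) {
        atomic_store_explicit(&sink, i, memory_order_relaxed);
        if (with_fence)
            atomic_thread_fence(memory_order_seq_cst);
    }
    clock_t c1 = clock();
    double secs = (double)(c1 - c0) / (double)CLOCKS_PER_SEC;
    return secs * 1e9 / (double)iters;
}
```

Calling ns_per_store(0, n) and ns_per_store(1, n) with the same large n and comparing the two results shows the relative cost; running the resulting binary under perf stat gives a more detailed picture.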


