Understanding Linux Memory Barriers: Types, Usage, and Implementation
This article provides a comprehensive overview of Linux memory barriers, explaining why they are needed for correct ordering of memory operations on modern multi‑core CPUs, describing the different barrier types (read, write, full), their implementation in the kernel and Java, and illustrating their use in synchronization primitives and lock‑free data structures with code examples.
1. Overview of Memory Barriers
Modern CPUs execute instructions out of order to improve performance, which can break the expected order of memory accesses in multithreaded programs. A memory barrier (or fence) is a special instruction that creates a synchronization point, ensuring that all memory operations before the barrier are completed before any operation after the barrier begins.
Types of Barriers
Linux provides three basic barrier macros:
#define mb() asm volatile("mfence":::"memory")
#define rmb() asm volatile("lfence":::"memory")
#define wmb() asm volatile("sfence" ::: "memory")Full barrier (mb) enforces ordering of both loads and stores; read barrier (rmb) orders loads; write barrier (wmb) orders stores.
2. Memory Barriers in Different Contexts
In Java, the volatile keyword causes the compiler to emit a StoreStore barrier before each volatile write and a StoreLoad barrier after it, and LoadLoad and LoadStore barriers after each volatile read, giving volatile variables lock‑like visibility guarantees.
For final fields, the compiler inserts a StoreStore barrier between the final‑field stores and the constructor's return, and the reader side orders the load of the field after the load of the object reference, ensuring the field is fully initialized before any other thread can see it.
3. Compiler vs. Runtime Reordering
Compilers may reorder instructions for optimization. For example, compiling with -O2 can change the order of assignments:
int x, y, r;

void f(void)
{
    x = r;
    y = 1;
}

// Optimized with -O2: the store to y now precedes the store to x
movl r(%rip), %eax
movl $1, y(%rip)
movl %eax, x(%rip)

To prevent such reordering, a compiler barrier #define barrier() __asm__ __volatile__("" ::: "memory") can be used.
4. Runtime Reordering on SMP Systems
In multi‑core (SMP) systems each CPU has its own cache. Without barriers, a write performed by one CPU may stay in its cache and be invisible to another CPU, leading to stale reads. Cache‑coherency protocols (MESI) and mechanisms such as store buffers and invalidate queues are used, but they still require explicit barriers to guarantee ordering.
Store Buffer Example
When a CPU writes, the value may first go into a store buffer and only later be committed to the cache after an invalidate handshake with other CPUs.
Invalidate Queue Example
Invalidate queues allow a CPU to acknowledge an invalidate request immediately, deferring the actual cache line invalidation, which reduces stalls caused by full store‑buffer flushes.
5. Linux Kernel Barrier Macros
The kernel defines both compiler and CPU barriers. For SMP builds the macros expand to the hardware fences; for UP builds they collapse to the compiler barrier:
#define barrier() __asm__ __volatile__("": : :"memory")
#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
#define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
#define smp_wmb() wmb()
#else
#define smp_mb() barrier()
#define smp_rmb() barrier()
#define smp_wmb() barrier()
#endif

6. Practical Use Cases
Barriers are essential in SMP synchronization, spinlocks, atomic operations, and memory‑consistency models (Sequential Consistency, Total Store Order, etc.). They ensure that writes become visible before subsequent reads, preventing subtle bugs such as the classic "store‑load" reordering where two threads can observe stale values.
7. Example: Lock‑Free Ring Buffer (kfifo)
The Linux kernel implements a lock‑free FIFO using barriers:
unsigned int __kfifo_put(struct kfifo *fifo,
                         const unsigned char *buffer, unsigned int len)
{
    unsigned int l;

    len = min(len, fifo->size - fifo->in + fifo->out);

    /* ensure fifo->out is sampled before we start writing data */
    smp_mb();

    l = min(len, fifo->size - (fifo->in & (fifo->size - 1)));
    memcpy(fifo->buffer + (fifo->in & (fifo->size - 1)), buffer, l);
    memcpy(fifo->buffer, buffer + l, len - l);

    /* ensure the data is written before updating fifo->in */
    smp_wmb();

    fifo->in += len;
    return len;
}

unsigned int __kfifo_get(struct kfifo *fifo,
                         unsigned char *buffer, unsigned int len)
{
    unsigned int l;

    len = min(len, fifo->in - fifo->out);

    /* ensure fifo->in is sampled before we start reading data */
    smp_rmb();

    l = min(len, fifo->size - (fifo->out & (fifo->size - 1)));
    memcpy(buffer, fifo->buffer + (fifo->out & (fifo->size - 1)), l);
    memcpy(buffer + l, fifo->buffer, len - l);

    /* ensure the data is read before updating fifo->out */
    smp_mb();

    fifo->out += len;
    return len;
}

Here smp_mb(), smp_wmb(), and smp_rmb() guarantee the correct ordering of reads, writes, and index updates between a single producer and a single consumer.
8. When to Consider Barriers
Any code that performs raw loads/stores on shared data without using higher‑level atomic primitives (e.g., GCC built‑ins, libatomic) must explicitly insert the appropriate memory barriers to avoid reordering bugs on SMP systems.
Deepin Linux
Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems, and the Linux kernel.