
Why CPU Cache Matters: From Memory Access to False Sharing and Linux Scheduling

This article explains CPU architecture, cache hierarchy, the concept of cache lines, how false sharing degrades performance, mitigation techniques like cache‑line alignment, and the Linux scheduler's task prioritisation, virtual runtime, and run‑queue mechanisms for fair and real‑time execution.

Liangxu Linux

Dec 24, 2020


CPU Cache Hierarchy and Cache Lines

Modern CPUs contain per‑core L1 caches (split into a data cache, dCache, and an instruction cache, iCache) and per‑core L2 caches, with an L3 cache shared between cores. Beyond the CPU sit main memory and storage, forming a hierarchy in which capacity grows and latency increases at each level down. Access to L1 cache is roughly 100× faster than access to main memory. Data moves between memory and cache in fixed‑size units called cache lines (typically 64 bytes), never as single bytes or words. On Linux the cache‑line size can be queried, e.g. with getconf LEVEL1_DCACHE_LINESIZE.
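The same query can be made from a program. The sketch below assumes glibc, where sysconf accepts the _SC_LEVEL1_DCACHE_LINESIZE extension; the helper name dcache_line_size and the 64‑byte fallback are illustrative choices, not part of any standard API.

```c
#include <assert.h>
#include <unistd.h>

/* Assumed fallback for systems that cannot report the line size;
 * 64 bytes matches the typical value mentioned in the text. */
#define FALLBACK_LINE_SIZE 64L

static_assert(FALLBACK_LINE_SIZE > 0, "fallback must be positive");

/* Returns the L1 data-cache line size in bytes (hypothetical helper). */
long dcache_line_size(void)
{
    long size = -1;
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    /* glibc extension; may return 0 or -1 when the value is unknown. */
    size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif
    return size > 0 ? size : FALLBACK_LINE_SIZE;
}
```

Code that pads or aligns data usually needs the line size at compile time, so a run‑time query like this is mostly useful for diagnostics or for sizing dynamically allocated buffers.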

When a program walks an array in memory order, neighbouring elements sit on the same cache line, so a single miss brings in several elements and the accesses that follow hit in cache. Independent scalar variables, by contrast, can suffer from false sharing if they happen to land on the same line.
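The effect of traversal order can be sketched with a two‑dimensional array. Both functions below compute the same sum; only the access pattern differs. The matrix dimensions and function names are arbitrary choices for illustration, and the sketch makes no claim about the size of the speed difference on any particular machine.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 512
#define COLS 512

/* One row is larger than an assumed 64-byte line, so a column step
 * always moves to a different cache line. */
static_assert(COLS * sizeof(int) >= 64, "row stride spans at least one line");

/* Row-major walk: the inner loop touches adjacent addresses, so each
 * cache line fetched from memory is fully used before moving on. */
long sum_row_major(int m[ROWS][COLS])
{
    long sum = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            sum += m[r][c];
    return sum;
}

/* Column-major walk: consecutive accesses are COLS * sizeof(int) bytes
 * apart, so each access lands on a different cache line. */
long sum_col_major(int m[ROWS][COLS])
{
    long sum = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            sum += m[r][c];
    return sum;
}
```

Timing the two functions on a matrix larger than the cache is a simple way to observe the hit‑rate difference described above.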

False Sharing

Consider two threads on a dual‑core system, each repeatedly writing its own long variable, A and B, which are adjacent in memory. If a 64‑byte cache line begins at A, it holds both variables. Under the MESI coherence protocol a core must hold the line exclusively before it can write, so each write by one core invalidates the other core's copy and the line bounces between the cores: load → invalidate → write‑back → reload on every modification. The threads never touch each other's data, yet the cache provides almost no benefit and performance degrades.
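Whether two variables can interfere in this way comes down to address arithmetic. The sketch below assumes a 64‑byte line; struct counters and on_same_line are hypothetical names used only to show the layout described above, and the code does not itself measure any slowdown.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size, as in the text */

/* A and B declared back to back; the struct starts on a line boundary. */
struct counters {
    alignas(CACHE_LINE) long a;  /* written by thread 1 */
    long b;                      /* written by thread 2 */
};

/* b lies inside the first line of the struct, next to a. */
static_assert(offsetof(struct counters, b) + sizeof(long) <= CACHE_LINE,
              "a and b share the first cache line");

/* True when both addresses fall inside the same CACHE_LINE-sized block. */
int on_same_line(const void *p, const void *q)
{
    return (uintptr_t)p / CACHE_LINE == (uintptr_t)q / CACHE_LINE;
}
```

For any struct counters object, on_same_line(&c.a, &c.b) is true, which is exactly the situation in which writes from two cores contend for one line.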

Avoiding False Sharing

Hot data written by different cores should be placed on separate cache lines. In the Linux kernel the macro __cacheline_aligned_in_smp serves this purpose. On SMP (multi‑core) builds it expands to __cacheline_aligned, which aligns a variable to the cache‑line size (64 bytes on many platforms). On single‑core builds it expands to nothing, because no other core can contend for the line.
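A rough user‑space analogue of this alignment, assuming a C11 compiler and a 64‑byte line, uses alignas. This is not the kernel macro; struct per_thread_stats and its fields are made up for the example.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size */

/* Each hot field starts on its own cache-line boundary. */
struct per_thread_stats {
    alignas(CACHE_LINE) long hits;    /* updated by one thread */
    alignas(CACHE_LINE) long misses;  /* updated by another thread */
};

/* The compiler inserts the padding needed to honour the alignment. */
static_assert(offsetof(struct per_thread_stats, misses) == CACHE_LINE,
              "misses starts on the second line");
static_assert(sizeof(struct per_thread_stats) == 2 * CACHE_LINE,
              "the struct occupies two full lines");
```

The cost is memory: two 8‑byte counters now occupy 128 bytes, so this treatment is worth reserving for fields that are genuinely hot and written from different cores.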

Explicit padding achieves the same result: placing seven unused longs (56 bytes) before or after a hot 8‑byte field pushes the neighbouring fields onto different 64‑byte cache lines.
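A minimal sketch of such padding in C follows. It assumes a 64‑byte line; the pad is written as a byte array so that it works for any size of long, and on platforms where long is 8 bytes it equals the seven longs described above. The struct and field names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size */

struct padded_pair {
    long head;                            /* hot field written by one thread */
    char pad[CACHE_LINE - sizeof(long)];  /* 56 bytes = seven 8-byte longs */
    long tail;                            /* hot field written by another thread */
};

/* The two hot fields are exactly one line apart, so they can never
 * fall inside the same 64-byte block, wherever the struct is placed. */
static_assert(offsetof(struct padded_pair, tail) -
              offsetof(struct padded_pair, head) == CACHE_LINE,
              "head and tail are one cache line apart");
```

Unlike alignment attributes, this approach needs no compiler support, which is why it also appears in languages that give no direct control over alignment.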

In user space, the Java Disruptor framework uses the same idea. RingBufferPad adds seven unused longs as front padding, and RingBuffer adds seven more as back padding, so that the frequently accessed fields between them sit on their own cache line and do not suffer false sharing.

Linux Task Scheduling

In Linux both processes and threads are represented by a task_struct. Threads share most resources with their parent process, making them lightweight processes.

Tasks have numeric priorities: lower numbers mean higher priority. Real‑time tasks use priorities 0‑99; normal tasks use 100‑139.

Real‑time classes (priority 0‑99): SCHED_DEADLINE, SCHED_FIFO, SCHED_RR.

Normal classes (priority 100‑139): SCHED_NORMAL (CFS) and SCHED_BATCH.
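The two ranges can be written down as a small mapping. The constant names below mirror those used in the kernel's priority headers, but the code is an illustrative user‑space sketch rather than kernel source; nice_to_prio and is_rt_prio are helper names chosen for this example.

```c
#include <assert.h>

#define MAX_RT_PRIO   100                          /* real-time priorities: 0..99 */
#define NICE_WIDTH    40                           /* nice values: -20..19 */
#define DEFAULT_PRIO  (MAX_RT_PRIO + NICE_WIDTH / 2)  /* nice 0 maps to 120 */
#define MAX_PRIO      (MAX_RT_PRIO + NICE_WIDTH)      /* one past the last: 140 */

static_assert(MAX_PRIO == 140, "normal priorities end at 139");

/* Maps a nice value (-20..19) to a normal-task priority (100..139). */
int nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return DEFAULT_PRIO + nice;
}

/* True for priorities in the real-time range 0..99. */
int is_rt_prio(int prio)
{
    return prio >= 0 && prio < MAX_RT_PRIO;
}
```

For example, nice ‑20 maps to priority 100 and nice 19 maps to 139, which is why a lower nice value means a higher priority.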

Completely Fair Scheduler (CFS)

CFS assigns each task a virtual runtime (vruntime) and always picks the runnable task with the smallest vruntime. Over time each task therefore receives a share of the CPU proportional to its weight.

Weight values derived from the task’s nice level affect how quickly vruntime grows: higher weight → slower vruntime increase → higher chance of being scheduled.
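The relationship can be sketched as delta_vruntime = delta_exec × NICE_0_LOAD / weight. The table below reproduces the values of the kernel's nice‑to‑weight table as commonly documented (check them against your kernel version). The plain division is a simplification, since the kernel uses scaled fixed‑point arithmetic, and weight_of and vruntime_delta are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024  /* weight of a nice-0 task in this sketch */

/* Weights for nice -20 .. 19; each step changes the weight by about 25%. */
static const uint32_t nice_to_weight[40] = {
    88761, 71755, 56483, 46273, 36291,   /* -20 .. -16 */
    29154, 23254, 18705, 14949, 11916,   /* -15 .. -11 */
     9548,  7620,  6100,  4904,  3906,   /* -10 ..  -6 */
     3121,  2501,  1991,  1586,  1277,   /*  -5 ..  -1 */
     1024,   820,   655,   526,   423,   /*   0 ..   4 */
      335,   272,   215,   172,   137,   /*   5 ..   9 */
      110,    87,    70,    56,    45,   /*  10 ..  14 */
       36,    29,    23,    18,    15,   /*  15 ..  19 */
};

uint32_t weight_of(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return nice_to_weight[nice + 20];
}

/* Virtual time charged for delta_exec_ns of real CPU time. */
uint64_t vruntime_delta(uint64_t delta_exec_ns, int nice)
{
    return delta_exec_ns * NICE_0_LOAD / weight_of(nice);
}
```

A nice‑0 task is charged virtual time equal to real time. A task with a lower nice value is charged less for the same real time, so it stays near the left of the tree and runs more often.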

Run Queues

Each CPU has a run queue (rq) containing three sub‑queues:

dl_rq: the deadline (real‑time) queue.

rt_rq: the queue for the other real‑time tasks.

cfs_rq: the CFS queue, implemented as a red‑black tree ordered by vruntime. The leftmost node is the next task to run.

Scheduling priority order is Deadline > Real‑time > Fair, so real‑time tasks always pre‑empt normal tasks.
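The selection order can be sketched as follows. This is a deliberately simplified model: the struct layout is hypothetical, the deadline and real‑time queues are reduced to a single head pointer each, and a linear scan for the smallest vruntime stands in for the red‑black tree's leftmost node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct task {
    int id;
    uint64_t vruntime;  /* only meaningful for fair tasks */
};

/* Simplified per-CPU run queue (not the kernel's struct rq). */
struct rq {
    const struct task *dl_head;  /* next deadline task, or NULL */
    const struct task *rt_head;  /* next real-time task, or NULL */
    const struct task *cfs;      /* array of runnable fair tasks */
    size_t cfs_count;
};

/* Deadline first, then real-time, then the fair task with the
 * smallest vruntime. Returns NULL when nothing is runnable. */
const struct task *pick_next(const struct rq *rq)
{
    assert(rq != NULL);
    if (rq->dl_head)
        return rq->dl_head;
    if (rq->rt_head)
        return rq->rt_head;

    const struct task *best = NULL;
    for (size_t i = 0; i < rq->cfs_count; i++)
        if (!best || rq->cfs[i].vruntime < best->vruntime)
            best = &rq->cfs[i];
    return best;
}
```

A fair task is chosen only when both higher queues are empty, which is the sense in which real‑time tasks pre‑empt normal ones.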

Adjusting Task Priority

By default a new task runs with normal priority (nice = 0). The nice value can be set in the range -20 to 19; lower values mean higher priority, and setting a negative value normally requires root privileges. The kernel maps each nice value to a weight, which CFS uses to compute vruntime (NICE_0_LOAD is the weight of a nice‑0 task). Example commands:

# Start a process with higher priority (nice value -3)
nice -n -3 mysqld

# Change the nice value of a running task to -5
renice -n -5 -p 12345
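A program can also adjust its own nice value. The sketch below uses the POSIX getpriority and setpriority calls; the wrapper names set_own_nice and get_own_nice are hypothetical. As with the commands above, lowering the nice value typically requires elevated privileges, so the call can fail.

```c
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>

/* Sets the nice value of the calling process.
 * Returns 0 on success, -1 on error with errno set. */
int set_own_nice(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return setpriority(PRIO_PROCESS, 0, nice);
}

/* Reads the nice value of the calling process into *out.
 * getpriority may legitimately return -1, so errno is checked
 * to tell a real error from a nice value of -1. */
int get_own_nice(int *out)
{
    errno = 0;
    int value = getpriority(PRIO_PROCESS, 0);
    if (value == -1 && errno != 0)
        return -1;
    *out = value;
    return 0;
}
```

Calling set_own_nice(10) from an unprivileged process should succeed because it only lowers priority, whereas set_own_nice(-5) is expected to fail without the necessary privilege.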

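For latency‑sensitive workloads you can move a task into a real‑time class, e.g. chrt -f 50 myprog runs myprog under SCHED_FIFO at priority 50. This normally requires root privileges. The task then pre‑empts all normal tasks, which makes its scheduling latency far more predictable, but a runaway real‑time task can starve everything else on that CPU.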
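Roughly the same request can be made from code with sched_setscheduler. In the sketch below the function names fifo_priority_valid and become_fifo are made up for illustration; the call is expected to fail with EPERM for an unprivileged process.

```c
#include <assert.h>
#include <sched.h>

/* True when prio lies in the range the system accepts for SCHED_FIFO. */
int fifo_priority_valid(int prio)
{
    return prio >= sched_get_priority_min(SCHED_FIFO) &&
           prio <= sched_get_priority_max(SCHED_FIFO);
}

/* Switches the calling process to SCHED_FIFO at the given priority.
 * Returns 0 on success, -1 on error with errno set. */
int become_fifo(int prio)
{
    assert(fifo_priority_valid(prio));
    struct sched_param param = { .sched_priority = prio };
    return sched_setscheduler(0, SCHED_FIFO, &param);
}
```

Checking the range first avoids an EINVAL error, and a caller should still handle failure from become_fifo rather than assume the switch succeeded.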

Key Takeaways

CPU caches reduce memory traffic; cache‑line size (usually 64 bytes) determines the granularity of data movement.

False sharing occurs when multiple threads modify different variables that share a cache line, causing excessive coherence traffic.

Aligning hot variables to cache‑line boundaries (e.g., __cacheline_aligned_in_smp) or adding padding eliminates false sharing.

In user‑space, padding via dummy fields (as in the Disruptor framework) achieves the same effect.

Linux schedules tasks based on priority classes; real‑time classes pre‑empt normal tasks.

CFS provides fair sharing by selecting the task with the smallest vruntime, where vruntime growth is inversely proportional to the task’s weight (derived from nice).

Starting a task with nice or adjusting it with renice changes a normal task's weight; promoting it to a real‑time class gives much more predictable scheduling for latency‑critical workloads.
