Why CPUs Fight Even Without Shared Variables: Understanding False Sharing
The article explains that in multithreaded programs, even when threads operate on independent data, the CPU may still suffer severe performance loss due to false sharing of cache lines, and shows how cache‑line alignment and C++17 hardware‑aware constants can eliminate the problem.
In high‑performance multithreaded programming, developers often assume that if each thread works on its own data without locks or atomic operations, scalability will be linear. In reality, CPUs can still contend heavily because the hardware cache‑coherency protocol works at the granularity of cache lines, not individual variables.
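Modern CPUs move data between memory and cache in fixed-size blocks called cache lines (typically 64 bytes on Intel/AMD, 128 bytes on Apple M-series). When a core writes to any byte within a cache line, the coherency protocol invalidates that line in every other core's cache, so those cores must re-fetch the whole line (from the writer's cache, a shared cache level, or memory) before they can touch it again. When the cores are actually working on different variables that merely happen to share a line, this is called false sharing, and it leads to cache thrashing and stalls.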
The article illustrates false sharing with two independent variables A and B that happen to reside in the same 64‑byte cache line. Thread 1 on core 1 repeatedly updates A, causing core 2’s copy of the line (containing B) to become invalid, and vice‑versa. The resulting ping‑pong effect dramatically slows down parallel execution.
A concrete benchmark demonstrates the impact. The sequential (single-thread) run takes about 2.9 seconds, while the parallel version that suffers false sharing takes roughly 5.0 seconds on the same hardware, so parallelizing made the program slower rather than faster.
To break the contention, the article presents two solutions. The first uses explicit alignment:
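#include <iostream>
#include <thread>
#include <chrono>
#include <atomic>
#include <cstdint>  // std::uint64_t
// Each counter starts on its own 64-byte boundary.
alignas(64) std::atomic<std::uint64_t> counter1{0};
alignas(64) std::atomic<std::uint64_t> counter2{0};
// thread_work, run_no_threads, and main omitted for brevity
This forces each counter onto a separate cache line, eliminating false sharing on CPUs with 64-byte lines such as current Intel and AMD parts. However, hard-coding 64 is not portable: on platforms with larger lines (e.g., 128 bytes on Apple M-series), two 64-byte-aligned counters can still land on the same line.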
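Since the worker and driver functions are omitted in the summary, the following is a hedged sketch of how two aligned counters might be driven by two threads. `PaddedCounters`, `bump`, and `run_two_threads` are hypothetical names, not the article's `thread_work` or `run_no_threads`, and the value 64 is the same hard-coded assumption discussed above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>

// 64 is an assumption that matches common x86-64 parts, not a portable value.
struct PaddedCounters {
    alignas(64) std::atomic<std::uint64_t> first{0};
    alignas(64) std::atomic<std::uint64_t> second{0};
};

// Hypothetical worker: increments one counter `iterations` times.
inline void bump(std::atomic<std::uint64_t>& counter, std::uint64_t iterations) {
    for (std::uint64_t i = 0; i < iterations; ++i) {
        counter.fetch_add(1, std::memory_order_relaxed);
    }
}

// Runs one thread per counter and returns the combined total.
inline std::uint64_t run_two_threads(PaddedCounters& c, std::uint64_t iterations) {
    std::thread t1(bump, std::ref(c.first), iterations);
    std::thread t2(bump, std::ref(c.second), iterations);
    t1.join();
    t2.join();
    return c.first.load() + c.second.load();
}
```

Because each member carries its own `alignas(64)`, the two atomics end up at least 64 bytes apart inside the struct. Removing the `alignas` specifiers gives a packed layout that can be used as the false-sharing baseline for comparison.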
C++17 introduces two hardware-aware constants in <new>:
std::hardware_destructive_interference_size: the minimum offset between two objects needed to avoid false sharing between them.
std::hardware_constructive_interference_size: the maximum size of contiguous memory that is likely to end up on a single cache line, useful when data accessed together should share a line.
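Not every standard library ships these constants, so a common pattern is to guard them with the feature-test macro `__cpp_lib_hardware_interference_size` and fall back to a fixed value. The sketch below assumes a fallback of 64 bytes; `kDestructive`, `kConstructive`, `Stats`, and `HotPair` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kDestructive = std::hardware_destructive_interference_size;
constexpr std::size_t kConstructive = std::hardware_constructive_interference_size;
#else
// Fallback assumption for toolchains without the C++17 constants.
constexpr std::size_t kDestructive = 64;
constexpr std::size_t kConstructive = 64;
#endif

// Fields written by different threads: keep them a destructive distance apart.
struct Stats {
    alignas(kDestructive) std::uint64_t produced = 0;
    alignas(kDestructive) std::uint64_t consumed = 0;
};
static_assert(sizeof(Stats) == 2 * kDestructive, "each field gets its own block");

// Fields read together by one thread: check that they fit in one line.
struct HotPair {
    std::uint64_t key = 0;
    std::uint64_t value = 0;
};
static_assert(sizeof(HotPair) <= kConstructive, "pair should fit in one line");
```

The two constants answer opposite questions: the destructive size is used to push data apart, while the constructive size is used to confirm that data meant to travel together stays within one line.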
Using these, the article defines a cache‑line‑aligned structure:
#include <iostream>
#include <new>      // std::hardware_destructive_interference_size
#include <vector>
#include <thread>
#include <chrono>
#include <cstdint>  // std::uint64_t
// Padded: each instance occupies its own cache line.
struct alignas(std::hardware_destructive_interference_size) ThreadCounter {
    std::uint64_t count = 0; // payload (8 bytes)
};
// Compact: adjacent instances can share a cache line.
struct NormalCounter { std::uint64_t count = 0; };
// run_benchmark template runs a parallel increment loop for each counter type.
Running the benchmarks shows the normal (compact) structure taking ~5557 ms, while the cache-line-aligned version finishes in ~4148 ms, confirming the performance benefit.
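The summary does not show `run_benchmark` itself, so the following is only a guess at its general shape: one thread per counter, each incrementing its own slot, with the wall-clock time returned. The signature, the `kLine` fallback of 64, and the use of `volatile` to keep the loop from being optimized away are all assumptions; the two structs are repeated so the block compiles on its own.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <new>
#include <thread>
#include <vector>

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kLine = 64;  // fallback assumption
#endif

struct alignas(kLine) ThreadCounter { std::uint64_t count = 0; };
struct NormalCounter { std::uint64_t count = 0; };

// Hypothetical reconstruction: spawns one thread per element of `counters`,
// lets each thread increment only its own element, and returns elapsed time.
template <typename Counter>
std::chrono::milliseconds run_benchmark(std::vector<Counter>& counters,
                                        std::uint64_t iterations) {
    assert(!counters.empty());
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    workers.reserve(counters.size());
    for (std::size_t i = 0; i < counters.size(); ++i) {
        workers.emplace_back([&counters, i, iterations] {
            // volatile access keeps the compiler from folding the loop
            // into a single addition, so every store reaches the cache.
            volatile std::uint64_t& slot = counters[i].count;
            for (std::uint64_t n = 0; n < iterations; ++n) {
                slot = slot + 1;
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
}
```

Calling it once with a `std::vector<NormalCounter>` and once with a `std::vector<ThreadCounter>` of the same length gives the two timings to compare. Absolute numbers depend on the machine, core count, and compiler flags, so they will not necessarily match the figures quoted above.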
The article concludes that, contrary to common belief, the dominant cost in modern multithreading is often cache‑coherency traffic rather than lock contention. Proper cache‑line alignment, either via alignas or the C++17 hardware constants, is essential for achieving expected scalability.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
IT Services Circle
Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.
