Mastering NUMA and Hyper-Threading: Boost CPU Cache Hits and Reduce Latency
This article explains NUMA architecture with hyper‑threading, details CPU cache hierarchies and access latencies, and provides Linux tools and practical optimization techniques to improve cache‑hit rates and minimize cross‑NUMA memory delays.
NUMA Architecture with Hyper-Threading
Physical hardware perspective:
A socket holds one physical CPU package.
A core is an independent execution unit within that package.
Intel Hyper-Threading (HT) runs two hardware threads per core, doubling the logical processors visible to the OS.
Each hardware thread is addressed as a logical CPU, so a four-core HT processor appears to the OS as eight CPUs.
From the operating system's perspective (lscpu excerpt):
<code>CPU(s): 8
NUMA node0 CPU(s): 0,4
NUMA node1 CPU(s): 1,5
NUMA node2 CPU(s): 2,6
NUMA node3 CPU(s): 3,7
</code>
L1 cache is split into instruction and data caches; L2 and L3 are unified. L1 and L2 are private to each core, while L3 is shared across all cores. Access latency increases with distance from the core:
L1 access latency: ≈ 4 CPU cycles
L2 access latency: ≈ 11 CPU cycles
L3 access latency: ≈ 39 CPU cycles
RAM access latency: ≈ 107 CPU cycles
When data resides in cache, the CPU reads it directly (cache hit), dramatically improving performance; thus, code optimization should aim to increase cache‑hit rates.
Typical servers have 10-20+ physical cores per socket, often across multiple CPU sockets, each with its own cores, private L1/L2 caches, a shared L3 cache, and locally attached memory. Sockets communicate over an inter-socket interconnect (e.g., Intel QPI/UPI). Example output of lscpu:
<code>root@ubuntu:~# lscpu
Architecture: x86_64
CPU(s): 32
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 4
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
</code>Note that L3 cache is much larger because it is shared among all cores on a CPU.
If an application starts on one socket, stores data in its local memory, and later runs on another socket, it must access remote memory, incurring higher latency than accessing memory directly attached to the current socket.
Common Performance Monitoring Tools
On Linux, the typical tools for CPU and memory subsystem tuning are top, perf, and numactl.
top: shows overall system resource usage. Press 1 to see per-core usage; run top -H -p $PID for a per-thread view; press f and enable the P (last-used CPU) field to watch threads migrate between cores.
perf: a powerful profiling tool. perf top shows the functions consuming the most CPU cycles; perf record -g -p $PID -- sleep 1 samples the process (with call graphs) for one second; perf sched latency --sort max (after a perf sched record session) sorts tasks by scheduling latency; perf report displays recorded results.
numactl: numactl -H displays the NUMA topology; the companion tool numastat shows per-node memory allocation statistics; numactl can also bind processes to specific CPUs and memory nodes.
Optimization Methods
1) NUMA optimization – reduce cross-NUMA memory accesses. Access latency increases with distance: remote socket > remote NUMA node > local NUMA node. Set CPU affinity so that threads keep their memory accesses on the local node. Common techniques:
<code># Pin an interrupt's handling to specific CPUs
echo $cpuNumber > /proc/irq/$irq/smp_affinity_list
# Example: bind IRQ 78 to CPUs 0 through 4
echo 0-4 > /proc/irq/78/smp_affinity_list
# Bind IRQ 78 to CPUs 3 and 8
echo 3,8 > /proc/irq/78/smp_affinity_list
</code>2) Launch programs with numactl to restrict them to specific cores, e.g.:
<code>numactl -C 0-7 ./mongod
</code>3) Use taskset to bind a program to a specific core:
<code>taskset -c 0 ./redis-server
</code>4) In C/C++ code, call sched_setaffinity to set the CPU affinity of a process or thread.
5) Many open-source services allow affinity configuration, e.g., Nginx's worker_cpu_affinity directive.
Binding Cores – Important Considerations
In a NUMA system, logical CPU numbering does not necessarily list all logical cores of one socket before moving to the next. On many machines the first hardware thread of each physical core is numbered first (across all sockets), and the second hardware threads follow, so consecutive CPU numbers can land on different sockets. Misreading this order leads to incorrect core binding; verify the layout with lscpu or /sys/devices/system/cpu before pinning.
Preview
Next, we will discuss how improving L1 and L2 cache hit rates can further boost application performance.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.