
Mastering NUMA and Hyper-Threading: Boost CPU Cache Hits and Reduce Latency

This article explains NUMA architecture with hyper‑threading, details CPU cache hierarchies and access latencies, and provides Linux tools and practical optimization techniques to improve cache‑hit rates and minimize cross‑NUMA memory delays.


NUMA Architecture with Hyper-Threading Support

From the physical hardware perspective:

A socket holds one physical CPU package; multiple cores are packaged together on it.

A core is an independent execution unit on the package.

Intel Hyper‑Threading (HT) runs two hardware threads per core, increasing the number of logical processors visible to the OS.

Each hardware thread is addressed as a logical CPU, so a four‑core processor with HT appears to the OS as eight CPUs.

From the operating system's perspective:

<code>CPU(s):              8
NUMA node0 CPU(s):   0,4
NUMA node1 CPU(s):   1,5
NUMA node2 CPU(s):   2,6
NUMA node3 CPU(s):   3,7
</code>

L1 cache is split into instruction and data caches; L2 and L3 are unified. L1 and L2 are per‑core, while L3 is shared across all cores. Cache latency increases with distance from the CPU: L1 ≈ 4 cycles, L2 ≈ 11 cycles, L3 ≈ 39 cycles, RAM ≈ 107 cycles.


When data resides in cache, the CPU reads it directly (cache hit), dramatically improving performance; thus, code optimization should aim to increase cache‑hit rates.

Typical servers have 10–20+ physical cores per CPU, and often multiple CPU sockets, each with its own cores, L1/L2 caches, shared L3 cache, and attached memory. Sockets are connected via a bus. Example output of lscpu:

<code>root@ubuntu:~# lscpu
Architecture:          x86_64
CPU(s):                32
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             4
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31
</code>

Note that L3 cache is much larger because it is shared among all cores on a CPU.

If an application starts on one socket, stores data in its local memory, and later runs on another socket, it must access remote memory, incurring higher latency than accessing memory directly attached to the current socket.

Common Performance Monitoring Tools

On Linux, the typical tools for CPU and memory subsystem tuning are top, perf, and numactl.

top: view overall system resource usage. Press 1 to see per-core usage; use top -H -p $PID for a per-thread view; press f and enable the P (last-used-CPU) field to watch threads migrate between cores.

perf: a powerful profiling tool. perf top shows the functions consuming the most CPU cycles; perf record -g -p $PID -- sleep 1 samples call stacks of a process for one second; perf sched latency --sort max (after a perf sched record run) sorts tasks by maximum scheduling latency; perf report displays the recorded results.

numactl: displays the NUMA configuration with numactl -H, while numastat shows runtime per-node memory statistics; numactl can also bind processes to specific CPUs and memory nodes.

Optimization Methods

1) NUMA optimization – reduce cross‑NUMA memory accesses. Latency rises with distance: accessing the local NUMA node is fastest, another node on the same socket is slower, and a node on another socket is slowest. Set CPU and IRQ affinity so that memory accesses stay local. For example, steer an interrupt to specific CPUs:

<code>echo $cpuNumber > /proc/irq/$irq/smp_affinity_list
# Example: echo 0-4 > /proc/irq/78/smp_affinity_list
#          echo 3,8 > /proc/irq/78/smp_affinity_list
</code>

2) Launch programs with numactl to restrict them to specific cores, e.g.:

<code>numactl -C 0-7 ./mongod
</code>

3) Use taskset to bind a program to a core:

<code>taskset -c 0 ./redis-server
</code>

4) In C/C++ code, call sched_setaffinity to set thread affinity.

5) Many open‑source services allow affinity configuration, e.g., Nginx’s worker_cpu_affinity directive.
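As a sketch, the directive takes one CPU bitmask per worker (lowest bit = CPU 0); the values below are an illustration and must match your actual topology:

<code>worker_processes     4;
# One bitmask per worker: worker 1 -> CPU 0, worker 2 -> CPU 1, etc.
worker_cpu_affinity  0001 0010 0100 1000;
</code>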

Binding Cores – Important Considerations

In a NUMA system with hyper‑threading, logical CPU numbering does not necessarily list all logical cores of one socket before moving to the next. A common scheme numbers the first hardware thread of each physical core first and the HT siblings afterwards — in the eight‑CPU example above, CPUs 0 and 4 are two threads of the same core. Check the actual topology with lscpu before binding; misreading the numbering leads to incorrect core binding (e.g., pinning two busy threads onto HT siblings of one core).

Preview

Next, we will discuss how improving L1 and L2 cache hit rates can further boost application performance.

Written by Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
