
Mastering NUMA and Hyper-Threading: Boost CPU Cache Hits and Reduce Latency

This article explains NUMA architecture with hyper‑threading, details CPU cache hierarchies and access latencies, and provides Linux tools and practical optimization techniques to improve cache‑hit rates and minimize cross‑NUMA memory delays.


NUMA Architecture with Hyper-Threading Support

From the physical hardware perspective:

A socket holds one physical CPU package; multiple cores are packaged together on it.

A core is an independent execution unit on the package.

Intel Hyper‑Threading (HT) runs two hardware threads per core, increasing the number of logical processors visible to the OS.

Each hardware thread is addressed as a logical CPU, so a four‑core processor with HT appears to the OS as eight CPUs.

From the operating system's perspective:

<code>CPU(s):              8
NUMA node0 CPU(s):   0,4
NUMA node1 CPU(s):   1,5
NUMA node2 CPU(s):   2,6
NUMA node3 CPU(s):   3,7
</code>

L1 cache is split into instruction and data caches; L2 and L3 are unified. L1 and L2 are per‑core, while L3 is shared across all cores. Cache latency increases with distance from the CPU: L1 ≈ 4 cycles, L2 ≈ 11 cycles, L3 ≈ 39 cycles, RAM ≈ 107 cycles.


When data resides in cache, the CPU reads it directly (cache hit), dramatically improving performance; thus, code optimization should aim to increase cache‑hit rates.

Typical servers have 10–20+ physical cores per CPU, and often multiple CPU sockets, each with its own cores, L1/L2 caches, shared L3 cache, and attached memory. Sockets are connected via a bus. Example output of lscpu:

<code>root@ubuntu:~# lscpu
Architecture:          x86_64
CPU(s):                32
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             4
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31
</code>

Note that L3 cache is much larger because it is shared among all cores on a CPU.

If an application starts on one socket, stores data in its local memory, and later runs on another socket, it must access remote memory, incurring higher latency than accessing memory directly attached to the current socket.

Common Performance Monitoring Tools

On Linux, the typical tools for CPU and memory subsystem tuning are top, perf, and numactl.

top: view overall system resource usage. Press 1 to see per-core usage; use top -H -p $PID for a per-thread view; press f and enable the P (last-used-CPU) field to watch threads migrate between cores.

perf: a powerful profiling tool. perf top shows the functions consuming the most CPU cycles; perf record -g -p $PID -- sleep 1 samples call stacks of a process for one second; perf sched latency --sort max (after a perf sched record run) sorts tasks by maximum scheduling latency; perf report displays the recorded results.

numactl: displays the NUMA configuration with numactl -H, while numastat shows runtime per-node memory statistics; numactl can also bind processes to specific CPUs and memory nodes.

Optimization Methods

1) NUMA optimization – reduce cross‑NUMA memory accesses. Latency rises with distance: accessing the local NUMA node is fastest, another node on the same socket is slower, and a node on another socket is slowest. Set CPU and IRQ affinity so that memory accesses stay local. For example, steer an interrupt to specific CPUs:

<code>echo $cpuNumber > /proc/irq/$irq/smp_affinity_list
# Example: echo 0-4 > /proc/irq/78/smp_affinity_list
#          echo 3,8 > /proc/irq/78/smp_affinity_list
</code>

2) Launch programs with numactl to restrict them to specific cores, e.g.:

<code>numactl -C 0-7 ./mongod
</code>

3) Use taskset to bind a program to a core:

<code>taskset -c 0 ./redis-server
</code>

4) In C/C++ code, call sched_setaffinity to set thread affinity.

5) Many open‑source services allow affinity configuration, e.g., Nginx’s worker_cpu_affinity directive.
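As a sketch, the directive takes one CPU bitmask per worker (lowest bit = CPU 0); the values below are an illustration and must match your actual topology:

<code>worker_processes     4;
# One bitmask per worker: worker 1 -> CPU 0, worker 2 -> CPU 1, etc.
worker_cpu_affinity  0001 0010 0100 1000;
</code>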

Binding Cores – Important Considerations

In a NUMA system with hyper‑threading, logical CPU numbering does not necessarily list all logical cores of one socket before moving to the next. A common scheme numbers the first hardware thread of each physical core first and the HT siblings afterwards — in the eight‑CPU example above, CPUs 0 and 4 are two threads of the same core. Check the actual topology with lscpu before binding; misreading the numbering leads to incorrect core binding (e.g., pinning two busy threads onto HT siblings of one core).

Preview

Next, we will discuss how improving L1 and L2 cache hit rates can further boost application performance.

Written by Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
