Unlock Linux NUMA Performance: A Practical Multithreaded Tuning Guide
This article explains the fundamentals of the NUMA architecture and why it matters for multithreaded Linux applications, then gives step-by-step practical guidance (kernel internals, memory allocation policies, useful commands, and performance-monitoring tools) to help developers improve memory locality and overall program efficiency.
In today’s data‑intensive era, multithreaded programs are essential for exploiting hardware potential and accelerating applications, especially on Linux systems where Non‑Uniform Memory Access (NUMA) architecture offers both opportunities and challenges. NUMA divides memory into nodes, each tied to specific CPU cores, allowing fast local memory access when threads and memory are co‑located.
However, threads often cross node boundaries, causing performance bottlenecks. This guide dives deep into Linux NUMA tuning for multithreaded programs, covering underlying principles, useful tools, and concrete optimization strategies to harness NUMA’s benefits.
Part1 What Is NUMA?
NUMA (Non‑Uniform Memory Access) is a memory architecture for multiprocessor systems where access latency depends on the memory’s proximity to the processor. Local memory access is much faster than remote memory access.
In traditional Uniform Memory Access (UMA) systems, all processors share a single memory bus, leading to contention as CPU counts rise. NUMA solves this by partitioning the system into multiple nodes, each with its own CPUs and local memory, connected by a high‑speed interconnect.
1.1 Problems with the traditional SMP architecture
All CPUs share a common bus to a unified memory pool.
Increasing CPU count makes the bus a bottleneck.
Memory bandwidth cannot satisfy the demand of multiple CPUs.
NUMA addresses these issues by assigning each node a set of CPUs and local memory, reducing global bus contention.
Traditional SMP architecture:
CPU1 ─┐
CPU2 ─┼─── shared bus ─── memory
CPU3 ─┤
CPU4 ─┘
1.2 NUMA's solution
Each node contains a group of CPUs and local memory.
Nodes are linked by a high‑speed interconnect network.
NUMA architecture:
Node 0: CPU0,CPU1 ── local memory 0
   │
   interconnect
   │
Node 1: CPU2,CPU3 ── local memory 1
1.3 Why tune NUMA performance?
Performance of multithreaded programs on NUMA depends heavily on memory locality. Without NUMA‑aware design, threads may frequently access remote memory, dramatically increasing latency and reducing bandwidth, which slows down data‑intensive workloads such as big‑data analysis, database servers, and scientific simulations.
Part2 NUMA System Architecture
2.1 The evolution of memory management
(1) The limits of the SMP architecture: As CPU core counts grow, shared-bus contention becomes severe, leading to high latency and reduced throughput.
(2) NUMA emerges: The system is split into nodes, each with its own CPUs, memory, and I/O, resembling independent neighborhoods with fast local access and slower remote access.
2.2 A portrait of NUMA in the Linux kernel
In the Linux kernel, each NUMA node is represented by a struct pglist_data, usually referred to through its typedef pg_data_t. This structure holds an array, node_zones, of struct zone entries that divide the node's memory into zones such as ZONE_DMA, ZONE_DMA32, and ZONE_NORMAL. The zones serve different hardware needs, for example devices that can only address low memory.
Allocation follows the node's fallback zonelist (ZONELIST_FALLBACK), which starts with the local node's ZONE_NORMAL, then ZONE_DMA32, and finally ZONE_DMA, and continues with the zones of progressively more distant nodes. When a process requests memory, the kernel walks this list and takes the first zone that can satisfy the request, so it falls back to farther nodes only when closer ones cannot.
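The distance-based fallback described above can be approximated from user space. The sketch below is an illustration only: it assumes the distance table exposed in /sys/devices/system/node/nodeN/distance is the main input, whereas the kernel's real zonelist construction also weighs other factors. The helper name fallback_order is hypothetical.

```shell
#!/bin/sh
# fallback_order ROW: given one row of the node distance table
# (space-separated, position = node ID), print node IDs nearest first.
# Illustration only; the kernel's own ordering may differ in detail.
fallback_order() {
    # Turn the row into "distance node_id" pairs, sort them by distance
    # (stable, so equally distant nodes keep ID order), then print the IDs.
    printf '%s\n' "$1" | tr -s ' ' '\n' |
        awk 'NF { print $1, n + 0; n++ }' |
        sort -s -n -k1,1 |
        awk '{ printf "%s%s", sep, $2; sep = " " } END { print "" }'
}

# Walk the real topology when sysfs exposes it (absent on non-NUMA kernels).
for d in /sys/devices/system/node/node[0-9]*; do
    [ -r "$d/distance" ] || continue
    echo "${d##*/}: $(fallback_order "$(cat "$d/distance")")"
done
```

On a two-node machine with distances 10 and 20, each node would list itself first and the other node second.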
Part3 Core NUMA Technologies
Grouping CPUs and memory into nodes gives faster local memory access (roughly 100 ns) and slower remote access (roughly 200 to 300 ns); the exact figures depend on the hardware. This eases UMA's bus contention and improves scalability for data-intensive and high-performance workloads.
Practical techniques include CPU pinning, IRQ affinity, and ensuring that threads operate on data located in the same NUMA node.
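One common way to keep threads and their data on the same node is to start one worker process per node. The sketch below only prints the commands (a dry run). It assumes node IDs are contiguous starting at 0, and both ./worker and the helper name per_node_cmds are hypothetical.

```shell
#!/bin/sh
# per_node_cmds NODES CMD...: print one numactl command line per node so that
# each worker's CPUs and memory stay on the same node. Dry run only.
# Assumes node IDs run from 0 to NODES-1 without gaps.
per_node_cmds() {
    nodes=$1
    shift
    n=0
    while [ "$n" -lt "$nodes" ]; do
        echo "numactl --cpunodebind=$n --membind=$n $*"
        n=$((n + 1))
    done
}

# ./worker is a placeholder program name.
per_node_cmds 2 ./worker
```

Printing the commands first makes it easy to review the placement before running anything.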
3.1 UMA
UMA is a shared‑memory architecture where all processors access a single memory pool with uniform latency, suitable for modest workloads but limited by bus bandwidth as CPU counts increase.
3.2 NUMA
NUMA assigns each processor a local memory region, providing faster access compared to remote memory.
3.4 vNUMA
vNUMA exposes the host’s NUMA topology to virtual machines, allowing guest OSes to make optimal placement decisions. Changing vNUMA layout can affect stability, especially during vMotion migrations.
3.5 Why NUMA matters
Multithreaded applications benefit from local memory access; remote accesses increase latency. Hypervisors like ESXi also use NUMA to distribute virtual CPUs across nodes for better performance.
Part4 Exploring NUMA Nodes
4.1 The "magic commands" for detection
Use numactl --hardware or lscpu | grep NUMA to view node count, CPU lists, memory size, and inter‑node distances.
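The same topology is also available from sysfs, which helps on machines where numactl is not installed. The sketch below assumes the usual /sys/devices/system/node layout; count_cpus is a hypothetical helper that counts the CPUs in a kernel cpulist string.

```shell
#!/bin/sh
# count_cpus LIST: count the CPUs in a kernel cpulist such as "0-7,16-23".
count_cpus() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F- '
        NF == 1 { total += 1 }             # single CPU, e.g. "5"
        NF == 2 { total += $2 - $1 + 1 }   # range, e.g. "0-7"
        END     { print total + 0 }'
}

# Print each node with its CPU list and CPU count, if sysfs exposes them.
for d in /sys/devices/system/node/node[0-9]*; do
    [ -r "$d/cpulist" ] || continue
    list=$(cat "$d/cpulist")
    echo "${d##*/}: cpus=$list count=$(count_cpus "$list")"
done
```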
4.2 Traces in the code
Key kernel pieces include numa_node_id(), which returns the ID of the node that the current CPU belongs to, and fields of struct pglist_data such as node_id, node_start_pfn, and node_spanned_pages. At boot, the kernel reads the ACPI tables (SRAT for CPU and memory affinity, SLIT for inter-node distances) to populate these structures.
4.3 Viewing NUMA information
# View the NUMA topology
lscpu | grep NUMA
numactl --hardware
lstopo # requires hwloc
# View NUMA usage, system-wide and per process
numastat
numastat -p <pid>
# View memory usage
cat /proc/meminfo
cat /proc/buddyinfo
4.4 Monitoring NUMA performance
# View NUMA hit and miss statistics (compact table)
numastat -c
# Monitor NUMA events with perf (event names depend on the CPU)
perf stat -e node-loads,node-load-misses ./program
4.5 Debugging NUMA problems
# View the process's memory mappings and their per-node page counts
cat /proc/<pid>/numa_maps
# View NUMA balancing statistics
grep numa /proc/vmstat
4.6 Performance tuning
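Before turning automatic NUMA balancing off, it can help to check whether it is doing useful work. The sketch below reads the numa_hint_faults and numa_hint_faults_local counters from /proc/vmstat, which are present only on kernels built with NUMA balancing. It reports the share of hinting faults that were already node-local. The helper name hint_fault_locality is hypothetical, and treating a high local share as "little left to migrate" is a rule of thumb, not a kernel guarantee.

```shell
#!/bin/sh
# hint_fault_locality: read /proc/vmstat-style "name value" lines on stdin and
# print the percentage of NUMA hinting faults that were node-local,
# or "n/a" when the counters are missing or zero.
hint_fault_locality() {
    awk '
        $1 == "numa_hint_faults"       { total = $2 }
        $1 == "numa_hint_faults_local" { loc = $2 }
        END {
            if (total > 0) printf "%.1f\n", 100 * loc / total
            else           print "n/a"
        }'
}

if [ -r /proc/vmstat ]; then
    hint_fault_locality < /proc/vmstat
fi
```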
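# Disable automatic NUMA balancing (may improve performance; requires root)
echo 0 > /proc/sys/kernel/numa_balancing
# Adjust zone_reclaim_mode (1 = reclaim memory on the local node before allocating from a remote node)
echo 1 > /proc/sys/vm/zone_reclaim_mode
Part5 Before You Start: Tools and Environment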
5.1 numactl
Install with sudo apt-get install numactl (Debian/Ubuntu) or sudo yum install numactl (CentOS/RHEL). Use numactl --hardware to display node details. Sample output:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 131037 MB
node 0 free: 3019 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 131071 MB
node 1 free: 9799 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
5.2 numastat
numastat ships in the same package as numactl. Run numastat to see per-node hit, miss, and foreign allocation counters. Sample output:
                 node0        node1
numa_hit         1775216830   6808979012
numa_miss        4091495      494235148
numa_foreign     494235148    4091495
interleave_hit   52909        53004
local_node       1775205816   6808927908
other_node       4102509      494286252
5.3 perf
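perf is developed in the Linux kernel source tree and is usually installed from a distribution package, such as linux-tools on Debian/Ubuntu or perf on CentOS/RHEL. Use perf stat -p <pid> -e cpu-cycles to measure CPU cycles, or perf top -p <pid> to see hot functions.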
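Part6 Hands-On: NUMA Performance Tuning Steps
6.1 Checking the system's NUMA configuration
Run lscpu | grep NUMA. Sample output:
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
6.2 Binding a process to a NUMA node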
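Bind an existing process (PID 1234 here) to node 0's CPUs from the listing above. taskset changes CPU affinity only, so memory that is already allocated stays where it is:
taskset -cp 0-30:2 1234
Start a new process with both its CPUs and its memory bound to node 1:
numactl --cpunodebind=1 --membind=1 ./test
6.3 Optimizing the memory allocation policy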
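NUMA memory policies are set per process, for example with numactl. The transparent_hugepage/enabled file accepts only always, madvise, or never, and does not control NUMA policy.
Set a preferred node (allocations fall back to other nodes when it is full):
numactl --preferred=1 ./test
Set interleaved allocation across specific nodes:
numactl --interleave=0,1 ./test
Interleave across all nodes:
numactl --interleave=all ./test
6.4 Using huge pages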
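Before reserving huge pages, it helps to work out how many pages a given working set needs. The sketch below is a small sizing helper under stated assumptions: hugepages_needed is a hypothetical name, sizes are given in MiB, and the page size defaults to 2048 kB to match the hugepages-2048kB sysfs directory.

```shell
#!/bin/sh
# hugepages_needed SIZE_MIB [PAGE_KB]: number of huge pages needed to back
# SIZE_MIB of memory, rounded up. PAGE_KB defaults to 2048 (2 MiB pages).
hugepages_needed() {
    size_kb=$(( $1 * 1024 ))
    page_kb=${2:-2048}
    echo $(( (size_kb + page_kb - 1) / page_kb ))
}

# 2 GiB backed by 2 MiB pages needs 1024 pages.
hugepages_needed 2048
```

The result is the value to write into nr_hugepages; the kernel may reserve fewer pages than requested if memory is fragmented, so it is worth reading the file back afterwards.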
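Check the current huge-page status:
grep HugePages /proc/meminfo
Reserve 1024 huge pages of 2 MiB each (requires root):
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
Run the application with its memory bound to a node. numactl's --huge option applies only to shared-memory segments created with --shm, so the program itself must request huge pages, for example through mmap with MAP_HUGETLB or a hugetlbfs mount:
numactl --membind=0 ./test
6.5 Performance monitoring and evaluation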
Use top to observe CPU and memory usage, perf for cycles and cache misses, and numastat to compare numa_hit and numa_miss before and after tuning. To verify the improvement, measure execution time (for example with time), throughput, and CPU utilization.
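To make the before-and-after comparison concrete, the numastat counters can be reduced to one hit percentage per node. The sketch below assumes the default numastat table layout, with one column per node in node0, node1, ... order. The helper name numa_hit_pct is hypothetical.

```shell
#!/bin/sh
# numa_hit_pct: read default numastat output on stdin and print, for each
# node column, numa_hit as a percentage of numa_hit + numa_miss.
# Assumes columns appear in node0, node1, ... order.
numa_hit_pct() {
    awk '
        $1 == "numa_hit"  { for (i = 2; i <= NF; i++) hit[i]  = $i; cols = NF }
        $1 == "numa_miss" { for (i = 2; i <= NF; i++) miss[i] = $i }
        END {
            for (i = 2; i <= cols; i++) {
                total = hit[i] + miss[i]
                if (total > 0)
                    printf "node%d %.2f\n", i - 2, 100 * hit[i] / total
            }
        }'
}

# Run against the live counters when numastat is installed.
if command -v numastat >/dev/null 2>&1; then
    numastat | numa_hit_pct
fi
```

Applied to the sample numastat output shown earlier, this gives roughly 99.8% for node0 and 93.2% for node1. Because the counters are cumulative since boot, capture them before and after a run and compare the differences, not the absolute values.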
Deepin Linux
Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
