Understanding NUMA Node Detection and Memory Management in the Linux Kernel
This article explains the fundamentals of NUMA architecture, how Linux detects and represents NUMA nodes, the memory zone hierarchy, allocation policies, and practical techniques such as using numactl and taskset to bind processes for optimal performance on multi‑socket servers.
1. What Is NUMA?
NUMA (Non‑Uniform Memory Access) overcomes the scalability limits of traditional SMP systems by giving each CPU socket its own local memory, with sockets connected by point‑to‑point links such as Intel's QPI/UPI. The design combines SMP's ease of programming with much of MPP's scalability, at the cost of slower access to memory attached to a remote socket.
2. NUMA System Architecture
2.1 Evolution of Memory Management
Early SMP systems suffered from bus contention as CPU counts grew, leading to performance bottlenecks. NUMA introduces separate memory controllers per socket, reducing contention and providing faster local memory access.
2.2 Linux Kernel’s NUMA Representation
In the Linux kernel, each NUMA node is described by struct pglist_data (typedef'd as pg_data_t), which contains an array of struct zone objects such as ZONE_DMA, ZONE_DMA32, and ZONE_NORMAL. These zones partition a node's memory according to the addressing limits of different hardware.
On x86‑64, ZONE_DMA covers the low 16 MiB for legacy ISA DMA devices, ZONE_DMA32 serves devices that can only address up to 4 GiB, and ZONE_NORMAL holds the bulk of usable memory for the kernel and user space.
3. NUMA Core Techniques
3.1 Memory Allocation Policy
When allocating memory, the kernel walks a per‑node zonelist built at boot from the node distance table (ACPI SLIT on x86). Under the default ZONELIST_FALLBACK policy, it prefers the local node's highest suitable zone (normally ZONE_NORMAL) and falls back to progressively more distant nodes only when local memory is exhausted.
Flags such as __GFP_THISNODE select the ZONELIST_NOFALLBACK zonelist instead, restricting the allocation to the requesting node and failing rather than spilling to a remote one.
4. Discovering NUMA Nodes
4.1 Useful Commands
The numactl --hardware command displays the number of nodes, CPUs per node, memory size, and inter‑node distances. The /sys/devices/system/node directory contains per‑node files like cpulist and meminfo that provide real‑time topology and usage data.
4.2 Kernel Code Insight
Key kernel pieces include numa_node_id(), which returns the current CPU's node from a per‑CPU variable populated at boot from firmware tables (ACPI SRAT on x86) rather than by probing hardware on each call, and the struct pglist_data fields node_id, node_start_pfn, and node_spanned_pages, which record the node's identifier, its first page frame number, and its size in pages.
5. Practical NUMA Optimization
5.1 Binding Processes to Nodes
Running numactl --cpunodebind=0 --membind=0 <program> forces a workload onto a specific node, ensuring both CPU execution and memory allocation stay local. The taskset command binds threads to particular CPU cores, but note that it controls CPU affinity only; it places no constraint on which node memory is allocated from.
5.2 Performance Impact Example
A dual‑node server (2 sockets × 8 cores, 32 GiB RAM) running a multithreaded database query sees memory latency drop from ~200 ns to under 50 ns and bandwidth utilization rise from ~50 % to over 80 % after the process is bound to a single node with numactl, demonstrating the tangible benefit of NUMA‑aware placement.