Unlocking Linux Performance: A Deep Dive into NUMA Architecture
This article explains the core principles of NUMA, its deep integration with the Linux kernel, practical memory‑node and scheduling mechanisms, real‑world database and virtualization use cases, and step‑by‑step commands for inspecting and tuning NUMA on modern servers.
What is NUMA?
NUMA (Non‑Uniform Memory Access) is a hardware architecture that groups CPUs and local memory into nodes, allowing fast local memory access while remote memory access incurs higher latency and lower bandwidth.
NUMA System Architecture Details
2.1 NUMA Architecture Definition
Unlike SMP, where all CPUs share a single memory bus, NUMA divides memory into multiple nodes, each directly attached to a subset of CPUs. This design reduces memory‑access conflicts and improves bandwidth scalability for multi‑core systems.
2.2 NUMA Architecture Components
Key components include processor cores, memory controllers, memory nodes, and the interconnect fabric (e.g., Intel QPI, AMD Infinity Fabric). Nodes are placed close to their CPUs to minimize latency, and high‑speed interconnects link the nodes.
2.3 How NUMA Works
When a core requests memory, the request is serviced from the core's local node if the physical address maps there (local access). Otherwise the request traverses the interconnect to a remote node (remote access), typically incurring 1.5-2× higher latency. Linux mitigates remote-access overhead with strategies such as memory prefetching, page migration, and load balancing.
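The latency gap is easy to observe from user space. Below is a minimal sketch, not a rigorous benchmark, that pins the calling thread to node 0 and times pointer-chasing over a buffer placed on node 0 versus one on node 1 using libnuma (build with gcc numa_latency.c -lnuma); the node numbers, buffer size, and stride are assumptions for a two-node machine.
/* numa_latency.c: rough local-vs-remote latency probe (assumes >= 2 nodes) */
#include <numa.h>
#include <stdio.h>
#include <time.h>

#define N (64 * 1024 * 1024 / sizeof(long))   /* 64 MiB of longs */

static double chase(long *buf) {
    /* Build a single-cycle stride pattern, then time dependent loads. */
    for (size_t i = 0; i < N; i++) buf[i] = (i + 4099) % N;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile long idx = 0;
    for (size_t i = 0; i < N; i++) idx = buf[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA machine with at least 2 nodes\n");
        return 1;
    }
    numa_run_on_node(0);                       /* execute on node 0 only */
    long *local  = numa_alloc_onnode(N * sizeof(long), 0);
    long *remote = numa_alloc_onnode(N * sizeof(long), 1);
    if (!local || !remote) return 1;
    printf("local : %.0f ns total\n", chase(local));
    printf("remote: %.0f ns total\n", chase(remote));
    numa_free(local,  N * sizeof(long));
    numa_free(remote, N * sizeof(long));
    return 0;
}
On typical two-socket hardware the remote pass should come out noticeably slower, broadly in line with the 1.5-2× figure above.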
NUMA Implementation in the Linux Kernel
3.1 Memory‑Node Partitioning
During boot, the kernel reads ACPI tables to discover the hardware topology, builds a NUMA node map, and populates pg_data_t structures that describe each node’s memory layout.
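The node map built at boot is exported to user space under /sys/devices/system/node and through the libnuma API. A minimal sketch that walks it (assuming libnuma is installed; build with gcc numa_topology.c -lnuma):
/* numa_topology.c: print the node map the kernel built from ACPI tables */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    int max = numa_max_node();
    for (int n = 0; n <= max; n++) {
        long long free_b;
        long long size_b = numa_node_size64(n, &free_b);   /* bytes */
        printf("node %d: %lld MiB total, %lld MiB free\n",
               n, size_b >> 20, free_b >> 20);
    }
    /* Inter-node distances from the ACPI SLIT: 10 = local, larger = farther */
    for (int a = 0; a <= max; a++) {
        for (int b = 0; b <= max; b++)
            printf("%4d", numa_distance(a, b));
        printf("\n");
    }
    return 0;
}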
3.2 Memory Allocation Strategies
Linux prefers local allocation (via alloc_pages_current and the zoned buddy allocator) and falls back to remote nodes only when local memory is exhausted. A memory-migration framework (automatic NUMA balancing) can move pages between nodes to balance load.
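This local-first preference can be observed directly: under the default policy, anonymous pages land on the node of the CPU that first touches them. The following sketch (node 0 is an assumption) faults pages from a node-0 CPU and then queries their placement with move_pages(2) in query mode (build with gcc first_touch.c -lnuma):
/* first_touch.c: verify that pages land on the faulting CPU's node */
#define _GNU_SOURCE
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    if (numa_available() < 0) return 1;
    numa_run_on_node(0);                  /* fault pages from a node-0 CPU */
    size_t len = 4 * 1024 * 1024;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;
    memset(buf, 1, len);                  /* first touch allocates the pages */

    /* move_pages() with nodes == NULL only reports where each page lives */
    long pagesz = sysconf(_SC_PAGESIZE);
    void *pages[8];
    int status[8];
    for (int i = 0; i < 8; i++) pages[i] = buf + i * pagesz;
    move_pages(0, 8, pages, NULL, status, 0);
    for (int i = 0; i < 8; i++)
        printf("page %d is on node %d\n", i, status[i]);
    return 0;
}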
3.3 Process Scheduling and Affinity
Each task carries a mems_allowed node mask indicating the nodes it may allocate from. System calls like sched_setaffinity and sched_getaffinity let users bind processes to specific CPUs (and thus, indirectly, to the nodes that contain them), while the kernel's automatic NUMA balancing dynamically migrates tasks and pages based on observed memory-access patterns.
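A minimal sketch of such pinning (node 0 is an assumption; libnuma is used only to translate the node into its CPU list before calling sched_setaffinity; build with gcc pin_to_node0.c -lnuma):
/* pin_to_node0.c: restrict the calling thread to the CPUs of node 0 */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) return 1;
    struct bitmask *cpus = numa_allocate_cpumask();
    numa_node_to_cpus(0, cpus);           /* which CPUs belong to node 0? */

    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned long i = 0; i < cpus->size; i++)
        if (numa_bitmask_isbitset(cpus, i))
            CPU_SET(i, &set);
    numa_free_cpumask(cpus);

    /* pid 0 means the calling thread */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("now restricted to node 0's CPUs\n");
    return 0;
}
Note that this only pins execution; keeping memory on the same node is the job of the policy calls described next.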
3.4 NUMA‑Aware System Calls
Calls such as get_mempolicy, set_mempolicy, and mbind allow applications to query or enforce memory policies, e.g., binding a memory region to a preferred node to reduce remote accesses.
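For example, a minimal sketch that binds one anonymous mapping to node 0 with mbind (the node choice and region size are assumptions; build with gcc bind_region.c -lnuma); set_mempolicy takes the same mode and node mask but changes the default policy for the whole task:
/* bind_region.c: force one mapping's pages onto node 0 */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 16 * 1024 * 1024;
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    unsigned long nodemask = 1UL << 0;    /* one bit per node: node 0 only */
    if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
        return 1;
    }
    memset(buf, 0, len);                  /* pages now fault onto node 0 */

    /* set_mempolicy(MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8)
       would instead set a task-wide preference for node 0. */
    printf("region bound to node 0\n");
    return 0;
}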
NUMA Application Scenarios
Database Workloads
Databases benefit from NUMA by keeping hot data in local memory. Tools like numactl can enforce local-memory allocation, and configuring per-node buffer pools (e.g., multiple InnoDB buffer pool instances) can raise QPS by 30% or more and lower latency.
Virtualization
Virtual machines perform best when their vCPUs and memory are bound to the same physical NUMA node. In KVM, numactl or libvirt's numatune configuration can enforce this, improving VM CPU utilization by ~25% and reducing memory-access latency.
Hands-On Linux: Viewing and Configuring NUMA
5.1 View NUMA Topology
# View NUMA hardware information
numactl --hardware
Typical output for a dual-node system shows node IDs, CPU lists, memory sizes, free memory, and inter-node distances.
5.2 Check Process NUMA Affinity
# View the NUMA placement of the process with PID 1234
numastat -p 1234            # per-node memory usage of one process
cat /proc/1234/numa_maps    # per-mapping NUMA policy and placement
# View system-wide NUMA allocation statistics
numastat
Focus on numa_hit (allocations satisfied on the intended node) versus numa_miss and numa_foreign (allocations that spilled onto or away from a node) to spot imbalance.
5.3 Bind a Process to a Specific Node
# Bind at launch
numactl --cpunodebind=0 --membind=0 ./your_application
# Re-pin an existing process (PID 1234); numactl cannot retarget a running process
taskset -pc 0-7 1234        # pin its threads to CPUs 0-7 (node 0 here)
migratepages 1234 1 0       # move its pages from node 1 to node 0
# Bind to specific CPUs (here 0-7) within node 0
numactl --physcpubind=0-7 --membind=0 ./your_application
Java services can enable NUMA-aware heap allocation with JVM flags: java -XX:+UseNUMA -XX:+UseNUMAInterleaving -jar app.jar.
Common NUMA Issues
Cross‑node memory accesses cause latency spikes and low CPU utilization, especially on high‑core servers.
In virtualized environments, mismatched vCPU and vMemory placement creates “pseudo‑NUMA” problems.
Disabling NUMA in firmware (node interleaving) presents the machine as flat SMP: memory is striped across nodes, locality is lost, and contention on the shared memory path can worsen.