Unlocking Linux Performance: A Deep Dive into NUMA Architecture
This article explains the core principles of NUMA, its deep integration with the Linux kernel, practical memory‑node and scheduling mechanisms, real‑world database and virtualization use cases, and step‑by‑step commands for inspecting and tuning NUMA on modern servers.
What is NUMA?
NUMA (Non‑Uniform Memory Access) is a hardware architecture that groups CPUs and local memory into nodes, allowing fast local memory access while remote memory access incurs higher latency and lower bandwidth.
NUMA System Architecture Details
2.1 NUMA Architecture Definition
Unlike SMP, where all CPUs share a single memory bus, NUMA divides memory into multiple nodes, each directly attached to a subset of CPUs. This design reduces memory‑access conflicts and improves bandwidth scalability for multi‑core systems.
2.2 NUMA Architecture Components
Key components include processor cores, memory controllers, memory nodes, and the interconnect fabric (e.g., Intel QPI, AMD Infinity Fabric). Nodes are placed close to their CPUs to minimize latency, and high‑speed interconnects link the nodes.
2.3 How NUMA Works
When a core requests memory, the request is serviced from the core's local node if the physical address maps there (local access). Otherwise the request traverses the interconnect to a remote node (remote access), typically incurring 1.5-2× higher latency. Linux mitigates remote-access overhead with strategies such as memory prefetching, page migration, and load balancing.
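The latency gap is easy to observe from user space. Below is a minimal sketch, not a rigorous benchmark, that pins the calling thread to node 0 and times pointer-chasing over a buffer placed on node 0 versus one on node 1 using libnuma (build with gcc numa_latency.c -lnuma); the node numbers, buffer size, and stride are assumptions for a two-node machine.
/* numa_latency.c: rough local-vs-remote latency probe (assumes >= 2 nodes) */
#include <numa.h>
#include <stdio.h>
#include <time.h>

#define N (64 * 1024 * 1024 / sizeof(long))   /* 64 MiB of longs */

static double chase(long *buf) {
    /* Build a single-cycle stride pattern, then time dependent loads. */
    for (size_t i = 0; i < N; i++) buf[i] = (i + 4099) % N;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile long idx = 0;
    for (size_t i = 0; i < N; i++) idx = buf[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA machine with at least 2 nodes\n");
        return 1;
    }
    numa_run_on_node(0);                       /* execute on node 0 only */
    long *local  = numa_alloc_onnode(N * sizeof(long), 0);
    long *remote = numa_alloc_onnode(N * sizeof(long), 1);
    if (!local || !remote) return 1;
    printf("local : %.0f ns total\n", chase(local));
    printf("remote: %.0f ns total\n", chase(remote));
    numa_free(local,  N * sizeof(long));
    numa_free(remote, N * sizeof(long));
    return 0;
}
On typical two-socket hardware the remote pass should come out noticeably slower, broadly in line with the 1.5-2× figure above.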
NUMA Implementation in the Linux Kernel
3.1 Memory‑Node Partitioning
During boot, the kernel reads ACPI tables to discover the hardware topology, builds a NUMA node map, and populates pg_data_t structures that describe each node’s memory layout.
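The node map built at boot is exported to user space under /sys/devices/system/node and through the libnuma API. A minimal sketch that walks it (assuming libnuma is installed; build with gcc numa_topology.c -lnuma):
/* numa_topology.c: print the node map the kernel built from ACPI tables */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    int max = numa_max_node();
    for (int n = 0; n <= max; n++) {
        long long free_b;
        long long size_b = numa_node_size64(n, &free_b);   /* bytes */
        printf("node %d: %lld MiB total, %lld MiB free\n",
               n, size_b >> 20, free_b >> 20);
    }
    /* Inter-node distances from the ACPI SLIT: 10 = local, larger = farther */
    for (int a = 0; a <= max; a++) {
        for (int b = 0; b <= max; b++)
            printf("%4d", numa_distance(a, b));
        printf("\n");
    }
    return 0;
}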
3.2 Memory Allocation Strategies
Linux prefers local allocation (via alloc_pages_current and the zoned buddy allocator) and falls back to remote nodes only when local memory is exhausted. A memory-migration framework (automatic NUMA balancing) can move pages between nodes to balance load.
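This local-first preference can be observed directly: under the default policy, anonymous pages land on the node of the CPU that first touches them. The following sketch (node 0 is an assumption) faults pages from a node-0 CPU and then queries their placement with move_pages(2) in query mode (build with gcc first_touch.c -lnuma):
/* first_touch.c: verify that pages land on the faulting CPU's node */
#define _GNU_SOURCE
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    if (numa_available() < 0) return 1;
    numa_run_on_node(0);                  /* fault pages from a node-0 CPU */
    size_t len = 4 * 1024 * 1024;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;
    memset(buf, 1, len);                  /* first touch allocates the pages */

    /* move_pages() with nodes == NULL only reports where each page lives */
    long pagesz = sysconf(_SC_PAGESIZE);
    void *pages[8];
    int status[8];
    for (int i = 0; i < 8; i++) pages[i] = buf + i * pagesz;
    move_pages(0, 8, pages, NULL, status, 0);
    for (int i = 0; i < 8; i++)
        printf("page %d is on node %d\n", i, status[i]);
    return 0;
}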
3.3 Process Scheduling and Affinity
Each task carries a mems_allowed node mask indicating the nodes it may allocate from. System calls like sched_setaffinity and sched_getaffinity let users bind processes to specific CPUs (and thus, indirectly, to the nodes that contain them), while the kernel's automatic NUMA balancing dynamically migrates tasks and pages based on observed memory-access patterns.
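A minimal sketch of such pinning (node 0 is an assumption; libnuma is used only to translate the node into its CPU list before calling sched_setaffinity; build with gcc pin_to_node0.c -lnuma):
/* pin_to_node0.c: restrict the calling thread to the CPUs of node 0 */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) return 1;
    struct bitmask *cpus = numa_allocate_cpumask();
    numa_node_to_cpus(0, cpus);           /* which CPUs belong to node 0? */

    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned long i = 0; i < cpus->size; i++)
        if (numa_bitmask_isbitset(cpus, i))
            CPU_SET(i, &set);
    numa_free_cpumask(cpus);

    /* pid 0 means the calling thread */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("now restricted to node 0's CPUs\n");
    return 0;
}
Note that this only pins execution; keeping memory on the same node is the job of the policy calls described next.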
3.4 NUMA‑Aware System Calls
Calls such as get_mempolicy, set_mempolicy, and mbind allow applications to query or enforce memory policies, e.g., binding a memory region to a preferred node to reduce remote accesses.
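For example, a minimal sketch that binds one anonymous mapping to node 0 with mbind (the node choice and region size are assumptions; build with gcc bind_region.c -lnuma); set_mempolicy takes the same mode and node mask but changes the default policy for the whole task:
/* bind_region.c: force one mapping's pages onto node 0 */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 16 * 1024 * 1024;
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    unsigned long nodemask = 1UL << 0;    /* one bit per node: node 0 only */
    if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
        return 1;
    }
    memset(buf, 0, len);                  /* pages now fault onto node 0 */

    /* set_mempolicy(MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8)
       would instead set a task-wide preference for node 0. */
    printf("region bound to node 0\n");
    return 0;
}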
NUMA Application Scenarios
Database Workloads
Databases benefit from NUMA by keeping hot data in local memory. Tools like numactl can enforce local-memory allocation, and configuring per-node buffer pools (e.g., multiple InnoDB buffer pool instances) can raise QPS by 30% or more and lower latency.
Virtualization
Virtual machines perform best when their vCPUs and memory are bound to the same physical NUMA node. In KVM, numactl or libvirt's numatune configuration can enforce this, improving VM CPU utilization by ~25% and reducing memory-access latency.
Hands-On Linux: Viewing and Configuring NUMA
5.1 View NUMA Topology
# View NUMA hardware information
numactl --hardware
Typical output for a dual-node system shows node IDs, CPU lists, memory sizes, free memory, and inter-node distances.
5.2 Check Process NUMA Affinity
# View the NUMA placement of the process with PID 1234
numastat -p 1234            # per-node memory usage of one process
cat /proc/1234/numa_maps    # per-mapping NUMA policy and placement
# View system-wide NUMA allocation statistics
numastat
Focus on numa_hit (allocations satisfied on the intended node) versus numa_miss and numa_foreign (allocations that spilled onto or away from a node) to spot imbalance.
5.3 Bind a Process to a Specific Node
# Bind at launch
numactl --cpunodebind=0 --membind=0 ./your_application
# Re-pin an existing process (PID 1234); numactl cannot retarget a running process
taskset -pc 0-7 1234        # pin its threads to CPUs 0-7 (node 0 here)
migratepages 1234 1 0       # move its pages from node 1 to node 0
# Bind to specific CPUs (here 0-7) within node 0
numactl --physcpubind=0-7 --membind=0 ./your_application
Java services can enable NUMA-aware heap allocation with JVM flags: java -XX:+UseNUMA -XX:+UseNUMAInterleaving -jar app.jar.
Common NUMA Issues
Cross‑node memory accesses cause latency spikes and low CPU utilization, especially on high‑core servers.
In virtualized environments, mismatched vCPU and vMemory placement creates “pseudo‑NUMA” problems.
Disabling NUMA in firmware (node interleaving) presents the machine as flat SMP: memory is striped across nodes, locality is lost, and contention on the shared memory path can worsen.