
Mastering NUMA on Linux: Optimize Memory Allocation with numactl

This guide explains NUMA memory hierarchy, shows how to install and use the numactl command, interprets hardware and NUMA statistics, and presents memory allocation strategies to improve performance on multi‑node Linux systems.


Preparing the Environment

The examples assume Ubuntu 16.04 but should work on other Linux distributions as well. The test machine has 32 CPUs and 64 GB of RAM.

NUMA Storage Hierarchy

NUMA distinguishes four layers of memory access distance:

1. Processor layer: a single physical core.
2. Local node layer: all processors within the same node.
3. Home node layer: nodes adjacent to the local node.
4. Remote node layer: non-local, non-adjacent nodes.

Access latency increases with node distance, so keeping a process on a single CPU module can greatly improve performance.
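This latency ordering can be illustrated with the SLIT-style node distance matrix that <code>numactl --hardware</code> prints: by ACPI convention, 10 means local access and larger values mean more remote (higher latency) nodes. The 4-node matrix below is a hypothetical example, not measured data:

```python
# Hypothetical SLIT-style node distance matrix for a 4-node system:
# 10 = local access, larger values = more remote (higher latency).
DISTANCES = [
    [10, 16, 32, 33],
    [16, 10, 33, 32],
    [32, 33, 10, 16],
    [33, 32, 16, 10],
]

def nearest_nodes(node: int) -> list[int]:
    """Return the other nodes ordered from nearest to farthest."""
    row = DISTANCES[node]
    return sorted((n for n in range(len(row)) if n != node),
                  key=lambda n: row[n])

print(nearest_nodes(0))  # [1, 2, 3]: node 1 is adjacent, 2 and 3 are remote
```

A memory allocator that falls back from the local node would walk exactly this ordering.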

CPU chip composition (Kunpeng 920 example)

The Kunpeng 920 SoC groups six core clusters, two I/O clusters, and four DDR controllers into a single chip. Each chip integrates four 72‑bit DDR4 channels (up to 3200 MT/s), supporting up to 4 × 512 GB of DDR memory. The L3 cache is split into TAG and DATA parts: TAG resides in each core cluster to reduce latency, while DATA connects to the on‑chip bus. The Hydra Home Agent handles cache coherence across chips, and a GICD module provides interrupt distribution compatible with ARM GICv4. When multiple clusters exist, only one GICD is visible to the OS.

Using numactl

Install the numactl tool (not installed by default) on Ubuntu:

<code>sudo apt install numactl -y</code>

Check the manual with <code>man numactl</code> or <code>numactl --help</code>. View the system's NUMA configuration:

<code>numactl --hardware</code>

Sample output shows four nodes, each with eight CPUs and about 16 GB of memory, plus L3 cache allocation per node.
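If you need this information programmatically, the output is easy to parse. The snippet below works on a hand-written sample consistent with the machine described above (real <code>numactl --hardware</code> output varies by system; the sizes here are illustrative):

```python
# Parse a representative `numactl --hardware` snippet to summarize
# CPUs and memory per node. SAMPLE is hand-written example output.
SAMPLE = """\
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16125 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 16384 MB
"""

def parse_hardware(text: str) -> dict[int, dict]:
    nodes: dict[int, dict] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "node":
            info = nodes.setdefault(int(parts[1]), {})
            if parts[2] == "cpus:":
                info["cpus"] = [int(c) for c in parts[3:]]
            elif parts[2] == "size:":
                info["size_mb"] = int(parts[3])
    return nodes

summary = parse_hardware(SAMPLE)
print(summary[0]["cpus"])     # CPUs 0-7 belong to node 0
print(summary[1]["size_mb"])  # roughly 16 GB on node 1
```

On a live system you would feed it the result of running the command, e.g. via <code>subprocess.check_output(["numactl", "--hardware"], text=True)</code>.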

The <code>numastat</code> command reports statistics such as <code>numa_hit</code>, <code>numa_miss</code>, <code>numa_foreign</code>, <code>interleave_hit</code>, <code>local_node</code>, and <code>other_node</code>. A high <code>numa_miss</code> indicates the need to adjust allocation policies, for example by binding processes to specific CPUs.

<code>root@ubuntu:~# numastat
                               node0           node1           node2           node3
numa_hit               19480355292    11164752760    12401311900    12980472384
numa_miss                  5122680       122652623       88449951           7058
numa_foreign           122652643       88449935           7055        5122679
interleave_hit               12619           13942           14010           13924
local_node           19480308881    11164721296    12401264089    12980411641
other_node                5169091       122684087       88497762           67801</code>

NUMA Memory Allocation Strategies

Common options for <code>numactl</code>:

<code>--localalloc</code> or <code>-l</code>: allocate memory from the local node.

<code>--membind=nodes</code> or <code>-m nodes</code>: restrict allocation to the specified nodes.

<code>--preferred=node</code>: prefer a node, falling back to others if it is unavailable.

<code>--interleave=nodes</code> or <code>-i nodes</code>: allocate memory round‑robin across nodes.

<code>numactl --interleave=all mongod -f /etc/mongod.conf</code>
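The effect of <code>--interleave=all</code> can be sketched by simulating round-robin page placement across nodes (hypothetical page counts, purely for illustration):

```python
from collections import Counter

def interleave(pages: int, nodes: list[int]) -> Counter:
    """Place pages round-robin across nodes, as --interleave does."""
    placement: Counter = Counter()
    for page in range(pages):
        placement[nodes[page % len(nodes)]] += 1
    return placement

# 1000 pages across 4 nodes end up evenly spread: 250 per node.
print(interleave(1000, [0, 1, 2, 3]))
```

This even spread is why interleaving suits workloads like MongoDB whose memory footprint exceeds a single node: no one node fills up (and starts swapping) while the others sit idle.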

Because the default NUMA policy prefers local memory, imbalance can cause swap usage on a node with insufficient memory, leading to the “swap insanity” phenomenon and severe performance degradation. Operators should monitor NUMA memory distribution and tune system parameters (e.g., memory reclaim, swap tendency) to avoid excessive swapping.

Node → Socket → Core → Processor

Modern CPUs are packaged into sockets; each socket contains multiple cores, and hyper‑threading creates logical processors (threads). In terminology, a socket corresponds to a NUMA node, a core is a physical CPU, and a thread is a logical CPU (processor).
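Given this mapping, the logical CPU count follows directly from the topology. A minimal sketch using the test machine's numbers, and assuming (as on this machine) one NUMA node per socket with contiguously numbered processors:

```python
# Topology of the test machine: 4 sockets, 8 cores per socket,
# 1 hardware thread per core (no hyper-threading).
sockets, cores_per_socket, threads_per_core = 4, 8, 1

logical_cpus = sockets * cores_per_socket * threads_per_core
print(logical_cpus)  # 32, matching CPU(s) in lscpu

# With one node per socket and contiguous numbering, processor p
# belongs to node p // (cores_per_socket * threads_per_core).
def node_of(processor: int) -> int:
    return processor // (cores_per_socket * threads_per_core)

print(node_of(0), node_of(9), node_of(31))  # 0 1 3
```

Note that the contiguous-numbering assumption does not hold everywhere: some systems number sibling hyper-threads far apart, so always confirm with <code>lscpu</code> rather than arithmetic alone.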

Using lscpu

Typical output fields:

Architecture

CPU(s): logical CPU count

Thread(s) per core

Core(s) per socket

Socket(s)

L1d cache, L1i cache, L2 cache, L3 cache

NUMA node0 CPU(s), etc.

Example:

<code>root@ubuntu:~# lscpu
Architecture:          x86_64
CPU(s):                32
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):              4
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31</code>

Preview

Next, we will discuss how binding CPUs to processes can further boost program performance.

Tags: system architecture, performance tuning, Linux, NUMA, numactl
Written by Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
