Mastering Linux Memory: Reclaim, Huge Pages, and NUMA Optimization
This article explains common Linux memory‑related performance bottlenecks—such as memory reclamation, page‑cache pressure, huge‑page usage, and cross‑NUMA access—and provides practical tuning methods to improve latency and throughput on servers and applications.
Introduction
Performance problems often manifest as slow UI response on phones or missed service‑level objectives on servers. In Linux, memory is a primary factor, with issues like memory reclaim, increased page faults, and cross‑NUMA accesses degrading user‑visible performance.
Memory Reclamation
The kernel caches disk data in page cache to speed up reads. When memory is scarce, it reclaims this cache, which can cause noticeable latency if the reclaimed pages are needed again.
Memory reclamation operates at two levels: the whole system and memory cgroups.
Per‑zone Watermarks
Each zone has three watermarks (min, low, high) that drive reclamation. When free memory drops below low, kswapd reclaims asynchronously in the background until free memory rises back above high. When it drops below min, allocations fall into direct reclaim and block, increasing latency and risking OOM.
The watermarks can be tuned via /proc/sys/vm/watermark_scale_factor, whose valid range is 0-1000 (default 10, i.e. 0.1% of zone memory). Raising it widens the gap between the watermarks, which helps page-cache-heavy workloads by triggering asynchronous reclaim earlier.
Figure 1. per‑zone watermark
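To get a feel for the scale, the gap between successive watermarks is roughly zone_managed_pages * watermark_scale_factor / 10000. A quick sketch for a hypothetical 16 GiB zone with 4 KiB pages:

```shell
# Gap between watermarks ~= managed_pages * watermark_scale_factor / 10000
# (each unit of the factor is 0.01% of the zone's memory).
pages=$((16 * 1024 * 1024 * 1024 / 4096))   # 16 GiB zone in 4 KiB pages
factor=10                                   # the default value
gap_pages=$((pages * factor / 10000))
echo "$gap_pages pages ($((gap_pages * 4096 / 1024 / 1024)) MiB)"
# -> 4194 pages (16 MiB)
```

So on a 16 GiB zone the default leaves only about 16 MiB of headroom between watermarks; raising the factor to 100 would widen it to roughly 160 MiB, giving kswapd more runway before allocations stall.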
Memory cgroup Reclaim
When a memory cgroup reaches its limit, allocation blocks. Since kernel 5.19, the memory.reclaim interface allows userspace to request early reclamation, reducing the chance of blocking.
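A minimal sketch of proactive reclaim on a cgroup-v2 hierarchy, assuming it is mounted at /sys/fs/cgroup and using a hypothetical group named myapp:

```shell
# Proactively reclaim up to 512 MiB from the (hypothetical) "myapp" cgroup.
# Writing to memory.reclaim (kernel >= 5.19) triggers reclaim without
# changing the group's limits; if less than the requested amount could be
# reclaimed, the write fails with EAGAIN.
echo "512M" > /sys/fs/cgroup/myapp/memory.reclaim

# Check the group's usage afterwards
cat /sys/fs/cgroup/myapp/memory.current
```

Running this periodically from a userspace daemon keeps the group well below its limit, so allocations are less likely to stall in direct reclaim at an inconvenient moment.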
Huge Pages
Linux allocates memory lazily: the first access to a page triggers a page fault that allocates a 4 KB page. Huge pages (2 MB on x86-64) reduce the fault count and TLB pressure, which can dramatically speed up allocation and address translation, though they increase initialization cost and memory usage.
Page-fault rates can be measured with perf stat -e page-faults -p <pid> -- sleep 5, which counts the faults taken by the target process over five seconds.
Static Huge Pages
Static huge pages (HugeTLB) are reserved at boot via kernel cmdline, e.g., hugepagesz=2M hugepages=512 , or dynamically via /proc/sys/vm/nr_hugepages and /sys/kernel/mm/hugepages interfaces.
<code>echo 20 > /proc/sys/vm/nr_hugepages</code>
Applications can allocate them with mmap(MAP_HUGETLB) or link against libhugetlbfs to avoid code changes.
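A sketch of reserving static huge pages at runtime and verifying the result; the hugetlbfs mount point /mnt/huge is an assumption for illustration:

```shell
# Reserve 512 x 2 MiB huge pages (1 GiB total). The reservation can
# partially fail if physical memory is fragmented, so always verify.
echo 512 > /proc/sys/vm/nr_hugepages
grep -E 'HugePages_(Total|Free)|Hugepagesize' /proc/meminfo

# Optionally expose the pool through a hugetlbfs mount, so file-backed
# mappings can use huge pages (the mount point is illustrative)
mkdir -p /mnt/huge
mount -t hugetlbfs none /mnt/huge
```

If HugePages_Total comes back lower than requested, the reservation should be retried early after boot, before memory fragments.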
Drawbacks: explicit reservation, potential OOM if over‑reserved, and higher memory consumption.
Transparent Huge Pages (THP)
In THP always mode, the kernel tries to allocate a huge page on each fault; if that fails, it falls back to 4 KB pages, which the khugepaged thread may later merge into huge pages. THP can instead be set to madvise mode, where only regions explicitly marked with madvise(MADV_HUGEPAGE) are eligible.
<code>echo madvise > /sys/kernel/mm/transparent_hugepage/enabled</code>
THP may increase memory usage, cause reclamation spikes, and hold long-lasting write locks on mmap_lock, affecting performance.
mmap_lock
mmap_lock protects a process's core memory-management structures. Write-lock contention can arise in mmap/munmap, mremap, and THP merging. Freeing memory with madvise(MADV_DONTNEED) or madvise(MADV_FREE) instead of munmap takes mmap_lock only in read mode, reducing write-lock contention.
Processes stuck in uninterruptible sleep (D state), often a symptom of lock or reclaim stalls, can be listed together with their kernel stacks:
<code>for i in `ps aux | awk '$8 ~ /^D/ { print $2 }'`; do echo $i; cat /proc/$i/stack; done</code>
Write-lock acquisitions can be traced with bpftrace:
<code>bpftrace -e 'tracepoint:mmap_lock:mmap_lock_start_locking /args->write == true/ { @[comm, kstack] = count(); }'</code>
Cross-NUMA Memory Access
Local node memory access is faster than remote. Use numastat to monitor remote accesses and watch -n 1 numastat -s for live view.
<code>watch -n 1 numastat -s</code>
Node Binding
Bind a process to a specific node and its CPUs with numactl to force local memory allocation, though this can limit memory availability and create CPU bottlenecks.
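For example, assuming a two-node machine and a placeholder binary ./app:

```shell
# Run the workload with both its CPUs and its memory confined to NUMA
# node 0, so every allocation is node-local ("./app" is a placeholder)
numactl --cpunodebind=0 --membind=0 ./app

# Inspect the per-node memory placement of the running process
numastat -p $(pidof app)
```

With --membind, allocations fail rather than spill to another node once node 0 is exhausted; --preferred=0 is the softer alternative that allows remote fallback.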
NUMA Balancing
Enable kernel‑wide automatic page migration via /proc/sys/kernel/numa_balancing or the numa_balancing= cmdline flag. Migration incurs page‑fault overhead and may increase cache misses.
Enable it only after confirming the workload benefits.
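A sketch of toggling and observing NUMA balancing via its procfs knob and counters:

```shell
# Check whether automatic NUMA balancing is active (1 = enabled)
cat /proc/sys/kernel/numa_balancing

# Enable it: the kernel will periodically unmap pages and, on the
# resulting hinting faults, migrate them toward the faulting task's node
echo 1 > /proc/sys/kernel/numa_balancing

# Hinting-fault and migration activity shows up in /proc/vmstat
grep -E 'numa_(hint_faults|pages_migrated)' /proc/vmstat
```

If numa_hint_faults climbs steadily while numa_pages_migrated stays flat, the workload is paying the fault overhead without gaining locality, which is a sign to turn the feature back off.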
Conclusion
Memory tuning involves trade‑offs; no single setting fits all workloads. Analyze specific bottlenecks before applying reclamation thresholds, huge‑page policies, or NUMA optimizations, and avoid aggressive changes when performance is stable.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.