
Master Linux Memory Performance: From Theory to Real‑World Optimization

This article systematically breaks down Linux's core memory mechanisms, identifies common performance bottlenecks, and demonstrates how to use tools like numastat, perf, and Valgrind together with kernel parameters such as swappiness and min_free_kbytes to achieve practical memory optimizations.


1. Linux Memory Fundamentals

Understanding the virtual memory layout, VMA (Virtual Memory Area), memory mapping (mmap), huge pages, and memory fragmentation is essential for any memory performance work. Physical memory is the actual RAM, while virtual memory gives each process the illusion of a large, contiguous address space managed by the kernel.

1.1 Virtual Memory Space

In a 64‑bit system the virtual address space can span up to 2^64 bytes, but current hardware typically implements only 48 address bits. The space is divided into user space (code, data, heap, stack) and kernel space.

.text : read‑only executable code.

.data : initialized global/static variables.

.bss : uninitialized globals, zero‑filled at load time.

Heap : dynamic allocations via malloc / free.

Stack : function call frames, grows downward.
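
These segments can be observed directly. Below is a minimal sketch (addresses will differ between runs because of ASLR) that prints one address from each region:

#include <stdio.h>
#include <stdlib.h>

int initialized_global = 1;   /* lives in .data */
int uninitialized_global;     /* lives in .bss, zero-filled at load time */

int main(void) {
    int on_stack = 0;                          /* stack */
    int *on_heap = malloc(sizeof *on_heap);    /* heap  */

    printf(".text  %p\n", (void *)main);
    printf(".data  %p\n", (void *)&initialized_global);
    printf(".bss   %p\n", (void *)&uninitialized_global);
    printf("heap   %p\n", (void *)on_heap);
    printf("stack  %p\n", (void *)&on_stack);

    free(on_heap);
    return 0;
}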

1.2 VMA (Virtual Memory Area)

A VMA describes a contiguous region of a process's virtual address space, including its start/end addresses, permission flags, and associated file (if any). The /proc/<pid>/maps file lists all VMAs for a process.
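
A process can also walk its own mappings programmatically; a small sketch that simply dumps /proc/self/maps:

#include <stdio.h>

int main(void) {
    /* Each line is one VMA: start-end, permissions, file offset,
       device, inode, and the backing file (if any) */
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL) {
        perror("fopen /proc/self/maps");
        return 1;
    }
    char line[512];
    while (fgets(line, sizeof line, maps) != NULL)
        fputs(line, stdout);
    fclose(maps);
    return 0;
}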

1.3 Memory Mapping (mmap)

mmap maps a file or device directly into a process's address space, avoiding extra copies between user and kernel buffers. The core steps are:

Create a virtual address range and a vm_area_struct describing it.

Establish the page‑table mapping; actual data is loaded lazily.

On first access, a page‑fault loads the required page (lazy loading).

For MAP_SHARED, modified pages are marked dirty and written back; for MAP_PRIVATE, copy‑on‑write creates a private copy.

Typical uses include large‑file processing, inter‑process shared memory, and zero‑copy I/O.
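
As an illustration of the large‑file case, here is a minimal sketch (the file name input.dat is hypothetical) that maps a file and scans it without any read() copies:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("input.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only; pages are faulted in lazily on first access */
    const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    size_t newlines = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (data[i] == '\n')
            newlines++;
    printf("lines: %zu\n", newlines);

    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}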

1.4 Huge Pages

Huge pages (2 MiB or 1 GiB) reduce the number of page‑table entries and improve TLB hit rates. They are especially beneficial for databases and high‑performance computing.

#include <stdio.h>
#include <sys/mman.h>

void *mmap_huge_page(void) {
    /* Request one anonymous 2 MiB huge page; huge pages must be reserved
       beforehand (e.g., via vm.nr_hugepages), otherwise the call fails */
    void *addr = mmap(NULL, 2 * 1024 * 1024,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                      -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap_huge_page failed");
        return NULL;
    }
    return addr;
}

1.5 Memory Fragmentation

External fragmentation leaves small free gaps that cannot satisfy large allocations; internal fragmentation wastes space inside allocated blocks. Frequent allocations and frees, or fixed‑size allocators, are common causes. Tools such as /proc/meminfo and Valgrind can help detect fragmentation.
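
Beyond those tools, /proc/buddyinfo shows how many free blocks remain at each allocation order in every zone, which makes external fragmentation directly visible:

cat /proc/buddyinfo
# Few or zero entries in the high-order (rightmost) columns mean large
# contiguous allocations are likely to fail even when MemFree looks healthy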

2. Locating Memory Performance Bottlenecks

Accurate bottleneck identification relies on tools such as numastat, perf, and Valgrind, combined with kernel parameters such as swappiness and min_free_kbytes.

2.1 numastat – NUMA Access Insight

Install and run:

sudo apt-get install numactl   # Debian/Ubuntu (numastat ships with numactl)
sudo yum install numactl       # RHEL/CentOS
numastat -p <PID>

Key fields:

Numa_Hit : memory successfully allocated on the intended (local) node.

Numa_Miss : memory allocated on this node although another node was preferred; high values indicate cross‑node traffic that can cost 20‑30% of performance.

Numa_Foreign : memory intended for this node that had to be allocated on another node.

2.2 perf – Full‑Stack Profiling

Example to measure cache behavior:

perf stat -e cache-references,cache-misses ./your_program

Typical output shows cache‑miss rate (e.g., 28%). For hotspot analysis:

perf record -e LLC-load-misses ./your_program
perf report

2.3 Valgrind – Memory Debugging

Compile with debugging symbols:

g++ -g -O0 -Wall main.cpp -o myapp

Run memcheck:

valgrind --tool=memcheck --leak-check=full ./myapp

Valgrind reports leaks, e.g., "40 bytes in 1 blocks are definitely lost".
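
A minimal program that would produce exactly this kind of report (assuming it is saved as main.cpp and built with the command above):

#include <stdlib.h>

int main(void) {
    /* 40 bytes allocated and never freed: at exit no pointer to the block
       remains, so memcheck reports it as "definitely lost" */
    void *leak = malloc(40);
    (void)leak;
    return 0;
}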

3. Core Memory Performance Bottlenecks

3.1 Page Faults

Major page faults require disk I/O; minor faults are resolved from data already in memory, so only the page tables need updating. Frequent major faults push access latency from nanoseconds to milliseconds and raise CPU usage.

Mitigations include adding RAM, improving data locality, and using huge pages.
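
A quick way to check whether a workload is fault‑bound from inside the program itself is to read its own fault counters; a small sketch using getrusage:

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    /* ... run the workload of interest here ... */

    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        /* ru_minflt: faults served from memory; ru_majflt: faults that required I/O */
        printf("minor faults: %ld, major faults: %ld\n",
               ru.ru_minflt, ru.ru_majflt);
    }
    return 0;
}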

3.2 Cross‑NUMA Access

NUMA nodes have local memory with low latency; remote accesses are slower. Causes include non‑NUMA‑aware schedulers and memory allocators. Mitigation strategies:

Bind threads to nodes with numactl --membind=0 --cpunodebind=0 ./app or taskset -c 0-7 ./app.

Use NUMA‑aware allocation functions such as numa_alloc_onnode() (see the sketch after this list).

Enable the kernel's automatic NUMA balancing via echo 1 > /proc/sys/kernel/numa_balancing.
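
The numa_alloc_onnode() mentioned above comes from libnuma (link with -lnuma); a minimal sketch that keeps a buffer on node 0 (the node number and buffer size are illustrative):

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    size_t size = 64UL * 1024 * 1024;
    /* Back the buffer with pages on node 0 so threads bound to node 0
       (e.g., via numactl --cpunodebind=0) access it locally */
    void *buf = numa_alloc_onnode(size, 0);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    /* ... use buf ... */

    numa_free(buf, size);
    return 0;
}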

3.3 mmap_lock Contention

The kernel serializes changes to a process's address space (its VMA tree) with a per‑process read‑write semaphore, mmap_lock (formerly mmap_sem). High contention slows mmap/munmap/mprotect calls and adds context‑switch overhead.

Optimization tips:

Reduce the number of mmap/munmap operations; map once and reuse.

Hold the lock only briefly; perform heavy processing after releasing it.

Use finer‑grained locks for separate memory regions.

// Example: map once and reuse
#include <fcntl.h>
#include <sys/mman.h>

int fd = open("datafile", O_RDONLY);
void *map_base = mmap(NULL, 100UL * 1024 * 1024, PROT_READ, MAP_SHARED, fd, 0);
// Keep reusing map_base instead of calling mmap/munmap on every access

3.4 Memory Reclamation (kswapd)

When free memory falls below the low watermark, the kernel thread kswapd reclaims file cache and anonymous pages. Frequent reclamation raises CPU usage and can cause memory “thrashing”. Adjusting vm.min_free_kbytes and vm.watermark_scale_factor can reduce wake‑ups.

# Reduce reclamation frequency
echo 524288 > /proc/sys/vm/min_free_kbytes   # 0.5 GB
echo 5 > /proc/sys/vm/watermark_scale_factor
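
To confirm that kswapd is actually the source of the overhead, its reclaim counters in /proc/vmstat can be sampled before and after a slowdown:

grep -E 'pgscan_kswapd|pgsteal_kswapd' /proc/vmstat
# Rapidly growing counters mean kswapd is scanning and reclaiming pages frequently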

3.5 Memory Leaks and Overflows

Leaks (unreleased malloc / mmap memory) gradually consume RAM, eventually leading to OOM situations. An overflow in this sense means memory exhaustion: allocation requests can no longer be satisfied. Detection tools:

Valgrind memcheck to find leaks.

top / ps to monitor memory growth over time.

pmap to inspect a process's address space.

perf to profile allocation hotspots.

Best practices: always pair allocation with deallocation, set limits on dynamic containers, and consider memory pools.

4. OOM Killer Self‑Protection Mechanism

4.1 Trigger Conditions

When the system cannot satisfy a memory allocation even after reclaiming caches and swapping, the OOM Killer is invoked (unless vm.panic_on_oom forces a panic).

4.2 Selection Logic

Each user process receives an oom_badness score: badness = (RSS / memory_limit) * 1000 + oom_score_adj. The highest‑scoring process receives SIGKILL. Critical services set a negative oom_score_adj to avoid being killed; the minimum value of -1000 exempts a process from the OOM killer entirely.
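
For example, a critical daemon can be exempted from OOM selection through its /proc entry (replace <PID> with the actual process ID):

# -1000 is the minimum value and removes the process from OOM-killer consideration
echo -1000 > /proc/<PID>/oom_score_adj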

4.3 Mitigation Strategies

Application level: avoid leaks, tune the JVM heap (-Xms, -Xmx), and release resources promptly.

System level: adjust vm.min_free_kbytes (0.5‑1 % of total RAM), configure swap size, and use cgroups to limit per‑process memory.

# Example cgroup v1 limit (cgroup v2 uses the memory.max file instead)
echo 512M > /sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes

Increasing swap space reduces the chance of OOM events, and lowering vm.swappiness (e.g., to 10‑20) makes the kernel prefer reclaiming file cache over swapping out anonymous pages, which cuts down on major page faults.

echo 10 > /proc/sys/vm/swappiness

Regularly restart long‑running services to clear accumulated leaks and monitor memory usage with the tools described above.
