Master Linux Memory Debugging with BPFTrace: Leaks, OOM, and More

This article explains common memory errors, lists kernel and user‑space event sources for tracing memory activity, and demonstrates how to use BPFTrace, kprobes, and USDT probes to detect out‑of‑bounds accesses, use‑after‑free, memory leaks, OOM killer events, mmap/brk usage, shared memory operations, page faults, and reclaim processes on Linux.

Open Source Linux

Memory Detection

Common memory errors include out‑of‑bounds access, use‑after‑free, double free, memory leaks, and stack overflow.

Event sources for tracking memory activity

Event type                        Event source
User‑space memory allocation      uprobes on allocator functions, USDT probes in libc
Kernel‑space memory allocation    kprobes on allocator functions, kmem tracepoints
Heap expansion                    brk system‑call tracepoint
Shared‑memory functions           system‑call tracepoints
Page faults                       kprobes, software events, exception tracepoints
Page migration                    migration tracepoints
Page compaction                   compaction tracepoints
VM scanner                        vmscan tracepoints
Memory access cycles              PMCs

Processes that use the libc allocator obtain and release memory through functions such as malloc() and free(). libc also embeds a number of USDT probes, which can be used to monitor allocator behavior without modifying the application.

Available USDT probes in libc:

# sudo bpftrace -l usdt:/lib/x86_64-linux-gnu/libc-2.31.so
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:setjmp
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:longjmp
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:longjmp_target
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:lll_lock_wait_private
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_mallopt_arena_max
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_mallopt_arena_test
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_tunable_tcache_max_bytes
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_tunable_tcache_count
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_tunable_tcache_unsorted_limit
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_mallopt_trim_threshold
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_mallopt_top_pad
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_mallopt_mmap_threshold
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_mallopt_mmap_max
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_mallopt_perturb
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_mallopt_mxfast
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_heap_new
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_arena_reuse_free_list
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_arena_reuse
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_arena_reuse_wait
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_arena_new
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_arena_retry
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_sbrk_less
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_heap_free
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_heap_less
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_tcache_double_free
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_heap_more
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_sbrk_more
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_malloc_retry
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_memalign_retry
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_mallopt_free_dyn_thresholds
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_realloc_retry
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_calloc_retry
usdt:/lib/x86_64-linux-gnu/libc-2.31.so:libc:memory_mallopt

oomkill

Use kprobes to trace the oom_kill_process() function and read /proc/loadavg to obtain load‑average information, which provides context about system load when an OOM event occurs.

static void oom_kill_process(struct oom_control *oc, const char *message)
# cat /proc/loadavg 
0.05 0.10 0.13 1/875 23359
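To make the /proc/loadavg line above easier to read alongside an OOM event, the following is a minimal Python sketch that splits it into named fields. The LoadAvg type and parse_loadavg() helper are hypothetical names introduced here for illustration; the field meanings follow the proc(5) manual page.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    load1: float    # 1-minute load average
    load5: float    # 5-minute load average
    load15: float   # 15-minute load average
    runnable: int   # currently runnable scheduling entities
    total: int      # scheduling entities that currently exist
    last_pid: int   # PID of the most recently created process


def parse_loadavg(line: str) -> LoadAvg:
    """Split one /proc/loadavg line into named fields (hypothetical helper)."""
    load1, load5, load15, entities, last_pid = line.split()
    runnable, total = entities.split("/")
    return LoadAvg(float(load1), float(load5), float(load15),
                   int(runnable), int(total), int(last_pid))


if __name__ == "__main__":
    # Sample line copied from the article. On a Linux host the live value
    # could be read with: open("/proc/loadavg").read()
    print(parse_loadavg("0.05 0.10 0.13 1/875 23359"))
```

For the sample line, the three load averages are 0.05, 0.10, and 0.13, one of 875 scheduling entities is runnable, and the most recently created PID is 23359.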

memleak

memleak tracks allocation and free events together with their call‑stack information, showing long‑lived allocations over time.

For user‑space processes, memleak monitors allocator functions such as malloc(), calloc(), and free(). For kernel‑space memory, it uses tracepoints such as:

kmem:kfree                     [Tracepoint event]
kmem:kmalloc                   [Tracepoint event]
kmem:kmalloc_node              [Tracepoint event]
kmem:kmem_cache_alloc          [Tracepoint event]
kmem:kmem_cache_alloc_node     [Tracepoint event]
kmem:kmem_cache_free           [Tracepoint event]
kmem:mm_page_alloc             [Tracepoint event]
kmem:mm_page_free              [Tracepoint event]
percpu:percpu_alloc_percpu      [Tracepoint event]
percpu:percpu_free_percpu      [Tracepoint event]

Example to simulate a leak:

Write a C program (code below) that repeatedly allocates memory without freeing it.

Compile the program with pthread support (for example, using gcc with the -pthread option) and run it.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>

long long *fibonacci(long long *n0, long long *n1) {
    // allocate 1024 long integers for observation
    long long *v = (long long *)calloc(1024, sizeof(long long));
    *v = *n0 + *n1;
    return v;
}

void *child(void *arg) {
    long long n0 = 0;
    long long n1 = 1;
    long long *v = NULL;
    int n = 2;
    for (n = 2; n > 0; n++) {
        v = fibonacci(&n0, &n1);
        n0 = n1;
        n1 = *v;
        printf("%dth => %lld\n", n, *v);
        sleep(1);   // v is never freed: this is the intentional leak
    }
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, child, NULL);
    pthread_join(tid, NULL);
    printf("main thread exit\n");
    return 0;
}

Run vmstat 3 in another terminal to observe memory statistics, then execute the program and locate its PID (e.g., with ps aux | grep app). Finally, run:

sudo /usr/sbin/memleak-bpfcc -p <PID>

The output shows the leak locations (e.g., fibonacci+0x23 [leak], child+0x5a [leak]), indicating that the buffer allocated in fibonacci() and held in v was never freed.

After fixing the code to free the allocated memory, repeat the steps; the leak report disappears, confirming the fix.

Note: memleak alone cannot distinguish true leaks from normal long‑lived allocations; the reported call stacks still have to be analyzed to decide which is which.

If the -p PID argument is omitted, memleak tracks kernel‑wide allocation events instead.
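The bookkeeping behind this kind of report can be modeled in a few lines. The sketch below is a toy illustration of the idea (remember each allocation by address, forget it on free, and group whatever remains by call stack); it is not the memleak implementation, and the LeakTracker class, its method names, and the addresses used are all hypothetical.

```python
from collections import defaultdict


class LeakTracker:
    """Toy model of leak accounting: allocations that have not been freed."""

    def __init__(self):
        self.live = {}  # address -> (size, call stack)

    def on_alloc(self, addr, size, stack):
        self.live[addr] = (size, tuple(stack))

    def on_free(self, addr):
        # Frees of unknown addresses (allocated before tracing began) are ignored.
        self.live.pop(addr, None)

    def outstanding(self):
        """Return [(stack, total_bytes, count)], largest total first."""
        totals = defaultdict(lambda: [0, 0])
        for size, stack in self.live.values():
            totals[stack][0] += size
            totals[stack][1] += 1
        rows = [(stack, b, c) for stack, (b, c) in totals.items()]
        return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    tracker = LeakTracker()
    # Made-up addresses; 8192 bytes matches calloc(1024, sizeof(long long)).
    for i in range(3):
        tracker.on_alloc(0x1000 + i * 0x2000, 8192, ["fibonacci", "child"])
    tracker.on_alloc(0x9000, 64, ["main"])
    tracker.on_free(0x9000)  # a balanced allocation drops out of the report
    for stack, nbytes, count in tracker.outstanding():
        print(f"{nbytes} bytes in {count} allocations from {' <- '.join(stack)}")
```

A stack that keeps accumulating bytes across successive reports is a leak candidate, while one whose total stays flat is more likely a long‑lived allocation.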

mmapsnoop

Use the syscalls:sys_enter_mmap tracepoint to monitor the mmap system call across the whole system and print detailed mapping requests.

syscalls:sys_enter_mmap                [Tracepoint event]

Applications that allocate large memory regions often use mmap() (e.g., libc for large allocations) and later release them with munmap().
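To have something to observe while tracing mmap, a small workload helps. The following Python sketch creates an anonymous mapping, touches it, and releases it again; the map_and_release() helper and the 64 MiB size are assumptions chosen for illustration, not part of any tool described here.

```python
import mmap
import os
import time

SIZE = 64 * 1024 * 1024  # 64 MiB, an arbitrary size chosen for illustration


def map_and_release(size: int = SIZE, hold_seconds: float = 0.0) -> int:
    """Create an anonymous mapping, touch it, then unmap it.

    Returns the length of the mapping that was created.
    """
    region = mmap.mmap(-1, size)          # fileno -1 requests anonymous memory
    try:
        region[0:1] = b"\x01"             # touch the first page
        region[size - 1:size] = b"\x01"   # touch the last page
        time.sleep(hold_seconds)          # optional pause while a tracer runs
        return len(region)
    finally:
        region.close()                    # releases the mapping


if __name__ == "__main__":
    print(f"PID {os.getpid()} mapping {SIZE} bytes")
    print(f"mapped and released {map_and_release()} bytes")
```

Running this while the tracepoint is being traced should produce at least one mapping request of the chosen size from the Python process, alongside the interpreter's own startup mappings.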

brkstack

Heap memory is typically expanded via the brk system call. Tracing brk (or its library wrapper sbrk) reveals which user‑space call stacks cause heap growth.

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_brk { @[ustack, comm] = count(); }'

shmsnoop

shmsnoop tracks System V shared‑memory syscalls (shmget, shmat, shmdt, shmctl) to debug shared‑memory usage.

Example: a renderer process calls shmget and receives identifier 0x28, which is then shared with the Xorg process.

shmget() obtains or creates a shared‑memory segment.

asmlinkage long sys_shmget(key_t key, size_t size, int flag);
SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg) {
    return ksys_shmget(key, size, shmflg);
}

shmat() attaches the segment to the calling process's address space.

asmlinkage long sys_shmat(int shmid, char __user *shmaddr, int shmflg);
SYSCALL_DEFINE3(shmat, int, shmid, char __user *, shmaddr, int, shmflg) {
    unsigned long ret;
    long err;
    err = do_shmat(shmid, shmaddr, shmflg, &ret, SHMLBA);
    if (err)
        return err;
    force_successful_syscall_return();
    return (long)ret;
}

shmdt() detaches the segment.

asmlinkage long sys_shmdt(char __user *shmaddr);
SYSCALL_DEFINE1(shmdt, char __user *, shmaddr) {
    return ksys_shmdt(shmaddr);
}

shmctl() performs control operations on a shared‑memory segment.

asmlinkage long sys_shmctl(int shmid, int cmd, struct shmid_ds __user *buf);
SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, struct shmid_ds __user *, buf) {
    return ksys_shmctl(shmid, cmd, buf, IPC_64);
}

faults

Tracing page‑fault events (page_fault_user and page_fault_kernel) provides insight into memory‑usage growth, because a fault that maps a new page increases the process's RSS.

exceptions:page_fault_user               [Tracepoint event]
exceptions:page_fault_kernel             [Tracepoint event]
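The link between first‑touch accesses and page faults can also be seen from inside a process, without any tracer. The sketch below is a rough illustration, assuming a Unix system where Python's resource module is available; the faults_while_touching() helper is a hypothetical name. It compares the process's fault counters before and after writing to every page of a fresh anonymous mapping.

```python
import mmap
import resource


def faults_while_touching(size: int, page: int = mmap.PAGESIZE):
    """Touch every page of a new anonymous mapping.

    Returns (minor_faults, major_faults) charged to this process meanwhile.
    """
    before = resource.getrusage(resource.RUSAGE_SELF)
    region = mmap.mmap(-1, size)      # anonymous mapping, not yet populated
    try:
        for offset in range(0, size, page):
            region[offset] = 1        # first write to a page triggers a fault
        after = resource.getrusage(resource.RUSAGE_SELF)
    finally:
        region.close()
    return (after.ru_minflt - before.ru_minflt,
            after.ru_majflt - before.ru_majflt)


if __name__ == "__main__":
    size = 16 * 1024 * 1024  # 16 MiB, an arbitrary size for illustration
    minor, major = faults_while_touching(size)
    print(f"touched {size // mmap.PAGESIZE} pages: "
          f"{minor} minor faults, {major} major faults")
```

On a typical Linux system the minor‑fault count is expected to be close to the number of pages touched, although huge pages or a sandboxed environment can change the numbers.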

vmscan

Use the vmscan tracepoints to observe memory reclaim under pressure, both by the kswapd daemon and by direct reclaim.

vmscan:mm_shrink_slab_end                [Tracepoint event]
vmscan:mm_shrink_slab_start              [Tracepoint event]
vmscan:mm_vmscan_direct_reclaim_begin    [Tracepoint event]
vmscan:mm_vmscan_direct_reclaim_end      [Tracepoint event]
vmscan:mm_vmscan_memcg_reclaim_begin     [Tracepoint event]
vmscan:mm_vmscan_memcg_reclaim_end       [Tracepoint event]
vmscan:mm_vmscan_wakeup_kswapd          [Tracepoint event]
vmscan:mm_vmscan_writepage               [Tracepoint event]

mm_shrink_slab_start/end: time spent shrinking slab caches.

mm_vmscan_direct_reclaim_begin/end: time spent in foreground direct reclaim (the allocating process may block).

mm_vmscan_memcg_reclaim_begin/end: time spent in memory‑cgroup reclaim.

mm_vmscan_wakeup_kswapd: number of times kswapd is woken.

mm_vmscan_writepage: number of pages written by kswapd.

drsnoop

drsnoop uses the mm_vmscan_direct_reclaim_begin and mm_vmscan_direct_reclaim_end tracepoints to track the direct‑reclaim portion of memory freeing, showing affected processes and latency.
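Deriving latency from a begin/end tracepoint pair comes down to remembering the begin timestamp per thread and subtracting it when the matching end arrives. The Python sketch below illustrates that pairing on made‑up event tuples; it is not drsnoop's code, and the reclaim_latencies() helper and the event layout are assumptions for illustration.

```python
def reclaim_latencies(events):
    """Pair begin/end events per thread and compute latency.

    events: iterable of (timestamp_ns, tid, comm, kind), kind is "begin" or "end".
    Returns a list of (comm, tid, latency_ms) for each completed pair.
    """
    started = {}   # tid -> timestamp of the pending "begin"
    results = []
    for ts, tid, comm, kind in events:
        if kind == "begin":
            started[tid] = ts
        elif kind == "end" and tid in started:
            results.append((comm, tid, (ts - started.pop(tid)) / 1e6))
        # An "end" without a recorded "begin" (tracing started late) is skipped.
    return results


if __name__ == "__main__":
    # Synthetic events, invented for this example.
    sample = [
        (1_000_000, 10, "app", "begin"),
        (2_000_000, 11, "db", "begin"),
        (3_500_000, 10, "app", "end"),
        (9_000_000, 12, "late", "end"),
    ]
    for comm, tid, ms in reclaim_latencies(sample):
        print(f"{comm} (tid {tid}) spent {ms:.2f} ms in direct reclaim")
```

Keying the pending timestamps by thread ID matters because direct reclaim runs in the context of the allocating task, so several tasks can be inside reclaim at the same time.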

Direct reclaim path:

__alloc_pages_slowpath() -> __alloc_pages_direct_reclaim() -> __perform_reclaim() -> try_to_free_pages() -> do_try_to_free_pages() -> shrink_zones() -> shrink_zone()

Key function try_to_free_pages() decides whether the current task joins the pfmemalloc_wait queue:

unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
    gfp_t gfp_mask, nodemask_t *nodemask) {
    unsigned long nr_reclaimed;
    struct scan_control sc = {
        .nr_to_reclaim = SWAP_CLUSTER_MAX,
        .gfp_mask = current_gfp_context(gfp_mask),
        .reclaim_idx = gfp_zone(gfp_mask),
        .order = order,
        .nodemask = nodemask,
        .priority = DEF_PRIORITY,
        .may_writepage = !laptop_mode,
        .may_unmap = 1,
        .may_swap = 1,
    };
    if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
        return 1;
    set_task_reclaim_state(current, &sc.reclaim_state);
    trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
    nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
    trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
    set_task_reclaim_state(current, NULL);
    return nr_reclaimed;
}

The helper throttle_direct_reclaim() determines if the task should wait on pfmemalloc_wait based on node balance and allocation flags.

static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
    nodemask_t *nodemask) {
    struct zoneref *z;
    struct zone *zone;
    pg_data_t *pgdat = NULL;
    if (current->flags & PF_KTHREAD)
        goto out;
    if (fatal_signal_pending(current))
        goto out;
    for_each_zone_zonelist_nodemask(zone, z, zonelist,
        gfp_zone(gfp_mask), nodemask) {
        if (zone_idx(zone) > ZONE_NORMAL)
            continue;
        pgdat = zone->zone_pgdat;
        if (allow_direct_reclaim(pgdat))
            goto out;
        break;
    }
    if (!pgdat)
        goto out;
    count_vm_event(PGSCAN_DIRECT_THROTTLE);
    if (!(gfp_mask & __GFP_FS))
        wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
            allow_direct_reclaim(pgdat), HZ);
    else
        wait_event_killable(pgdat->pfmemalloc_wait,
            allow_direct_reclaim(pgdat));
    if (fatal_signal_pending(current))
        return true;
out:
    return false;
}

swapin

Trace the swap_readpage() kernel function with a kprobe to identify which process triggers a page‑in from swap.

extern int swap_readpage(struct page *page, bool do_poll);

hfaults

Trace hugetlb_fault() with a kprobe to capture detailed information about huge‑page faults, including the associated mm_struct and vm_area_struct (which provides the filename).

vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
    unsigned long address, unsigned int flags);