
Linux Page Reclaim Mechanism and Memory Compaction: Detailed Source Code Analysis

This article explains the goals and common techniques of the Linux page-reclaim mechanism, the allocation paths and LRU data structures involved, and then walks through the kernel source for slow-path allocation, direct reclaim, and memory compaction, including the relevant functions and code snippets.


Page reclaim is a memory‑management technique in operating systems that frees pages no longer in use so they can be re‑allocated to other processes. When a program no longer needs a page, the OS marks it recyclable and later re‑assigns it when required.

The main purpose of page reclaim is to optimise memory utilisation and performance; by promptly freeing unused pages the system avoids waste and fragmentation, improves response time, stabilises the system, and reduces swap usage.

Common page‑reclaim techniques include on‑demand reclamation, global page‑replacement algorithms such as LRU and FIFO, and more complex policies based on working‑set models. These mechanisms are typically implemented and scheduled by the kernel according to application access patterns and resource demands.

1. Overview

As Linux continuously allocates memory, mounting memory pressure triggers reclamation of both anonymous and file pages. Seldom-used anonymous pages are written to swap and freed back to the buddy system. For file pages, clean pages are freed directly, while dirty pages are written back to storage before being released. Both paths increase the number of free page frames and relieve the pressure.

Understanding page reclamation requires first grasping the page-allocation process. The allocator first tries the fast path at the low watermark; if that fails, it wakes the kernel's page-reclaim thread (kswapd) for asynchronous reclamation and retries at the min watermark. If that also fails, it enters the slow path and reclaims pages directly, choosing a strategy based on the type of physical page (swap-backed or file-backed).

When memory pressure is detected (the slow path), the system employs three reclamation methods:

Cache reclamation, e.g., using the LRU (Least Recently Used) algorithm to free the least recently accessed pages. The kernel uses LRU to select the least recently used physical pages; if a victim page is mapped into a process's virtual address space, the mapping must first be removed from the page table.

Swapping out infrequently accessed memory by writing it to the swap partition; swap acts as a disk-backed extension of RAM, allowing pages to be swapped out and later swapped back in when accessed.

Out-of-Memory (OOM) killing, where the kernel terminates processes that consume large amounts of memory. The OOM killer assigns each process a score based primarily on its memory usage; higher scores increase the likelihood of termination, protecting the system as a whole.

Both cache reclamation and infrequently accessed memory reclamation typically use the LRU algorithm to select victim pages.
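The LRU policy described above can be illustrated with a small userspace sketch. This is not kernel code and the names are hypothetical; it only models the core idea: each access moves a page to the front of the list, and when a new page must come in with no room left, the page at the tail (least recently used) is the reclaim victim.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal illustrative LRU: an array-backed list of page IDs ordered from
 * most- to least-recently used. Not kernel code -- just the policy idea. */
#define LRU_CAP 4

struct lru {
    int pages[LRU_CAP];
    int count;
};

/* Touch a page: move it to the front; on a miss, insert it, evicting the
 * least recently used page (the tail) if the list is full.
 * Returns the evicted page ID, or -1 if nothing was evicted. */
static int lru_touch(struct lru *l, int page)
{
    int i, evicted = -1;

    for (i = 0; i < l->count; i++)
        if (l->pages[i] == page)
            break;

    if (i == l->count) {                   /* miss */
        if (l->count == LRU_CAP) {
            evicted = l->pages[LRU_CAP - 1];   /* reclaim victim */
            i = LRU_CAP - 1;
        } else {
            i = l->count++;
        }
    }
    /* shift everything ahead of position i down, put the page at the front */
    for (; i > 0; i--)
        l->pages[i] = l->pages[i - 1];
    l->pages[0] = page;
    return evicted;
}
```

After touching pages 1, 2, 3, 4 and then re-touching 1, the least recently used page is 2, so bringing in page 5 evicts it.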

2. Page Reclaim Mechanism

When a page allocation request arrives, the allocator first attempts allocation at the low watermark. If it fails (indicating mild memory shortage), the allocator wakes the per-node page-reclaim kernel thread (kswapd) for asynchronous reclamation and then retries at the min watermark. If that also fails (indicating severe shortage), the allocator reclaims pages directly.
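The watermark-driven decision can be sketched as a simple threshold check. The function name and return values below are illustrative, not the kernel's actual zone_watermark_ok() logic, but they capture the path selection the text describes:

```c
#include <assert.h>

/* Illustrative sketch of zone-watermark decisions (not the kernel's actual
 * zone_watermark_ok()). Free pages vs. two thresholds drive the path:
 *   free > low         : fast path succeeds, nothing else to do
 *   min < free <= low  : wake kswapd for asynchronous reclaim
 *   free <= min        : severe shortage, direct reclaim in the slow path */
enum alloc_path { PATH_FAST, PATH_WAKE_KSWAPD, PATH_DIRECT_RECLAIM };

static enum alloc_path pick_path(unsigned long free_pages,
                                 unsigned long min_wmark,
                                 unsigned long low_wmark)
{
    if (free_pages > low_wmark)
        return PATH_FAST;
    if (free_pages > min_wmark)
        return PATH_WAKE_KSWAPD;
    return PATH_DIRECT_RECLAIM;
}
```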

Different kinds of physical pages use different reclamation strategies: swap-backed pages are written to swap, while file-backed pages are backed by files on storage devices.

Principle for selecting physical pages to reclaim

The Linux kernel uses the LRU (Least Recently Used) algorithm to choose the least recently used physical pages. If a physical page is mapped into a process's virtual address space, the kernel removes the virtual‑to‑physical mapping from the page table.

2.1 LRU Data Structures

The memory‑management subsystem describes physical memory with a three‑level hierarchy: node (struct pglist_data), zone (struct zone), and page (struct page). The node structure contains a struct lruvec, which holds LRU list descriptors:

typedef struct pglist_data {
    ...
    spinlock_t        lru_lock; // LRU list lock
    /* Fields commonly accessed by the page reclaim scanner */
    struct lruvec      lruvec; // LRU descriptor, contains 5 LRU lists
    ...
} pg_data_t;

struct lruvec {
    struct list_head        lists[NR_LRU_LISTS]; // 5 doubly‑linked LRU heads
    struct zone_reclaim_stat    reclaim_stat; // statistics related to reclamation
    /* Evictions & activations on the inactive file list */
    atomic_long_t            inactive_age;
    /* Refaults at the time of last reclaim cycle */
    unsigned long            refaults; // records results of the last reclaim cycle
#ifdef CONFIG_MEMCG
    struct pglist_data *pgdat; // owning node structure
#endif
};

enum lru_list {
    LRU_INACTIVE_ANON = LRU_BASE,
    LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
    LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
    LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
    LRU_UNEVICTABLE,
    NR_LRU_LISTS
};

From the enum lru_list we can see the five LRU lists:

Inactive anonymous LRU list – links rarely used anonymous pages.

Active anonymous LRU list – links frequently used anonymous pages.

Inactive file LRU list – links rarely used file pages.

Active file LRU list – links frequently used file pages.

Unevictable LRU list – links pages locked in memory with mlock, not eligible for reclamation.
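Which of the five lists a page belongs on can be derived from its flag bits, mirroring the kernel's page_lru() logic: unevictable pages get their own list, file vs. anonymous is derived from PG_swapbacked (file pages are the ones not backed by swap), and PG_active selects the active variant. The bit values below are made up for illustration:

```c
#include <assert.h>

/* Hedged sketch of flag-to-LRU-list selection. Bit positions are
 * illustrative, not the kernel's real page-flag layout. */
enum lru_list {
    LRU_INACTIVE_ANON, LRU_ACTIVE_ANON,
    LRU_INACTIVE_FILE, LRU_ACTIVE_FILE,
    LRU_UNEVICTABLE, NR_LRU_LISTS
};

#define PG_active      (1u << 0)
#define PG_swapbacked  (1u << 1)   /* anon/shmem pages rely on swap */
#define PG_unevictable (1u << 2)

static enum lru_list flags_to_lru(unsigned int flags)
{
    int lru;

    if (flags & PG_unevictable)
        return LRU_UNEVICTABLE;
    /* file-backed pages are those NOT backed by swap */
    lru = (flags & PG_swapbacked) ? LRU_INACTIVE_ANON : LRU_INACTIVE_FILE;
    if (flags & PG_active)
        lru += 1;                  /* inactive -> active variant */
    return (enum lru_list)lru;
}
```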

The struct page descriptor holds flags (e.g., PG_locked, PG_error) and the management zone and node identifiers:

// Page descriptor flags (e.g., PG_locked, PG_error). The management zone and node ID are also stored.
struct page {
    /* Flags used by the LRU algorithm */
    /* PG_active: page is currently active; set when placed on the active LRU list */
    /* PG_referenced: page was recently accessed; set on each access */
    /* PG_lru: page is on an LRU list */
    /* PG_mlocked: page is locked in memory via mlock(), prohibiting swap out */
    /* PG_swapbacked: page relies on swap (anonymous, shmem, etc.) */
    unsigned long flags;
    ...
    union {
        /* The list the page belongs to depends on its state */
        struct list_head lru;   // linked into appropriate LRU list
        ...
    };
    ...
}

2.2 Page Reclaim Source Code Analysis

When a memory allocation function is called (alloc_page → alloc_pages_current → __alloc_pages_nodemask), __alloc_pages_slowpath is the heart of the allocation process. Having previously described the fast path, get_page_from_freelist, we now examine the slow-path function, located in mm/page_alloc.c:

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                        struct alloc_context *ac)
{
    bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
    const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
    struct page *page = NULL;
    unsigned int alloc_flags;
    unsigned long did_some_progress;
    enum compact_priority compact_priority;
    enum compact_result compact_result;
    int compaction_retries;
    int no_progress_loops;
    unsigned int cpuset_mems_cookie;
    int reserve_flags;

    if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
                (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
        gfp_mask &= ~__GFP_ATOMIC;

retry_cpuset:
    compaction_retries = 0;
    no_progress_loops = 0;
    compact_priority = DEF_COMPACT_PRIORITY;
    cpuset_mems_cookie = read_mems_allowed_begin();

    alloc_flags = gfp_to_alloc_flags(gfp_mask);

    ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                    ac->high_zoneidx, ac->nodemask);
    if (!ac->preferred_zoneref->zone)
        goto nopage;

    if (gfp_mask & __GFP_KSWAPD_RECLAIM)
        wake_all_kswapds(order, gfp_mask, ac);

    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
    if (page)
        goto got_pg;

    if (can_direct_reclaim &&
            (costly_order ||
               (order > 0 && ac->migratetype != MIGRATE_MOVABLE))
            && !gfp_pfmemalloc_allowed(gfp_mask)) {
        page = __alloc_pages_direct_compact(gfp_mask, order,
                        alloc_flags, ac,
                        INIT_COMPACT_PRIORITY,
                        &compact_result);
        if (page)
            goto got_pg;

        if (costly_order && (gfp_mask & __GFP_NORETRY)) {
            if (compact_result == COMPACT_DEFERRED)
                goto nopage;
            compact_priority = INIT_COMPACT_PRIORITY;
        }
    }

retry:
    if (gfp_mask & __GFP_KSWAPD_RECLAIM)
        wake_all_kswapds(order, gfp_mask, ac);

    reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
    if (reserve_flags)
        alloc_flags = reserve_flags;

    if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
        ac->nodemask = NULL;
        ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                    ac->high_zoneidx, ac->nodemask);
    }

    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
    if (page)
        goto got_pg;

    if (!can_direct_reclaim)
        goto nopage;

    if (current->flags & PF_MEMALLOC)
        goto nopage;

    page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                            &did_some_progress);
    if (page)
        goto got_pg;

    page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
                    compact_priority, &compact_result);
    if (page)
        goto got_pg;

    if (gfp_mask & __GFP_NORETRY)
        goto nopage;

    if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
        goto nopage;

    if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                 did_some_progress > 0, &no_progress_loops))
        goto retry;

    if (did_some_progress > 0 &&
            should_compact_retry(ac, order, alloc_flags,
                compact_result, &compact_priority,
                &compaction_retries))
        goto retry;

    if (check_retry_cpuset(cpuset_mems_cookie, ac))
        goto retry_cpuset;

    page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
    if (page)
        goto got_pg;

    if (tsk_is_oom_victim(current) &&
        (alloc_flags == ALLOC_OOM ||
         (gfp_mask & __GFP_NOMEMALLOC)))
        goto nopage;

    if (did_some_progress) {
        no_progress_loops = 0;
        goto retry;
    }

nopage:
    if (check_retry_cpuset(cpuset_mems_cookie, ac))
        goto retry_cpuset;

    if (gfp_mask & __GFP_NOFAIL) {
        if (WARN_ON_ONCE(!can_direct_reclaim))
            goto fail;
        WARN_ON_ONCE(current->flags & PF_MEMALLOC);
        WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
        page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
        if (page)
            goto got_pg;
        cond_resched();
        goto retry;
    }
fail:
    warn_alloc(gfp_mask, ac->nodemask,
            "page allocation failure: order:%u", order);
got_pg:
    return page;
}

The code shows that the slow path performs many checks, wakes kswapd threads for asynchronous reclamation, and may invoke direct reclaim or compaction before finally invoking the OOM killer.

2.3 Direct Page Reclaim

In the slow path, after asynchronous reclamation fails, the kernel attempts direct page reclaim via __alloc_pages_direct_reclaim (also in mm/page_alloc.c):

static inline struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
        unsigned int alloc_flags, const struct alloc_context *ac,
        unsigned long *did_some_progress)
{
    struct page *page = NULL;
    bool drained = false;
    *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
    if (unlikely(!(*did_some_progress)))
        return NULL;

retry:
    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

    if (!page && !drained) {
        unreserve_highatomic_pageblock(ac, false);
        drain_all_pages(NULL);
        drained = true;
        goto retry;
    }

    return page;
}

__alloc_pages_direct_reclaim calls __perform_reclaim , which performs synchronous reclamation:

static int
__perform_reclaim(gfp_t gfp_mask, unsigned int order,
                    const struct alloc_context *ac)
{
    struct reclaim_state reclaim_state;
    int progress;
    unsigned int noreclaim_flag;

    cond_resched();
    cpuset_memory_pressure_bump();
    fs_reclaim_acquire(gfp_mask);
    noreclaim_flag = memalloc_noreclaim_save();
    reclaim_state.reclaimed_slab = 0;
    current->reclaim_state = &reclaim_state;
    
    progress = try_to_free_pages(ac->zonelist, order, gfp_mask,
                                ac->nodemask);

    current->reclaim_state = NULL;
    memalloc_noreclaim_restore(noreclaim_flag);
    fs_reclaim_release(gfp_mask);
    cond_resched();
    return progress;
}
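Note the memalloc_noreclaim_save()/restore() pair around the reclaim call: it sets PF_MEMALLOC on the current task so that allocations made while reclaiming do not recursively enter reclaim, and restores the previous state afterwards. The userspace model below imitates that save/restore idiom; the names and the global stand-in for current->flags are illustrative, not the kernel API:

```c
#include <assert.h>

/* Userspace model of the memalloc_noreclaim_save()/restore() idiom:
 * set PF_MEMALLOC, remembering whether it was already set, so nested
 * sections restore the flag correctly. Not kernel code. */
#define PF_MEMALLOC 0x0800

static unsigned int task_flags;   /* stand-in for current->flags */

static unsigned int noreclaim_save(void)
{
    unsigned int old = task_flags & PF_MEMALLOC;
    task_flags |= PF_MEMALLOC;    /* forbid recursive reclaim from here on */
    return old;
}

static void noreclaim_restore(unsigned int old)
{
    /* clear the bit only if it was not already set before the save */
    task_flags = (task_flags & ~PF_MEMALLOC) | old;
}
```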

try_to_free_pages (in mm/vmscan.c ) sets up a scan_control structure and invokes do_try_to_free_pages :

unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                gfp_t gfp_mask, nodemask_t *nodemask)
{
    unsigned long nr_reclaimed;
    struct scan_control sc = {
        .nr_to_reclaim = SWAP_CLUSTER_MAX,
        .gfp_mask = current_gfp_context(gfp_mask),
        .reclaim_idx = gfp_zone(gfp_mask),
        .order = order,
        .nodemask = nodemask,
        .priority = DEF_PRIORITY,
        .may_writepage = !laptop_mode,
        .may_unmap = 1,
        .may_swap = 1,
    };
    // ... omitted checks ...
    if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
        return 1;
    nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
    return nr_reclaimed;
}
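The .priority field initialized to DEF_PRIORITY (12) controls how aggressively the LRU lists are scanned: each reclaim pass examines roughly lru_size >> priority pages, and as priority drops toward zero the scan window doubles until, at priority 0, the whole list is considered. A hedged sketch of that arithmetic (the minimum-of-one fallback is an assumption for illustration, not an exact transcription of get_scan_count):

```c
#include <assert.h>

/* Priority-driven scan window: effort doubles as sc->priority drops
 * from DEF_PRIORITY toward 0. Illustrative, not the kernel's exact
 * get_scan_count() logic. */
#define DEF_PRIORITY 12

static unsigned long scan_window(unsigned long lru_size, int priority)
{
    unsigned long nr = lru_size >> priority;

    return nr ? nr : 1;   /* assumed floor so small lists still progress */
}
```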

do_try_to_free_pages repeatedly calls shrink_zones , which in turn calls shrink_node to reclaim memory from each zone:

static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
    // ... handle buffer‑head limits ...
    for_each_zone_zonelist_nodemask(zone, z, zonelist,
                    sc->reclaim_idx, sc->nodemask) {
        if (global_reclaim(sc)) {
            // ... handle compaction readiness ...
        }
        if (zone->zone_pgdat == last_pgdat)
            continue;
        // ... memory‑cgroup soft‑limit reclaim ...
        shrink_node(zone->zone_pgdat, sc);
    }
}

shrink_node iterates over the node's memory cgroups and invokes shrink_node_memcg, which calls shrink_list for each LRU list. shrink_list delegates inactive lists to shrink_inactive_list, which finally calls shrink_page_list to process each isolated page: unmapping it, writing back dirty pages, releasing buffers, and freeing the page.

When the direct reclaim flow finishes, the kernel has increased the number of free page frames, reducing memory pressure.

3. Memory Compaction Process Analysis

The function __alloc_pages_direct_compact (in mm/page_alloc.c ) attempts memory compaction for high‑order allocations before reclamation:

static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
        unsigned int alloc_flags, const struct alloc_context *ac,
        enum compact_priority prio, enum compact_result *compact_result)
{
    struct page *page;
    unsigned int noreclaim_flag;

    if (!order)
        return NULL;

    noreclaim_flag = memalloc_noreclaim_save();
    *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
                                 prio);
    memalloc_noreclaim_restore(noreclaim_flag);

    if (*compact_result <= COMPACT_INACTIVE)
        return NULL;

    count_vm_event(COMPACTSTALL);
    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
    if (page) {
        struct zone *zone = page_zone(page);
        zone->compact_blockskip_flush = false;
        compaction_defer_reset(zone, order, true);
        count_vm_event(COMPACTSUCCESS);
        return page;
    }
    count_vm_event(COMPACTFAIL);
    cond_resched();
    return NULL;
}

try_to_compact_pages iterates over each zone in the zonelist and invokes compact_zone_order for the target order:

enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
        unsigned int alloc_flags, const struct alloc_context *ac,
        enum compact_priority prio)
{
    int may_perform_io = gfp_mask & __GFP_IO;
    struct zoneref *z;
    struct zone *zone;
    enum compact_result rc = COMPACT_SKIPPED;

    if (!may_perform_io)
        return COMPACT_SKIPPED;

    for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
                                ac->nodemask) {
        enum compact_result status;
        if (prio > MIN_COMPACT_PRIORITY && compaction_deferred(zone, order)) {
            rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
            continue;
        }
        status = compact_zone_order(zone, order, gfp_mask, prio,
                    alloc_flags, ac_classzone_idx(ac));
        rc = max(status, rc);
        if (status == COMPACT_SUCCESS) {
            compaction_defer_reset(zone, order, false);
            break;
        }
        if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
                    status == COMPACT_PARTIAL_SKIPPED))
            defer_compaction(zone, order);
        if ((prio == COMPACT_PRIO_ASYNC && need_resched()) ||
                fatal_signal_pending(current))
            break;
    }
    return rc;
}

compact_zone_order fills a struct compact_control and forwards it to compact_zone :

static enum compact_result compact_zone_order(struct zone *zone, int order,
        gfp_t gfp_mask, enum compact_priority prio,
        unsigned int alloc_flags, int classzone_idx)
{
    enum compact_result ret;
    struct compact_control cc = {
        .nr_freepages = 0,
        .nr_migratepages = 0,
        .total_migrate_scanned = 0,
        .total_free_scanned = 0,
        .order = order,
        .gfp_mask = gfp_mask,
        .zone = zone,
        .mode = (prio == COMPACT_PRIO_ASYNC) ?
                    MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
        .alloc_flags = alloc_flags,
        .classzone_idx = classzone_idx,
        .direct_compaction = true,
        .whole_zone = (prio == MIN_COMPACT_PRIORITY),
        .ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
        .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
    };
    INIT_LIST_HEAD(&cc.freepages);
    INIT_LIST_HEAD(&cc.migratepages);
    ret = compact_zone(zone, &cc);
    VM_BUG_ON(!list_empty(&cc.freepages));
    VM_BUG_ON(!list_empty(&cc.migratepages));
    return ret;
}

compact_zone first checks whether compaction is suitable via compaction_suitable . If suitable, it scans for free pages and migratable pages, isolates them, and calls migrate_pages to move pages to the free targets. The loop continues until compact_finished reports that the migrate and free scanners have met or the zone’s watermarks are high enough.
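The two converging scanners can be modeled with a toy simulation: the migrate scanner walks up from the zone start looking for movable used pages, the free scanner walks down from the zone end looking for free target pages, and each used page found low in the zone is "migrated" into a free slot high in the zone until the scanners meet, leaving contiguous free space at the low end. This is a conceptual sketch, not the kernel's pageblock-granular algorithm:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of compaction's two scanners over a zone of n page slots.
 * used[i] == true means page i is allocated (and assumed movable). */
static void toy_compact(bool used[], int n)
{
    int migrate = 0;            /* migrate scanner: walks up from start */
    int free_slot = n - 1;      /* free scanner: walks down from end   */

    while (migrate < free_slot) {
        if (used[migrate]) {                 /* movable page near the start */
            while (free_slot > migrate && used[free_slot])
                free_slot--;                 /* find a free target near end */
            if (free_slot <= migrate)
                break;                       /* scanners met: finished */
            used[free_slot] = true;          /* "migrate" the page upward */
            used[migrate] = false;           /* its old frame is now free */
        }
        migrate++;
    }
}
```

After running, all free frames are contiguous at the low end of the "zone", which is exactly what a subsequent high-order allocation needs.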

static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
{
    enum compact_result ret;
    unsigned long start_pfn = zone->zone_start_pfn;
    unsigned long end_pfn = zone_end_pfn(zone);
    const bool sync = cc->mode != MIGRATE_ASYNC;

    cc->migratetype = gfpflags_to_migratetype(cc->gfp_mask);
    ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
                            cc->classzone_idx);
    if (ret == COMPACT_SUCCESS || ret == COMPACT_SKIPPED)
        return ret;
    VM_BUG_ON(ret != COMPACT_CONTINUE);

    if (compaction_restarting(zone, cc->order))
        __reset_isolation_suitable(zone);

    if (cc->whole_zone) {
        cc->migrate_pfn = start_pfn;
        cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
    } else {
        cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
        cc->free_pfn = zone->compact_cached_free_pfn;
        if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
            cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
            zone->compact_cached_free_pfn = cc->free_pfn;
        }
        if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
            cc->migrate_pfn = start_pfn;
            zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
            zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
        }
        if (cc->migrate_pfn == start_pfn)
            cc->whole_zone = true;
    }
    cc->last_migrated_pfn = 0;
    trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
                cc->free_pfn, end_pfn, sync);
    migrate_prep_local();
    while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
        int err;
        switch (isolate_migratepages(zone, cc)) {
        case ISOLATE_ABORT:
            ret = COMPACT_CONTENDED;
            putback_movable_pages(&cc->migratepages);
            cc->nr_migratepages = 0;
            goto out;
        case ISOLATE_NONE:
            goto check_drain;
        case ISOLATE_SUCCESS:
            ;
        }
        err = migrate_pages(&cc->migratepages, compaction_alloc,
                compaction_free, (unsigned long)cc, cc->mode,
                MR_COMPACTION);
        trace_mm_compaction_migratepages(cc->nr_migratepages, err,
                            &cc->migratepages);
        cc->nr_migratepages = 0;
        if (err) {
            putback_movable_pages(&cc->migratepages);
            if (err == -ENOMEM && !compact_scanners_met(cc)) {
                ret = COMPACT_CONTENDED;
                goto out;
            }
            if (cc->direct_compaction &&
                        (cc->mode == MIGRATE_ASYNC)) {
                cc->migrate_pfn = block_end_pfn(
                        cc->migrate_pfn - 1, cc->order);
                cc->last_migrated_pfn = 0;
            }
        }
check_drain:
        if (cc->order > 0 && cc->last_migrated_pfn) {
            int cpu;
            unsigned long current_block_start =
                block_start_pfn(cc->migrate_pfn, cc->order);
            if (cc->last_migrated_pfn < current_block_start) {
                cpu = get_cpu();
                lru_add_drain_cpu(cpu);
                drain_local_pages(zone);
                put_cpu();
                cc->last_migrated_pfn = 0;
            }
        }
    }
out:
    if (cc->nr_freepages > 0) {
        unsigned long free_pfn = release_freepages(&cc->freepages);
        cc->nr_freepages = 0;
        VM_BUG_ON(free_pfn == 0);
        free_pfn = pageblock_start_pfn(free_pfn);
        if (free_pfn > zone->compact_cached_free_pfn)
            zone->compact_cached_free_pfn = free_pfn;
    }
    count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
    count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);
    trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
                cc->free_pfn, end_pfn, sync, ret);
    return ret;
}

compaction_suitable decides whether compaction should run based on zone watermarks and fragmentation index:

enum compact_result compaction_suitable(struct zone *zone, int order,
                    unsigned int alloc_flags,
                    int classzone_idx)
{
    enum compact_result ret;
    int fragindex;
    ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
                    zone_page_state(zone, NR_FREE_PAGES));
    if (ret == COMPACT_CONTINUE && (order > PAGE_ALLOC_COSTLY_ORDER)) {
        fragindex = fragmentation_index(zone, order);
        if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
            ret = COMPACT_NOT_SUITABLE_ZONE;
    }
    trace_mm_compaction_suitable(zone, order, ret);
    if (ret == COMPACT_NOT_SUITABLE_ZONE)
        ret = COMPACT_SKIPPED;
    return ret;
}

static enum compact_result __compaction_suitable(struct zone *zone, int order,
                    unsigned int alloc_flags,
                    int classzone_idx,
                    unsigned long wmark_target)
{
    unsigned long watermark;
    if (is_via_compact_memory(order))
        return COMPACT_CONTINUE;
    watermark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
    if (zone_watermark_ok(zone, order, watermark, classzone_idx,
                                alloc_flags))
        return COMPACT_SUCCESS;
    watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
                low_wmark_pages(zone) : min_wmark_pages(zone);
    watermark += compact_gap(order);
    if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
                        ALLOC_CMA, wmark_target))
        return COMPACT_SKIPPED;
    return COMPACT_CONTINUE;
}
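The watermark target computed in the second half of __compaction_suitable can be restated compactly. In kernels of this era compact_gap(order) is defined as 2UL << order (worth verifying against your tree): compaction needs headroom for both the isolated migration candidates and their free targets. The helper below is a sketch of that arithmetic:

```c
#include <assert.h>

/* Sketch of the watermark target used by __compaction_suitable(): start
 * from the low watermark for costly orders (else the min watermark) and
 * add compact_gap(order). Assumes compact_gap(order) == 2UL << order. */
#define PAGE_ALLOC_COSTLY_ORDER 3

static unsigned long compact_gap(unsigned int order)
{
    return 2UL << order;   /* room for migration sources and targets */
}

static unsigned long compaction_wmark_target(unsigned int order,
                                             unsigned long min_wmark,
                                             unsigned long low_wmark)
{
    unsigned long wmark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
                          low_wmark : min_wmark;

    return wmark + compact_gap(order);
}
```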

compact_finished (and its helper __compact_finished ) determines when compaction can stop: either the migrate and free scanners meet, or a suitable free page of the required migratetype is found, or the zone’s watermarks are high enough.

static enum compact_result compact_finished(struct zone *zone,
            struct compact_control *cc)
{
    int ret;
    ret = __compact_finished(zone, cc);
    trace_mm_compaction_finished(zone, cc->order, ret);
    if (ret == COMPACT_NO_SUITABLE_PAGE)
        ret = COMPACT_CONTINUE;
    return ret;
}

static enum compact_result __compact_finished(struct zone *zone,
                        struct compact_control *cc)
{
    unsigned int order;
    int ret;
    if (cc->contended || fatal_signal_pending(current))
        return COMPACT_CONTENDED;
    if (compact_scanners_met(cc)) {
        reset_cached_positions(zone);
        if (cc->direct_compaction)
            zone->compact_blockskip_flush = true;
        return cc->whole_zone ? COMPACT_COMPLETE : COMPACT_PARTIAL_SKIPPED;
    }
    if (is_via_compact_memory(cc->order))
        return COMPACT_CONTINUE;
    if (cc->finishing_block) {
        if (IS_ALIGNED(cc->migrate_pfn, pageblock_nr_pages))
            cc->finishing_block = false;
        else
            return COMPACT_CONTINUE;
    }
    for (order = cc->order; order < MAX_ORDER; order++) {
        struct free_area *area = &zone->free_area[order];
        bool can_steal;
        if (!list_empty(&area->free_list[cc->migratetype]))
            return COMPACT_SUCCESS;
        #ifdef CONFIG_CMA
        if (cc->migratetype == MIGRATE_MOVABLE &&
            !list_empty(&area->free_list[MIGRATE_CMA]))
            return COMPACT_SUCCESS;
        #endif
        if (find_suitable_fallback(area, order, cc->migratetype,
                        true, &can_steal) != -1) {
            if (cc->migratetype == MIGRATE_MOVABLE)
                return COMPACT_SUCCESS;
            if (cc->mode == MIGRATE_ASYNC ||
                IS_ALIGNED(cc->migrate_pfn, pageblock_nr_pages)) {
                return COMPACT_SUCCESS;
            }
            cc->finishing_block = true;
            return COMPACT_CONTINUE;
        }
    }
    return COMPACT_NO_SUITABLE_PAGE;
}

The core page‑migration routine is migrate_pages (in mm/migrate.c ), which repeatedly extracts pages from a list, obtains a new destination page via a callback, and moves the content with unmap_and_move :

int migrate_pages(struct list_head *from, new_page_t get_new_page,
        free_page_t put_new_page, unsigned long private,
        enum migrate_mode mode, int reason)
{
    int retry = 1;
    int nr_failed = 0;
    int nr_succeeded = 0;
    int pass = 0;
    struct page *page;
    struct page *page2;
    int swapwrite = current->flags & PF_SWAPWRITE;
    int rc;

    if (!swapwrite)
        current->flags |= PF_SWAPWRITE;

    for(pass = 0; pass < 10 && retry; pass++) {
        retry = 0;
        list_for_each_entry_safe(page, page2, from, lru) {
retry:
            cond_resched();
            if (PageHuge(page))
                rc = unmap_and_move_huge_page(get_new_page,
                        put_new_page, private, page,
                        pass > 2, mode, reason);
            else
                rc = unmap_and_move(get_new_page, put_new_page,
                        private, page, pass > 2, mode,
                        reason);
            switch(rc) {
            case -ENOMEM:
                if (PageTransHuge(page) && !PageHuge(page)) {
                    lock_page(page);
                    rc = split_huge_page_to_list(page, from);
                    unlock_page(page);
                    if (!rc) {
                        list_safe_reset_next(page, page2, lru);
                        goto retry;
                    }
                }
                nr_failed++;
                goto out;
            case -EAGAIN:
                retry++;
                break;
            case MIGRATEPAGE_SUCCESS:
                nr_succeeded++;
                break;
            default:
                nr_failed++;
                break;
            }
        }
    }
    nr_failed += retry;
    rc = nr_failed;
out:
    if (nr_succeeded)
        count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
    if (nr_failed)
        count_vm_events(PGMIGRATE_FAIL, nr_failed);
    trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
    if (!swapwrite)
        current->flags &= ~PF_SWAPWRITE;
    return rc;
}

The migration callbacks used by the compaction path are compaction_alloc and compaction_free :

static struct page *compaction_alloc(struct page *migratepage,
                    unsigned long data)
{
    struct compact_control *cc = (struct compact_control *)data;
    struct page *freepage;
    if (list_empty(&cc->freepages)) {
        if (!cc->contended)
            isolate_freepages(cc);
        if (list_empty(&cc->freepages))
            return NULL;
    }
    freepage = list_entry(cc->freepages.next, struct page, lru);
    list_del(&freepage->lru);
    cc->nr_freepages--;
    return freepage;
}

static void compaction_free(struct page *page, unsigned long data)
{
    struct compact_control *cc = (struct compact_control *)data;
    list_add(&page->lru, &cc->freepages);
    cc->nr_freepages++;
}
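The callback contract between migrate_pages and these two helpers can be modeled in userspace: the caller supplies an allocator (compaction_alloc-like) that hands back a target page, and a release hook (compaction_free-like) for targets that end up unused; migration then copies the source contents into the target. All names and the fixed-size page pool below are illustrative, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace model of migrate_pages()' get_new_page/put_new_page contract. */
#define POOL_PAGES 4
#define TOY_PAGE_SZ 64

static char pool[POOL_PAGES][TOY_PAGE_SZ];
static int pool_used[POOL_PAGES];

static char *toy_alloc(void *private)            /* like compaction_alloc() */
{
    int i;

    (void)private;
    for (i = 0; i < POOL_PAGES; i++)
        if (!pool_used[i]) {
            pool_used[i] = 1;
            return pool[i];
        }
    return NULL;                                 /* no free target page */
}

static void toy_free(char *page, void *private)  /* like compaction_free() */
{
    (void)private;
    pool_used[(page - &pool[0][0]) / TOY_PAGE_SZ] = 0;
}

/* Copy src into a freshly allocated target; the target replaces the
 * original, mirroring a successful migration. Returns NULL on failure. */
static char *toy_migrate(const char *src,
                         char *(*get_new)(void *),
                         void (*put_new)(char *, void *))
{
    char *dst = get_new(NULL);

    if (!dst)
        return NULL;
    memcpy(dst, src, TOY_PAGE_SZ);
    (void)put_new;   /* the kernel invokes this only on failure paths */
    return dst;
}
```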

The actual movement is performed by unmap_and_move , which obtains a destination page via the get_new_page callback (here compaction_alloc ) and then calls __unmap_and_move to copy or migrate the contents:

static ICE_noinline int unmap_and_move(new_page_t get_new_page,
                   free_page_t put_new_page,
                   unsigned long private, struct page *page,
                   int force, enum migrate_mode mode,
                   enum migrate_reason reason)
{
    int rc = MIGRATEPAGE_SUCCESS;
    struct page *newpage;
    if (!thp_migration_supported() && PageTransHuge(page))
        return -ENOMEM;
    newpage = get_new_page(page, private);
    if (!newpage)
        return -ENOMEM;
    if (page_count(page) == 1) {
        ClearPageActive(page);
        ClearPageUnevictable(page);
        if (unlikely(__PageMovable(page))) {
            lock_page(page);
            if (!PageMovable(page))
                __ClearPageIsolated(page);
            unlock_page(page);
        }
        if (put_new_page)
            put_new_page(newpage, private);
        else
            put_page(newpage);
        goto out;
    }
    rc = __unmap_and_move(page, newpage, force, mode);
    if (rc == MIGRATEPAGE_SUCCESS)
        set_page_owner_migrate_reason(newpage, reason);
out:
    if (rc != -EAGAIN) {
        list_del(&page->lru);
        if (likely(!__PageMovable(page)))
            mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
                    page_is_file_cache(page), -hpage_nr_pages(page));
    }
    if (rc == MIGRATEPAGE_SUCCESS) {
        put_page(page);
        if (reason == MR_MEMORY_FAILURE) {
            if (set_hwpoison_free_buddy_page(page))
                num_poisoned_pages_inc();
        }
    } else {
        if (rc != -EAGAIN) {
            if (likely(!__PageMovable(page))) {
                putback_lru_page(page);
                goto put_new;
            }
            lock_page(page);
            if (PageMovable(page))
                putback_movable_page(page);
            else
                __ClearPageIsolated(page);
            unlock_page(page);
            put_page(page);
        }
put_new:
        if (put_new_page)
            put_new_page(newpage, private);
        else
            put_page(newpage);
    }
    return rc;
}

static int __unmap_and_move(struct page *page, struct page *newpage,
                int force, enum migrate_mode mode)
{
    int rc = -EAGAIN;
    int page_was_mapped = 0;
    struct anon_vma *anon_vma = NULL;
    bool is_lru = !__PageMovable(page);
    /* Avoid blocking in async mode, and never sleep on a page lock
     * when the caller is itself reclaiming (PF_MEMALLOC). */
    if (!trylock_page(page)) {
        if (!force || mode == MIGRATE_ASYNC)
            goto out;
        if (current->flags & PF_MEMALLOC)
            goto out;
        lock_page(page);
    }
    if (PageWriteback(page)) {
        switch (mode) {
        case MIGRATE_SYNC:
        case MIGRATE_SYNC_NO_COPY:
            break;
        default:
            rc = -EBUSY;
            goto out_unlock;
        }
        if (!force)
            goto out_unlock;
        wait_on_page_writeback(page);
    }
    if (PageAnon(page) && !PageKsm(page))
        anon_vma = page_get_anon_vma(page);
    // ... newpage locking and non-LRU movable-page handling omitted for brevity ...
    if (page_mapped(page)) {
        try_to_unmap(page,
            TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
        page_was_mapped = 1;
    }
    if (!page_mapped(page))
        rc = move_to_new_page(newpage, page, mode);
    if (page_was_mapped)
        remove_migration_ptes(page,
            rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);
out_unlock:
    if (anon_vma)
        put_anon_vma(anon_vma);
    unlock_page(page);
out:
    if (rc == MIGRATEPAGE_SUCCESS) {
        if (unlikely(!is_lru))
            put_page(newpage);
        else
            putback_lru_page(newpage);
    }
    return rc;
}

move_to_new_page performs the actual data migration. For pages with no address_space mapping (typical anonymous pages) it calls migrate_page directly; for file‑backed pages it invokes the filesystem’s migratepage callback, or fallback_migrate_page when none is provided. Non‑LRU movable pages are always migrated through their driver’s migratepage operation:

static int move_to_new_page(struct page *newpage, struct page *page,
                enum migrate_mode mode)
{
    struct address_space *mapping;
    int rc = -EAGAIN;
    bool is_lru = !__PageMovable(page);
    VM_BUG_ON_PAGE(!PageLocked(page), page);
    VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
    mapping = page_mapping(page);
    if (likely(is_lru)) {
        if (!mapping)
            rc = migrate_page(mapping, newpage, page, mode);
        else if (mapping->a_ops->migratepage)
            rc = mapping->a_ops->migratepage(mapping, newpage,
                            page, mode);
        else
            rc = fallback_migrate_page(mapping, newpage,
                            page, mode);
    } else {
        VM_BUG_ON_PAGE(!PageIsolated(page), page);
        if (!PageMovable(page)) {
            rc = MIGRATEPAGE_SUCCESS;
            __ClearPageIsolated(page);
            goto out;
        }
        rc = mapping->a_ops->migratepage(mapping, newpage,
                        page, mode);
        WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
            !PageIsolated(page));
    }
    if (rc == MIGRATEPAGE_SUCCESS) {
        if (__PageMovable(page)) {
            VM_BUG_ON_PAGE(!PageIsolated(page), page);
            __ClearPageIsolated(page);
        }
        if (!PageMappingFlags(page))
            page->mapping = NULL;
        if (unlikely(is_zone_device_page(newpage))) {
            if (is_device_public_page(newpage))
                flush_dcache_page(newpage);
        } else
            flush_dcache_page(newpage);
    }
out:
    return rc;
}
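The three‑way dispatch in move_to_new_page follows a common kernel pattern: an operations table whose callback is optional, with a generic path serving both the no‑mapping case and the fallback. The userspace sketch below models that pattern; all names (toy_ops, dispatch_migrate, the dummy return values) are invented for illustration and are not kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of address_space_operations dispatch: a mapping may
 * supply its own migratepage hook; otherwise a generic routine
 * is used. Return values are arbitrary markers for testing. */
struct toy_ops {
    int (*migratepage)(int src, int dst);  /* optional override */
};

static int generic_migrate(int src, int dst) { (void)dst; return 100 + src; }
static int fs_migrate(int src, int dst)      { (void)dst; return 200 + src; }

/* Mirrors the branch structure: no mapping -> generic path,
 * mapping with a hook -> the hook, mapping without one -> fallback. */
static int dispatch_migrate(const struct toy_ops *ops, int src, int dst)
{
    if (!ops)                      /* anonymous page: no mapping */
        return generic_migrate(src, dst);
    if (ops->migratepage)          /* filesystem provides a hook */
        return ops->migratepage(src, dst);
    return generic_migrate(src, dst);  /* generic fallback */
}
```

The design advantage is the same as in the kernel: filesystems that can migrate a page without copying (or with extra bookkeeping) plug in a hook, and everyone else gets correct default behaviour for free.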

If the filesystem does not provide a migratepage operation, fallback_migrate_page writes dirty pages back synchronously (failing with -EBUSY in asynchronous modes), releases private buffers, and finally calls migrate_page:

static int fallback_migrate_page(struct address_space *mapping,
        struct page *newpage, struct page *page, enum migrate_mode mode)
{
    if (PageDirty(page)) {
        switch (mode) {
        case MIGRATE_SYNC:
        case MIGRATE_SYNC_NO_COPY:
            break;
        default:
            return -EBUSY;
        }
        return writeout(mapping, page);
    }
    if (page_has_private(page) &&
        !try_to_release_page(page, GFP_KERNEL))
        return -EAGAIN;
    return migrate_page(mapping, newpage, page, mode);
}
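The branching above reduces to a small decision table, modeled below as a standalone function. The enum names and boolean parameters are illustrative abstractions of the page state, not kernel types; this is a sketch of the dirty/private‑buffer dispatch only.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative modes and outcomes; names echo the kernel's but this
 * is a self-contained model, not kernel code. */
enum mig_mode   { ASYNC, SYNC_LIGHT, SYNC, SYNC_NO_COPY };
enum mig_result { DO_WRITEOUT, RETRY_EAGAIN, RETRY_EBUSY, DO_MIGRATE };

/* Mirrors fallback_migrate_page's branching: dirty pages are written
 * out only in fully synchronous modes; pages whose private buffers
 * cannot be released are retried later; everything else proceeds to
 * the generic copy path. */
static enum mig_result fallback_decision(bool dirty, bool has_private,
                                         bool can_release,
                                         enum mig_mode mode)
{
    if (dirty) {
        if (mode == SYNC || mode == SYNC_NO_COPY)
            return DO_WRITEOUT;   /* writeout(mapping, page) */
        return RETRY_EBUSY;       /* -EBUSY for async/light modes */
    }
    if (has_private && !can_release)
        return RETRY_EAGAIN;      /* -EAGAIN: buffers still pinned */
    return DO_MIGRATE;            /* migrate_page(...) */
}
```

Written this way, the policy is easy to read off: only fully synchronous migration is allowed to pay the cost of writeback, while asynchronous callers are told to come back later.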

When all reclamation and compaction attempts fail, the allocator may invoke the OOM path via __alloc_pages_may_oom (in mm/page_alloc.c), which tries a final high‑watermark allocation and, if still unsuccessful, calls out_of_memory to select a victim process. For __GFP_NOFAIL allocations it falls back to a no‑watermark allocation after OOM handling.

static struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
    const struct alloc_context *ac, unsigned long *did_some_progress)
{
    struct oom_control oc = {
        .zonelist = ac->zonelist,
        .nodemask = ac->nodemask,
        .memcg = NULL,
        .gfp_mask = gfp_mask,
        .order = order,
    };
    struct page *page;
    *did_some_progress = 0;
    if (!mutex_trylock(&oom_lock)) {
        *did_some_progress = 1;
        schedule_timeout_uninterruptible(1);
        return NULL;
    }
    page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
                      ~__GFP_DIRECT_RECLAIM, order,
                      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
    if (page)
        goto out;
    if (current->flags & PF_DUMPCORE)
        goto out;
    if (order > PAGE_ALLOC_COSTLY_ORDER)
        goto out;
    if (gfp_mask & __GFP_RETRY_MAYFAIL)
        goto out;
    if (ac->high_zoneidx < ZONE_NORMAL)
        goto out;
    if (pm_suspended_storage())
        goto out;
    if (gfp_mask & __GFP_THISNODE)
        goto out;
    if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
        *did_some_progress = 1;
        if (gfp_mask & __GFP_NOFAIL)
            page = __alloc_pages_cpuset_fallback(gfp_mask, order,
                    ALLOC_NO_WATERMARKS, ac);
    }
out:
    mutex_unlock(&oom_lock);
    return page;
}
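The cascade of early exits before out_of_memory encodes a policy: the OOM killer is invoked only for low‑order, non‑retryable, normal‑zone, node‑unrestricted requests. A condensed model of four of those guards is sketched below; the function name and boolean parameters are invented abstractions of the gfp flags and allocation context, not the kernel's exact semantics.

```c
#include <assert.h>
#include <stdbool.h>

#define COSTLY_ORDER 3  /* the kernel's PAGE_ALLOC_COSTLY_ORDER */

/* Simplified view of the bail-out conditions guarding out_of_memory():
 * costly orders, caller-accepted failure, lowmem requests, and
 * node-restricted allocations all skip the OOM killer. */
static bool may_invoke_oom(unsigned int order, bool retry_mayfail,
                           bool lowmem_request, bool thisnode_only)
{
    if (order > COSTLY_ORDER)
        return false;   /* killing a task rarely frees a huge contiguous block */
    if (retry_mayfail)
        return false;   /* __GFP_RETRY_MAYFAIL: caller accepts failure */
    if (lowmem_request)
        return false;   /* below ZONE_NORMAL: killing tasks won't help DMA zones */
    if (thisnode_only)
        return false;   /* __GFP_THISNODE: a global kill can't satisfy one node */
    return true;
}
```

The common thread is proportionality: the kernel kills a process only when doing so is likely to actually satisfy the failed allocation.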

Overall, the article provides a comprehensive walkthrough of Linux’s page‑reclaim and memory‑compaction mechanisms, detailing the decision‑making process, the interaction between asynchronous and synchronous reclamation, the LRU data structures, and the intricate code paths that maintain system stability under memory pressure.

Recommended Reading:

Deep Dive into C++ Memory Management: Pointers, References, and Allocation

Breaking the Norm: Linux Kernel’s New Data Structure – the Maple Tree

In‑Depth Linux Kernel Source Exploration: Architecture and Design Secrets

High‑Performance Network Communication: Socket Principles and Practice

Decoding Linux Kernel Magic: The Secrets and Applications of Memory Barriers

Linux Kernel Filesystems: Storage Magic Stronger Than All‑Purpose Filesystems

Demystifying Memory Allocation: Inside the Implementation of malloc()

Exploring Core Network Technologies: Hand‑Written TCP/IP User‑Space Stack for Performance Boost

Tags: Memory Management, Compaction, Kernel, Linux, Page Reclaim
Written by

Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
