
Inside Linux Memory Compaction: A Source‑Code Walkthrough of Memory Management

The article explains how Linux manages per-zone page watermarks, when the allocator wakes kswapd, and the exact conditions that trigger direct compaction via __alloc_pages_direct_compact(), then walks through the core compaction functions (try_to_compact_pages, compact_zone_order, compact_zone, and the page-migration helpers), illustrated with flow diagrams and real kernel code.


1. Memory Page Watermarks

The page allocator maintains three watermark levels (High, Low, Min) for each zone. kswapd is woken when a zone's free pages drop below the Low watermark and goes back to sleep once they rise above High. Memory below the Min watermark is reserved for the system; ordinary allocation requests cannot dip into it unless special conditions apply (for example ALLOC_NO_WATERMARKS).

The following diagram shows the zone‑watermark management flow.
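
In code, these levels are per-zone values read through small helper macros, and zone_watermark_ok() compares a zone's free pages against one of them when deciding whether an allocation may proceed. A minimal sketch, paraphrased from include/linux/mmzone.h of a v5.x kernel (field names vary across versions; older kernels use zone->watermark[] without the boost term):

/* Sketch of the per-zone watermark levels and accessors (v5.x-era). */
enum zone_watermarks {
    WMARK_MIN,
    WMARK_LOW,
    WMARK_HIGH,
    NR_WMARK
};

#define wmark_pages(z, i)   ((z)->_watermark[i] + (z)->watermark_boost)
#define min_wmark_pages(z)  wmark_pages(z, WMARK_MIN)
#define low_wmark_pages(z)  wmark_pages(z, WMARK_LOW)
#define high_wmark_pages(z) wmark_pages(z, WMARK_HIGH)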

2. From Allocation to Direct Compaction

When an allocation on the slow path fails its Min-watermark check, the kernel wakes kswapd to reclaim memory. If reclamation still cannot satisfy the request, the allocator checks three conditions and, if they all hold, calls __alloc_pages_direct_compact() to attempt direct compaction (sketched in the code fragment after the list):

Direct page reclaim is allowed (__GFP_DIRECT_RECLAIM is set in the GFP mask).

The request is costly (order > PAGE_ALLOC_COSTLY_ORDER) or is a nonzero-order non-movable allocation, i.e. it genuinely needs a contiguous block.

The request is not entitled to ignore watermarks and dip into the system reserves (gfp_pfmemalloc_allowed() is false, so no ALLOC_NO_WATERMARKS-style access).
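
These three checks correspond roughly to the following fragment of __alloc_pages_slowpath() in mm/page_alloc.c (paraphrased from a v5.x kernel; can_direct_reclaim and costly_order are locals computed earlier in that function, and the exact shape varies by version):

    /*
     * Try direct compaction first for costly, direct-reclaim-capable
     * requests that are not entitled to dip into the reserves.
     */
    if (can_direct_reclaim &&
            (costly_order ||
               (order > 0 && ac->migratetype != MIGRATE_MOVABLE))
            && !gfp_pfmemalloc_allowed(gfp_mask)) {
        page = __alloc_pages_direct_compact(gfp_mask, order,
                                            alloc_flags, ac,
                                            INIT_COMPACT_PRIORITY,
                                            &compact_result);
        if (page)
            goto got_pg;
    }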

2.1 __alloc_pages_direct_compact()

This function first calls try_to_compact_pages() to compact memory; the attempt is bracketed by psi_memstall_enter()/psi_memstall_leave() for pressure accounting and by memalloc_noreclaim_save()/restore(). If compaction captures a page, prep_new_page() readies it for the caller; otherwise the allocator retries get_page_from_freelist() in case compaction produced a large enough free block.

static struct page *__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
    unsigned int alloc_flags, const struct alloc_context *ac,
    enum compact_priority prio, enum compact_result *compact_result)
{
    struct page *page = NULL;
    unsigned long pflags;
    unsigned int noreclaim_flag;

    if (!order)
        return NULL;

    psi_memstall_enter(&pflags);
    noreclaim_flag = memalloc_noreclaim_save();

    *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
                                         prio, &page);
    memalloc_noreclaim_restore(noreclaim_flag);
    psi_memstall_leave(&pflags);
    count_vm_event(COMPACTSTALL);

    if (page)
        prep_new_page(page, order, gfp_mask, alloc_flags);

    if (!page)
        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

    if (page) {
        struct zone *zone = page_zone(page);
        zone->compact_blockskip_flush = false;
        compaction_defer_reset(zone, order, true);
        count_vm_event(COMPACTSUCCESS);
        return page;
    }
    count_vm_event(COMPACTFAIL);
    cond_resched();
    return NULL;
}

2.2 try_to_compact_pages()

The function iterates over the zones of the allocation context's zonelist, calling compact_zone_order() on each. It keeps the highest result seen so far, skips or defers zones according to the per-zone defer counters, resets those counters on success, and bails out early on success, on a pending fatal signal, or, for async compaction, when rescheduling is needed.

enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
    unsigned int alloc_flags, const struct alloc_context *ac,
    enum compact_priority prio, struct page **capture)
{
    int may_perform_io = gfp_mask & __GFP_IO;
    struct zoneref *z;
    struct zone *zone;
    enum compact_result rc = COMPACT_SKIPPED;

    if (!may_perform_io)
        return COMPACT_SKIPPED;
    trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);

    for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
                                   ac->nodemask) {
        enum compact_result status;
        if (prio > MIN_COMPACT_PRIORITY && compaction_deferred(zone, order)) {
            rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
            continue;
        }
        status = compact_zone_order(zone, order, gfp_mask, prio,
                                   alloc_flags, ac_classzone_idx(ac), capture);
        rc = max(status, rc);
        if (status == COMPACT_SUCCESS) {
            compaction_defer_reset(zone, order, false);
            break;
        }
        if (prio != COMPACT_PRIO_ASYNC &&
            (status == COMPACT_COMPLETE || status == COMPACT_PARTIAL_SKIPPED))
            defer_compaction(zone, order);
        if ((prio == COMPACT_PRIO_ASYNC && need_resched()) ||
            fatal_signal_pending(current))
            break;
    }
    return rc;
}
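
The defer logic referenced above is implemented by compaction_deferred() and defer_compaction(): each failed full pass doubles how many allocation attempts must go by before the zone is compacted again, capped at COMPACT_MAX_DEFER_SHIFT. Abbreviated versions with tracing trimmed (paraphrased from mm/compaction.c of a v5.x kernel):

/* Record a compaction failure and back off exponentially. */
void defer_compaction(struct zone *zone, int order)
{
    zone->compact_considered = 0;
    zone->compact_defer_shift++;

    if (order < zone->compact_order_failed)
        zone->compact_order_failed = order;

    if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
        zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
}

/* Returns true if compaction should be skipped for now. */
bool compaction_deferred(struct zone *zone, int order)
{
    unsigned long defer_limit = 1UL << zone->compact_defer_shift;

    if (order < zone->compact_order_failed)
        return false;

    /* Avoid possible overflow */
    if (++zone->compact_considered >= defer_limit) {
        zone->compact_considered = defer_limit;
        return false;
    }
    return true;
}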

2.3 compact_zone_order()

This helper builds a compact_control structure, initializes a capture_control, and calls compact_zone() to perform the actual compaction work.

static enum compact_result compact_zone_order(struct zone *zone, int order,
    gfp_t gfp_mask, enum compact_priority prio,
    unsigned int alloc_flags, int classzone_idx, struct page **capture)
{
    enum compact_result ret;
    struct compact_control cc = {
        .order = order,
        .search_order = order,
        .gfp_mask = gfp_mask,
        .zone = zone,
        .mode = (prio == COMPACT_PRIO_ASYNC) ? MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
        .alloc_flags = alloc_flags,
        .classzone_idx = classzone_idx,
        .direct_compaction = true,
        .whole_zone = (prio == MIN_COMPACT_PRIORITY),
        .ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
        .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
    };
    struct capture_control capc = { .cc = &cc, .page = NULL };
    if (capture)
        current->capture_control = &capc;
    ret = compact_zone(&cc, &capc);
    VM_BUG_ON(!list_empty(&cc.freepages));
    VM_BUG_ON(!list_empty(&cc.migratepages));
    *capture = capc.page;
    current->capture_control = NULL;
    return ret;
}

3. Memory Fragmentation

External fragmentation arises when plenty of memory is free but it is scattered in small, non-contiguous chunks. Over time fragmentation grows and large-order allocations start to fail even though total free memory would be sufficient. Pages are classified as unmovable, movable, or reclaimable: movable pages (typically user pages reached through page tables) can be migrated elsewhere by updating their mappings, reclaimable pages can simply be freed, and unmovable pages pin their physical location.
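
This classification is the pageblock migratetype used throughout the allocator; compact_zone() records the requester's type via gfpflags_to_migratetype(), and compact_finished() later checks whether a free page of a compatible type has appeared. Abbreviated from include/linux/mmzone.h; the exact members and their order depend on kernel version and options such as CONFIG_CMA:

/* Abbreviated migratetype enum (v5.x-era); members vary with config. */
enum migratetype {
    MIGRATE_UNMOVABLE,   /* e.g. core kernel allocations; cannot be moved */
    MIGRATE_MOVABLE,     /* user pages reached via page tables; migratable */
    MIGRATE_RECLAIMABLE, /* e.g. reclaimable slab; can be freed instead */
    MIGRATE_PCPTYPES,    /* number of types on the per-cpu lists */
    MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
    MIGRATE_CMA,
#endif
#ifdef CONFIG_MEMORY_ISOLATION
    MIGRATE_ISOLATE,     /* cannot allocate from here */
#endif
    MIGRATE_TYPES
};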

4. Core Compaction Functions

4.1 compact_zone()

This is the heart of the compaction algorithm. It runs two scanners: a migrate scanner that walks up from the low end of the zone collecting in-use movable pages, and a free scanner that walks down from the high end collecting free pages. Isolated movable pages are migrated into the isolated free pages via migrate_pages(); the pass ends when the two scanners meet, when a suitable high-order block becomes available, or when the caller's request has been captured, and any unused isolated free pages are released at the end.

static enum compact_result compact_zone(struct compact_control *cc,
                                      struct capture_control *capc)
{
    unsigned long start_pfn = cc->zone->zone_start_pfn;
    unsigned long end_pfn = zone_end_pfn(cc->zone);
    unsigned long last_migrated_pfn;
    enum compact_result ret;
    bool sync = cc->mode != MIGRATE_ASYNC;
    bool update_cached;

    cc->total_migrate_scanned = 0;
    cc->total_free_scanned = 0;
    cc->nr_migratepages = 0;
    cc->nr_freepages = 0;
    INIT_LIST_HEAD(&cc->freepages);
    INIT_LIST_HEAD(&cc->migratepages);
    cc->migratetype = gfpflags_to_migratetype(cc->gfp_mask);

    ret = compaction_suitable(cc->zone, cc->order, cc->alloc_flags,
                              cc->classzone_idx);
    /* Either no compaction is needed or it cannot help right now */
    if (ret == COMPACT_SUCCESS || ret == COMPACT_SKIPPED)
        return ret;

    /* initialise scan positions */
    if (cc->whole_zone) {
        cc->migrate_pfn = start_pfn;
        cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
    } else {
        cc->migrate_pfn = cc->zone->compact_cached_migrate_pfn[sync];
        cc->free_pfn = cc->zone->compact_cached_free_pfn;
        if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn)
            cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
        if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
            cc->migrate_pfn = start_pfn;
            cc->zone->compact_cached_migrate_pfn[0] = start_pfn;
            cc->zone->compact_cached_migrate_pfn[1] = start_pfn;
        }
        if (cc->migrate_pfn <= cc->zone->compact_init_migrate_pfn)
            cc->whole_zone = true;
    }
    last_migrated_pfn = 0;
    update_cached = !sync &&
        cc->zone->compact_cached_migrate_pfn[0] == cc->zone->compact_cached_migrate_pfn[1];

    trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
                             cc->free_pfn, end_pfn, sync);
    migrate_prep_local();

    while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) {
        int err;
        unsigned long start_pfn = cc->migrate_pfn;
        cc->rescan = false;
        if (pageblock_start_pfn(last_migrated_pfn) ==
            pageblock_start_pfn(start_pfn))
            cc->rescan = true;

        switch (isolate_migratepages(cc)) {
        case ISOLATE_ABORT:
            ret = COMPACT_CONTENDED;
            putback_movable_pages(&cc->migratepages);
            cc->nr_migratepages = 0;
            last_migrated_pfn = 0;
            goto out;
        case ISOLATE_NONE:
            if (update_cached) {
                cc->zone->compact_cached_migrate_pfn[1] =
                    cc->zone->compact_cached_migrate_pfn[0];
            }
            goto check_drain;
        case ISOLATE_SUCCESS:
            update_cached = false;
            last_migrated_pfn = start_pfn;
            break;
        }

        err = migrate_pages(&cc->migratepages, compaction_alloc,
                           compaction_free, (unsigned long)cc,
                           cc->mode, MR_COMPACTION);
        trace_mm_compaction_migratepages(cc->nr_migratepages, err,
                                        &cc->migratepages);
        cc->nr_migratepages = 0;
        if (err) {
            putback_movable_pages(&cc->migratepages);
            if (err == -ENOMEM && !compact_scanners_met(cc)) {
                ret = COMPACT_CONTENDED;
                goto out;
            }
            if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
                cc->migrate_pfn = block_end_pfn(cc->migrate_pfn - 1, cc->order);
                last_migrated_pfn = 0;
            }
        }

    check_drain:
        if (cc->order > 0 && last_migrated_pfn) {
            unsigned long current_block_start =
                block_start_pfn(cc->migrate_pfn, cc->order);
            if (last_migrated_pfn < current_block_start) {
                int cpu = get_cpu();
                lru_add_drain_cpu(cpu);
                drain_local_pages(cc->zone);
                put_cpu();
                last_migrated_pfn = 0;
            }
        }
        if (capc && capc->page) {
            ret = COMPACT_SUCCESS;
            break;
        }
    }

out:
    if (cc->nr_freepages > 0) {
        unsigned long free_pfn = release_freepages(&cc->freepages);
        cc->nr_freepages = 0;
        VM_BUG_ON(free_pfn == 0);
        free_pfn = pageblock_start_pfn(free_pfn);
        if (free_pfn > cc->zone->compact_cached_free_pfn)
            cc->zone->compact_cached_free_pfn = free_pfn;
    }
    count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
    count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);
    trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
                          cc->free_pfn, end_pfn, sync, ret);
    return ret;
}

4.2 compaction_suitable()

Decides whether compacting a zone is worthwhile based on its watermarks and, for costly orders (order > PAGE_ALLOC_COSTLY_ORDER), the external fragmentation index: an index at or below the extfrag threshold means failures stem from a lack of free memory rather than from fragmentation, so compaction is skipped (the threshold is tunable via /proc/sys/vm/extfrag_threshold).

enum compact_result compaction_suitable(struct zone *zone, int order,
    unsigned int alloc_flags, int classzone_idx)
{
    enum compact_result ret;
    int fragindex;

    ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
                               zone_page_state(zone, NR_FREE_PAGES));
    if (ret == COMPACT_CONTINUE && order > PAGE_ALLOC_COSTLY_ORDER) {
        fragindex = fragmentation_index(zone, order);
        if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
            ret = COMPACT_NOT_SUITABLE_ZONE;
    }
    trace_mm_compaction_suitable(zone, order, ret);
    if (ret == COMPACT_NOT_SUITABLE_ZONE)
        ret = COMPACT_SKIPPED;
    return ret;
}

4.3 __compaction_suitable()

Checks whether the allocation would already succeed against its watermark (COMPACT_SUCCESS, so no compaction is needed), then whether the zone holds enough free pages, including the compact_gap() headroom, for compaction to have a chance of working; if not, it returns COMPACT_SKIPPED, otherwise COMPACT_CONTINUE.

static enum compact_result __compaction_suitable(struct zone *zone, int order,
    unsigned int alloc_flags, int classzone_idx, unsigned long wmark_target)
{
    unsigned long watermark;

    if (is_via_compact_memory(order))
        return COMPACT_CONTINUE;

    watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
    if (zone_watermark_ok(zone, order, watermark, classzone_idx, alloc_flags))
        return COMPACT_SUCCESS;

    watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
        low_wmark_pages(zone) : min_wmark_pages(zone);
    watermark += compact_gap(order);

    if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
                           ALLOC_CMA, wmark_target))
        return COMPACT_SKIPPED;
    return COMPACT_CONTINUE;
}
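
The compact_gap() headroom added above comes from include/linux/compaction.h: compaction migrates up to 1 << order pages at a time and needs the same number of free target pages, so it asks for roughly twice the request size.

/* From include/linux/compaction.h (v5.x): free-page headroom for compaction. */
static inline unsigned long compact_gap(unsigned int order)
{
    return 2UL << order;
}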

4.4 compact_finished()

compact_finished() is a thin wrapper around __compact_finished(), which decides whether the current pass is done: the migrate and free scanners have met, or a free page of the requested order and migratetype has already appeared on the free lists. COMPACT_NO_SUITABLE_PAGE is mapped back to COMPACT_CONTINUE so the main loop in compact_zone() keeps scanning.

static enum compact_result compact_finished(struct compact_control *cc)
{
    int ret = __compact_finished(cc);
    trace_mm_compaction_finished(cc->zone, cc->order, ret);
    if (ret == COMPACT_NO_SUITABLE_PAGE)
        ret = COMPACT_CONTINUE;
    return ret;
}
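
The real decision logic lives in __compact_finished(), which is not reproduced in full above. A heavily abbreviated sketch (paraphrased from mm/compaction.c of a v5.x kernel; contention handling, pageblock-boundary details, and fallback-migratetype stealing are omitted):

static enum compact_result __compact_finished(struct compact_control *cc)
{
    unsigned int order;
    const int migratetype = cc->migratetype;

    /* The run completes once the migrate and free scanners meet. */
    if (compact_scanners_met(cc)) {
        /* Let the next compaction start anew. */
        reset_cached_positions(cc->zone);
        if (cc->direct_compaction)
            cc->zone->compact_blockskip_flush = true;
        return cc->whole_zone ? COMPACT_COMPLETE : COMPACT_PARTIAL_SKIPPED;
    }

    /* Compaction triggered via /proc/sys/vm/compact_memory never stops early. */
    if (is_via_compact_memory(cc->order))
        return COMPACT_CONTINUE;

    /* Direct compactor: has a suitable free page appeared? */
    for (order = cc->order; order < MAX_ORDER; order++) {
        struct free_area *area = &cc->zone->free_area[order];

        if (!free_area_empty(area, migratetype))
            return COMPACT_SUCCESS;
    }

    /* Mapped back to COMPACT_CONTINUE by the compact_finished() wrapper. */
    return COMPACT_NO_SUITABLE_PAGE;
}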

4.5 isolate_migratepages()

Scans the zone for pages that can be migrated. It starts from the cached position (or from a fast lookup via fast_find_migrateblock()), walks pageblocks toward the free scanner, and isolates candidate pages onto cc->migratepages with isolate_migratepages_block(). Pageblocks marked with skip hints or unsuitable as migration sources are passed over, and the scan aborts if isolation is contended.

static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
{
    unsigned long block_start_pfn, block_end_pfn, low_pfn;
    struct page *page;
    bool fast_find_block;
    unsigned int isolate_mode = (sysctl_compact_unevictable_allowed ?
                                ISOLATE_UNEVICTABLE : 0) |
                               (cc->mode != MIGRATE_SYNC ?
                                ISOLATE_ASYNC_MIGRATE : 0);

    low_pfn = fast_find_migrateblock(cc);
    block_start_pfn = pageblock_start_pfn(low_pfn);
    if (block_start_pfn < cc->zone->zone_start_pfn)
        block_start_pfn = cc->zone->zone_start_pfn;
    fast_find_block = low_pfn != cc->migrate_pfn && !cc->fast_search_fail;
    block_end_pfn = pageblock_end_pfn(low_pfn);

    for (; block_end_pfn <= cc->free_pfn;
         fast_find_block = false,
         low_pfn = block_end_pfn,
         block_start_pfn = block_end_pfn,
         block_end_pfn += pageblock_nr_pages) {
        if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages)))
            cond_resched();
        page = pageblock_pfn_to_page(block_start_pfn, block_end_pfn, cc->zone);
        if (!page)
            continue;
        if (IS_ALIGNED(low_pfn, pageblock_nr_pages) &&
            !fast_find_block && !isolation_suitable(cc, page))
            continue;
        if (!suitable_migration_source(cc, page)) {
            update_cached_migrate(cc, block_end_pfn);
            continue;
        }
        low_pfn = isolate_migratepages_block(cc, low_pfn, block_end_pfn,
                                            isolate_mode);
        if (!low_pfn)
            return ISOLATE_ABORT;
        break;
    }
    cc->migrate_pfn = low_pfn;
    return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}

5. Page Migration

5.1 migrate_pages()

Moves the pages on the migratepages list onto newly allocated target pages. It makes up to ten passes over the list, handles hugetlb pages separately, and, when allocating a target for a transparent huge page fails with -ENOMEM, splits the THP and retries the resulting base pages.

int migrate_pages(struct list_head *from, new_page_t get_new_page,
                 free_page_t put_new_page, unsigned long private,
                 enum migrate_mode mode, int reason)
{
    int retry = 1, nr_failed = 0, nr_succeeded = 0, pass = 0;
    struct page *page, *page2;
    int swapwrite = current->flags & PF_SWAPWRITE;
    int rc;

    if (!swapwrite)
        current->flags |= PF_SWAPWRITE;
    for (pass = 0; pass < 10 && retry; pass++) {
        retry = 0;
        list_for_each_entry_safe(page, page2, from, lru) {
        retry_label:
            cond_resched();
            if (PageHuge(page))
                rc = unmap_and_move_huge_page(get_new_page, put_new_page,
                                             private, page, pass > 2,
                                             mode, reason);
            else
                rc = unmap_and_move(get_new_page, put_new_page,
                                    private, page, pass > 2, mode, reason);
            switch (rc) {
            case -ENOMEM:
                if (PageTransHuge(page) && !PageHuge(page)) {
                    lock_page(page);
                    rc = split_huge_page_to_list(page, from);
                    unlock_page(page);
                    if (!rc) {
                        list_safe_reset_next(page, page2, lru);
                        goto retry_label;
                    }
                }
                nr_failed++;
                goto out;
            case -EAGAIN:
                retry++;
                break;
            case MIGRATEPAGE_SUCCESS:
                nr_succeeded++;
                break;
            default:
                nr_failed++;
                break;
            }
        }
    }
    nr_failed += retry;
    rc = nr_failed;
out:
    if (nr_succeeded)
        count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
    if (nr_failed)
        count_vm_events(PGMIGRATE_FAIL, nr_failed);
    trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
    if (!swapwrite)
        current->flags &= ~PF_SWAPWRITE;
    return rc;
}

5.2 compaction_alloc()

Callback used during migration to obtain a free page from the freepages list, isolating more free pages if the list is empty.

static struct page *compaction_alloc(struct page *migratepage,
                                    unsigned long data)
{
    struct compact_control *cc = (struct compact_control *)data;
    struct page *freepage;

    if (list_empty(&cc->freepages)) {
        isolate_freepages(cc);
        if (list_empty(&cc->freepages))
            return NULL;
    }
    freepage = list_entry(cc->freepages.next, struct page, lru);
    list_del(&freepage->lru);
    cc->nr_freepages--;
    return freepage;
}
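
Its counterpart compaction_free(), passed as the put_new_page callback to migrate_pages() above, hands an unused target page back to the list, roughly as in mm/compaction.c:

/* Counterpart of compaction_alloc(), roughly as in mm/compaction.c (v5.x). */
static void compaction_free(struct page *page, unsigned long data)
{
    struct compact_control *cc = (struct compact_control *)data;

    /* The target page was not consumed; return it to the free list. */
    list_add(&page->lru, &cc->freepages);
    cc->nr_freepages++;
}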

5.3 isolate_freepages()

Finds free pages for migration by scanning page‑blocks backwards from cc->free_pfn, respecting the current compaction mode and updating cached positions.

static void isolate_freepages(struct compact_control *cc)
{
    struct zone *zone = cc->zone;
    struct page *page;
    unsigned long block_start_pfn, isolate_start_pfn, block_end_pfn, low_pfn;
    struct list_head *freelist = &cc->freepages;
    unsigned int stride = (cc->mode == MIGRATE_ASYNC) ? COMPACT_CLUSTER_MAX : 1;

    isolate_start_pfn = fast_isolate_freepages(cc);
    if (cc->nr_freepages)
        goto splitmap;

    isolate_start_pfn = cc->free_pfn;
    block_start_pfn = pageblock_start_pfn(isolate_start_pfn);
    block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
                        zone_end_pfn(zone));
    low_pfn = pageblock_end_pfn(cc->migrate_pfn);

    for (; block_start_pfn >= low_pfn;
         block_end_pfn = block_start_pfn,
         block_start_pfn -= pageblock_nr_pages,
         isolate_start_pfn = block_start_pfn) {
        unsigned long nr_isolated;
        if (!(block_start_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages)))
            cond_resched();
        page = pageblock_pfn_to_page(block_start_pfn, block_end_pfn, zone);
        if (!page)
            continue;
        if (!suitable_migration_target(cc, page))
            continue;
        if (!isolation_suitable(cc, page))
            continue;
        nr_isolated = isolate_freepages_block(cc, &isolate_start_pfn,
                                             block_end_pfn, freelist,
                                             stride, false);
        if (isolate_start_pfn == block_end_pfn)
            update_pageblock_skip(cc, page, block_start_pfn);
        if (cc->nr_freepages >= cc->nr_migratepages) {
            if (isolate_start_pfn >= block_end_pfn)
                isolate_start_pfn = block_start_pfn - pageblock_nr_pages;
            break;
        } else if (isolate_start_pfn < block_end_pfn) {
            break;
        }
        if (nr_isolated) {
            stride = 1;
            continue;
        }
        stride = min_t(unsigned int, COMPACT_CLUSTER_MAX, stride << 1);
    }
    cc->free_pfn = isolate_start_pfn;

splitmap:
    split_map_pages(freelist);
}

The article concludes that the core of Linux memory compaction lies in the interplay of watermark checks, the two-scanner algorithm in compact_zone(), and the migration helpers that move pages to create larger contiguous free blocks. Understanding these functions and their control flow is essential for kernel developers working on memory management or performance tuning.
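
In practice, the counters incremented in the code above (COMPACTSTALL, COMPACTSUCCESS, COMPACTFAIL, and the scanner totals) surface in /proc/vmstat as compact_stall, compact_success, compact_fail, compact_migrate_scanned, and compact_free_scanned, and the whole path can be exercised by writing to /proc/sys/vm/compact_memory. A minimal user-space sketch, assuming root privileges and a kernel built with CONFIG_COMPACTION=y:

/* Trigger full compaction, then print the compaction counters from vmstat. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/proc/sys/vm/compact_memory", "w");

    if (f) {                       /* writing "1" compacts all zones */
        fputs("1", f);
        fclose(f);
    }

    f = fopen("/proc/vmstat", "r");
    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "compact_", 8) == 0)
            fputs(line, stdout);   /* compact_stall, compact_success, ... */
    fclose(f);
    return 0;
}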
