
Understanding Linux Page Reclaim: LRU, Second‑Chance, and Direct Reclaim Walkthrough

This article explains Linux's page reclamation mechanisms, covering the LRU list algorithm, the second‑chance method with PG_active and PG_referenced flags, and the direct reclaim path triggered by alloc_page, including the role of kswapd, waiting queues, and key kernel functions such as lru_cache_add, mark_page_accessed, try_to_free_pages, and throttle_direct_reclaim.

Linux Kernel Journey

Page Reclaim Overview

When physical memory runs low, the Linux kernel evicts rarely used pages, writing anonymous pages out to a swap partition and dropping or writing back file‑backed pages. This process is known as page swapping or, more generally, page reclaim.

Page Swap Algorithms

LRU list algorithm

Second‑chance algorithm

1. LRU List Algorithm

#define LRU_BASE 0
#define LRU_ACTIVE 1
#define LRU_FILE 2

// enum type lru_list lists the various LRU list kinds
enum lru_list {
    LRU_INACTIVE_ANON = LRU_BASE,
    LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
    LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
    LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
    LRU_UNEVICTABLE,
    NR_LRU_LISTS
};

// a lruvec holds a full set of LRU lists; there is one per NUMA node (and one per memory cgroup)
struct lruvec {
    struct list_head lists[NR_LRU_LISTS];
    ...
};

// each NUMA node (pglist_data) embeds its own lruvec
typedef struct pglist_data {
    ...
    struct lruvec __lruvec;
    ...
} pg_data_t;
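
To make the layout concrete, here is a minimal sketch (illustration only, not shippable kernel code: it ignores memory cgroups, which carry their own lruvecs, and takes no LRU lock) of reaching one node's inactive anonymous list through the structures above:

/* Sketch: walk one node's inactive anonymous LRU list.
 * Assumes no memory cgroups and, unlike real code, takes no lru lock. */
static void dump_inactive_anon(int nid)
{
    struct lruvec *lruvec = &NODE_DATA(nid)->__lruvec;
    struct list_head *head = &lruvec->lists[LRU_INACTIVE_ANON];
    struct page *page;

    list_for_each_entry(page, head, lru)
        pr_info("pfn %lu is on the inactive anon list\n", page_to_pfn(page));
}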

How does the LRU list age pages?

void lru_cache_add(struct page *page)
{
    struct pagevec *pvec;

    get_page(page);
    local_lock(&lru_pvecs.lock);                 // protect the per‑CPU pagevec
    pvec = this_cpu_ptr(&lru_pvecs.lru_add);
    if (pagevec_add_and_need_flush(pvec, page))  // batch until the pagevec is full
        __pagevec_lru_add(pvec);                 // then move the batch onto the LRU list
    local_unlock(&lru_pvecs.lock);
}

The function uses a pagevec array to batch pages before adding them to the LRU list.

#define PAGEVEC_SIZE 15

struct pagevec {
    unsigned char nr;
    bool percpu_pvec_drained;
    struct page *pages[PAGEVEC_SIZE];
};

The pagevec_add_and_need_flush() function appends a page to pvec->pages[]; once the array is full it returns true and lru_cache_add() calls __pagevec_lru_add() to move the accumulated pages onto the appropriate LRU list. To take a page back off an LRU list, lru_to_page() and list_del(&page->lru) are used together.

#define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
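
As a small sketch of the consumer side just described (simplified: real reclaim code holds the LRU lock while doing this), the oldest page is detached from the tail of a list like so:

/* Sketch: detach the oldest page from an LRU list.
 * Simplified; real reclaim code holds the lru lock around this. */
static struct page *take_oldest_page(struct list_head *lru_list)
{
    struct page *page;

    if (list_empty(lru_list))
        return NULL;

    page = lru_to_page(lru_list);   // list_entry() on the tail (->prev) element
    list_del(&page->lru);           // unlink it from the LRU list
    return page;
}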

2. Second‑Chance Algorithm

The kernel implements a second‑chance scheme using two flag bits on each struct page:

PG_active – indicates whether the page is currently on an active LRU list.

PG_referenced – indicates whether the page has recently been referenced.

When a page is examined, mark_page_accessed() decides what to do based on these flags:

void mark_page_accessed(struct page *page)
{
    page = compound_head(page);
    if (!PageReferenced(page)) {
        SetPageReferenced(page);              // first access: just remember it
    } else if (PageUnevictable(page)) {
        /* ... */
    } else if (!PageActive(page)) {
        if (PageLRU(page))
            activate_page(page);              // second access: promote to the active list
        else
            __lru_cache_activate_page(page);  // still in a pagevec: activate it there
        ClearPageReferenced(page);
        workingset_activation(page);
    }
    if (page_is_idle(page))
        clear_page_idle(page);
}

Key cases:

If PG_active == 0 && PG_referenced == 1, the page is moved to the active LRU list, PG_active is set to 1, and PG_referenced is cleared.

If PG_referenced == 0, the PG_referenced flag is set.
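
The promotion half of this state machine can be summarized with a tiny model (illustration only, not kernel code; the two booleans stand in for the PG_referenced and PG_active bits):

/* Toy model of the second-chance promotion path in mark_page_accessed():
 * a page has to be seen twice before it is moved to the active list. */
struct toy_page {
    bool active;       // stands in for PG_active
    bool referenced;   // stands in for PG_referenced
};

static void toy_mark_accessed(struct toy_page *p)
{
    if (!p->referenced) {
        p->referenced = true;       // first access: set PG_referenced
    } else if (!p->active) {
        p->active = true;           // second access: promote to the active list
        p->referenced = false;      // and clear PG_referenced for the next round
    }
}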

Other functions involved in the second‑chance flow include page_check_references(), page_referenced() (which walks all PTEs that map the page via the RMAP subsystem), and page_referenced_one(), which counts the referenced PTEs for a single mapping.

3. Triggering Page Reclaim

The kernel can start reclaim in three ways:

Direct reclaim: When alloc_page() cannot satisfy a request because memory is scarce, the allocator falls back to the direct reclaim path.

Periodic reclaim (kswapd): The kswapd kernel thread wakes up when low‑water thresholds are crossed and reclaims pages asynchronously.

Slab reclaim: a dedicated “slab‑reaper” periodically returns unused objects to the slab allocator; slab caches are also shrunk through shrinker callbacks invoked during reclaim (see the sketch after this list).

Direct reclaim is synchronous and blocks the calling process.
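
As a hedged illustration of the slab side, a subsystem exposes reclaimable objects by registering a shrinker. The struct shrinker fields below are real kernel API, but the callbacks are placeholders, and the exact register_shrinker() signature varies between kernel versions:

/* Placeholder shrinker: tells reclaim how many objects could be freed
 * (count_objects) and frees up to sc->nr_to_scan of them (scan_objects). */
static unsigned long demo_count_objects(struct shrinker *s,
                                        struct shrink_control *sc)
{
    return 0;                // nothing freeable in this demo
}

static unsigned long demo_scan_objects(struct shrinker *s,
                                       struct shrink_control *sc)
{
    return SHRINK_STOP;      // a real callback returns the number freed
}

static struct shrinker demo_shrinker = {
    .count_objects = demo_count_objects,
    .scan_objects  = demo_scan_objects,
    .seeks         = DEFAULT_SEEKS,
};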

Direct Reclaim Waiting Queue

If a process cannot obtain memory during direct reclaim, it may be placed on a waiting queue. When the free pages on a node satisfy the request, kswapd wakes the process.
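
The wake‑up side can be pictured with a short sketch, loosely modeled on what kswapd does in balance_pgdat()/prepare_kswapd_sleep() in mm/vmscan.c (simplified; the helper name here is illustrative):

/* Sketch: once the node can satisfy direct reclaimers again, kswapd
 * wakes every process parked on pgdat->pfmemalloc_wait. */
static void wake_throttled_reclaimers(pg_data_t *pgdat)
{
    if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
        allow_direct_reclaim(pgdat))
        wake_up_all(&pgdat->pfmemalloc_wait);
}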

The execution path for direct reclaim is:

__alloc_pages_slowpath()
    -> __alloc_pages_direct_reclaim()
    -> __perform_reclaim()
    -> try_to_free_pages()
    -> do_try_to_free_pages()
    -> shrink_zones()
    -> shrink_node()
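
A hedged sketch of the top of that chain, loosely modeled on __alloc_pages_direct_reclaim() in mm/page_alloc.c (error handling, statistics, and the drain‑and‑retry path are omitted):

/* Sketch: run direct reclaim, then retry the freelists. Simplified. */
static struct page *direct_reclaim_then_retry(gfp_t gfp_mask, unsigned int order,
                                              unsigned int alloc_flags,
                                              const struct alloc_context *ac,
                                              unsigned long *did_some_progress)
{
    /* __perform_reclaim() ends up in try_to_free_pages() */
    *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
    if (unlikely(!*did_some_progress))
        return NULL;                       // nothing reclaimed, give up here

    /* with some pages (hopefully) freed, try the freelists again */
    return get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
}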

All eligible nodes may have their kswapd threads awakened when two conditions hold:

The allocation flags do not contain __GFP_NO_KSWAPD (this flag appears only in transparent huge‑page allocations).

At least one zone on the node has free pages below the high‑watermark (or the node needs memory compaction).

kswapd stops reclaiming once every zone satisfies zone_free_pages > high_watermark + reserved_pages and then wakes the waiting processes.
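
A hedged sketch of that stopping condition, using the real helpers zone_watermark_ok() and high_wmark_pages(); the actual policy in mm/vmscan.c (e.g. pgdat_balanced()) differs across kernel versions and is more nuanced than this:

/* Sketch: treat the node as balanced once every managed zone up to the
 * requested zone index sits above its high watermark (lowmem reserves
 * are accounted for inside zone_watermark_ok()). */
static bool node_is_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
{
    int i;

    for (i = 0; i <= highest_zoneidx; i++) {
        struct zone *zone = pgdat->node_zones + i;

        if (!managed_zone(zone))
            continue;

        if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
                               highest_zoneidx, 0))
            return false;
    }
    return true;
}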

try_to_free_pages()

unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                               gfp_t gfp_mask, nodemask_t *nodemask)
{
    unsigned long nr_reclaimed;
    struct scan_control sc = {
        .nr_to_reclaim = SWAP_CLUSTER_MAX,
        .gfp_mask = current_gfp_context(gfp_mask),
        .reclaim_idx = gfp_zone(gfp_mask),
        .order = order,
        .nodemask = nodemask,
        .priority = DEF_PRIORITY,
        .may_writepage = !laptop_mode,
        .may_unmap = 1,
        .may_swap = 1,
    };

    if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
        return 1;

    set_task_reclaim_state(current, &sc.reclaim_state);
    trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);

    nr_reclaimed = do_try_to_free_pages(zonelist, &sc);

    trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
    set_task_reclaim_state(current, NULL);

    return nr_reclaimed;
}

The function initializes reclaim parameters, checks whether reclaim should be throttled, records the start of reclaim, performs the actual reclaim via do_try_to_free_pages(), records the end, and returns the number of reclaimed pages.
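
The heart of do_try_to_free_pages() is a priority loop; the sketch below (simplified, omitting writeback throttling and compaction checks) shows its shape, with scanning getting more aggressive as sc->priority drops from DEF_PRIORITY toward 0:

/* Sketch of the retry loop inside do_try_to_free_pages(); simplified. */
static unsigned long reclaim_priority_loop(struct zonelist *zonelist,
                                           struct scan_control *sc)
{
    do {
        sc->nr_scanned = 0;
        shrink_zones(zonelist, sc);              // scan the eligible zones/nodes

        if (sc->nr_reclaimed >= sc->nr_to_reclaim)
            break;                               // reclaimed enough, stop early
    } while (--sc->priority >= 0);               // otherwise scan harder next pass

    return sc->nr_reclaimed;
}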

throttle_direct_reclaim()

static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
                                    nodemask_t *nodemask)
{
    struct zoneref *z;
    struct zone *zone;
    pg_data_t *pgdat = NULL;

    if (current->flags & PF_KTHREAD)
        goto out;
    if (fatal_signal_pending(current))
        goto out;

    for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                  gfp_zone(gfp_mask), nodemask) {
        if (zone_idx(zone) > ZONE_NORMAL)
            continue;
        pgdat = zone->zone_pgdat;
        if (allow_direct_reclaim(pgdat))
            goto out;
        break;
    }

    if (!pgdat)
        goto out;

    count_vm_event(PGSCAN_DIRECT_THROTTLE);

    if (!(gfp_mask & __GFP_FS))
        wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
                                         allow_direct_reclaim(pgdat), HZ);
    else
        wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
                            allow_direct_reclaim(pgdat));

    if (fatal_signal_pending(current))
        return true;
out:
    return false;
}

This routine first checks whether the current process is a kernel thread or has a fatal signal pending; in either case it skips throttling. It then walks the zonelist, considering only zones at or below ZONE_NORMAL, and grabs the node's pgdat. If the node is unbalanced, the process is put to sleep on pgdat->pfmemalloc_wait, either interruptibly with a one‑second timeout (when __GFP_FS is clear) or killably (when __GFP_FS is set), and the function finally reports whether reclaim was throttled.

Key Steps Inside throttle_direct_reclaim()

If the caller is a kernel thread (PF_KTHREAD) or already has a fatal signal pending, skip throttling entirely.

Iterate over the zones in the zonelist, considering only zones at or below ZONE_NORMAL, and obtain the node's pgdat.

If the node allows direct reclaim, exit early.

Record a direct‑reclaim throttle event.

Depending on the presence of __GFP_FS, put the process into an interruptible or killable wait queue until the node becomes balanced.

Return true if the wait was interrupted by a signal (reclaim throttled), otherwise false.
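
The balance test the waiters keep re‑checking is allow_direct_reclaim(); here is a hedged, simplified sketch of its logic (the real function in mm/vmscan.c also bails out after repeated kswapd failures):

/* Sketch: the node allows direct reclaim when free pages in the lower zones
 * exceed half of the summed min watermarks (the pfmemalloc reserve);
 * otherwise kswapd is kicked and callers stay throttled. Simplified. */
static bool node_allows_direct_reclaim(pg_data_t *pgdat)
{
    unsigned long pfmemalloc_reserve = 0, free_pages = 0;
    int i;

    for (i = 0; i <= ZONE_NORMAL; i++) {
        struct zone *zone = &pgdat->node_zones[i];

        if (!managed_zone(zone))
            continue;

        pfmemalloc_reserve += min_wmark_pages(zone);
        free_pages += zone_page_state(zone, NR_FREE_PAGES);
    }

    if (!pfmemalloc_reserve)
        return true;                     // no reserves to compare against

    if (free_pages > pfmemalloc_reserve / 2)
        return true;                     // balanced enough for direct reclaim

    /* still short on memory: make sure kswapd is awake and working on it */
    if (waitqueue_active(&pgdat->kswapd_wait))
        wake_up_interruptible(&pgdat->kswapd_wait);

    return false;
}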
