Efficient Dynamic Memory Reclamation Techniques in Linux
The article provides an in‑depth technical analysis of Linux’s memory reclamation system, covering zones, watermarks, page flags, LRU lists, the shrink_zone workflow, kswapd, direct reclaim, OOM handling, page‑cache structures, and practical tuning tips for optimal performance.
1. Overview of Memory Reclamation
In a Linux system, memory is the lifeblood of the machine: it temporarily holds data for the CPU and bridges the gap between the CPU and slower external storage. Its state directly influences overall speed and stability; poor memory management degrades performance and can even crash the system.
2. Linux Memory Reclamation Mechanism
2.1 Main Objects of Reclamation
The kernel focuses on two primary page types: anonymous pages (heap, stack, and other process-private mappings), which have no backing file, and file pages (the kernel's cache of disk data), which are backed by files on disk. Anonymous pages can only be reclaimed by writing them to the swap area; clean file pages can be freed immediately without disk I/O, while dirty file pages must first be written back.
2.2 Zone‑Based Reclamation Rules
Each memory zone maintains three watermarks stored in zone->watermark[NR_WMARK]:
WMARK_MIN : The floor used by the slow allocation path when the fast path fails; allocations that push free pages below this point must reclaim memory themselves.
WMARK_LOW : The default threshold for fast allocation; when free pages fall below it, the kernel wakes background reclamation.
WMARK_HIGH : The desired free-page level; reclamation continues until free pages rise above this value.
The relationship is min < low < high. Values are computed during boot based on total memory and per‑zone page counts (e.g., low = min + min/4, high = min + min/2) and can be inspected via /proc/zoneinfo.
2.3 Important Page Flags
PG_lru: Page is on an LRU list.
PG_referenced: Recently accessed (used mainly for file pages).
PG_dirty: Modified but not yet written back.
PG_active: Belongs to an active LRU list.
PG_private: page->private holds private data.
PG_writeback: Currently being written to disk.
PG_swapbacked: Backed by swap rather than a file.
PG_swapcache: Page is in the swap cache.
PG_reclaim: Marked for reclamation.
PG_mlocked: Locked in memory and exempt from reclaim.
2.4 Scanning Control Structure
/* Scan control structure used for reclamation and compaction */
struct scan_control {
        unsigned long nr_to_reclaim;    /* pages we aim to free */
        gfp_t gfp_mask;                 /* allocation flags */
        int order;                      /* allocation order */
        nodemask_t *nodemask;           /* node mask for NUMA */
        struct mem_cgroup *target_mem_cgroup; /* cgroup target */
        int priority;                   /* scan priority */
        unsigned int may_writepage:1;   /* may we write back pages? */
        unsigned int may_unmap:1;       /* may we unmap pages? */
        unsigned int may_swap:1;        /* may we swap? */
        unsigned int hibernation_mode:1;
        unsigned int compaction_ready:1;
        unsigned long nr_scanned;       /* pages examined */
        unsigned long nr_reclaimed;     /* pages actually reclaimed */
};

The fields may_writepage, may_unmap and may_swap are set according to the reclamation context (kswapd, direct reclaim, etc.).
2.5 Kernel Configuration Parameters
/proc/sys/vm/zone_reclaim_mode: Controls zone reclaim via a bitmask (0 disables it, 1 enables reclaim, 2 additionally allows writeback, 4 allows unmapping and swapping). /proc/sys/vm/laptop_mode: Affects writeback during direct reclaim (0 allows it, non-zero disallows it). /proc/sys/vm/swappiness: Determines the balance between reclaiming anonymous and file pages (0 reclaims only file pages; higher values favour anonymous pages, up to 200 on recent kernels; the upstream default is 60).
3. Reclamation Workflows
3.1 Periodic Scanning (kswapd)
kswapd is a per-node kernel thread that scans zones in the background. It uses the three watermarks to decide what to do: it is woken when an allocation finds free pages below WMARK_LOW and keeps reclaiming until free pages rise above WMARK_HIGH; if free pages fall all the way below WMARK_MIN, the allocating process must fall back to direct reclaim.
3.2 Direct Reclaim
When a process requests memory and the allocator cannot satisfy the request, the kernel performs direct reclaim. The requesting process is blocked while the kernel tries to free enough pages to satisfy the allocation. Direct reclaim is more aggressive and can increase CPU load.
3.3 OOM Killer
If both fast and direct reclaim fail, the OOM (Out‑of‑Memory) killer selects a process to terminate based on a score that accounts for memory usage, importance, and other factors, freeing a large amount of memory.
4. What Memory Is Reclaimed?
4.1 Page Cache
Page cache stores copies of disk data in RAM to avoid slow disk I/O. When a process reads a file, the kernel first checks the page cache (cache hit). If the data is not present (cache miss), it reads from disk and populates the cache.
4.2 From Radix Tree to XArray
Historically the page cache used a radix tree in which each node holds 64 slots, so a tree of depth d can index 64^d pages (depth 1 → 64 pages, depth 2 → 4096 pages). Modern kernels replace the radix tree with the XArray, a cleaner interface over the same underlying structure.
4.3 Reverse Mapping
To free a page, the kernel must clear all PTEs that reference it. Reverse mapping derives the virtual addresses from the page’s mapping and index fields, allowing the kernel to locate and invalidate the corresponding VMA entries.
5. Detailed Reclamation Process
5.1 Reclaiming a Zone
The entry point for any reclamation is shrink_zone(). It receives a struct scan_control and a target zone. The function may loop multiple times, trying to free at least 2^(order+1) pages to avoid repeated allocations.
5.2 Scanning LRU Vectors
Inside shrink_zone(), the kernel iterates over memory cgroups (memcgs) and calls shrink_lruvec() for each lruvec. The swappiness value influences how many pages are taken from the anonymous versus file LRU lists.
static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
                          struct scan_control *sc)
{
        unsigned long nr[NR_LRU_LISTS];
        unsigned long targets[NR_LRU_LISTS];
        unsigned long nr_to_scan;
        enum lru_list lru;
        unsigned long nr_reclaimed = 0;

        /* Compute how many pages each LRU list should scan */
        get_scan_count(lruvec, swappiness, sc, nr);
        memcpy(targets, nr, sizeof(nr));

        /* Loop while there are pages left to scan in the target lists */
        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
               nr[LRU_INACTIVE_FILE]) {
                /* Scan each list (at most SWAP_CLUSTER_MAX = 32 pages per pass) */
                for_each_evictable_lru(lru) {
                        if (nr[lru]) {
                                nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
                                nr[lru] -= nr_to_scan;
                                nr_reclaimed += shrink_list(lru, nr_to_scan,
                                                            lruvec, sc);
                        }
                }
                /* Additional logic adjusts the scan targets based on
                   how many pages were reclaimed ... */
        }

        /* If the inactive anon list is low, pull pages from the active list */
        if (inactive_anon_is_low(lruvec))
                shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc,
                                   LRU_ACTIVE_ANON);

        /* Throttle writeback if too many dirty pages are being written */
        throttle_vm_writeout(sc->gfp_mask);
}

The function processes four evictable LRU lists: active anon, inactive anon, active file, inactive file. It never touches the unevictable list.
5.3 Handling Active LRU Lists
When an inactive list runs low, shrink_active_list() isolates pages from the tail of the corresponding active list, checks whether they have been referenced recently, and either keeps them active or moves them to the inactive list; pages whose reference count drops to zero are freed. The function respects the may_unmap and may_writepage flags.
static void shrink_active_list(unsigned long nr_to_scan,
                               struct lruvec *lruvec,
                               struct scan_control *sc,
                               enum lru_list lru)
{
        unsigned long nr_taken, nr_scanned, vm_flags;
        isolate_mode_t isolate_mode = 0;
        LIST_HEAD(l_hold);
        LIST_HEAD(l_active);
        LIST_HEAD(l_inactive);

        /* Drain per-CPU pagevecs */
        lru_add_drain();

        /* Isolate pages from the tail of the active list */
        nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
                                     &nr_scanned, sc, isolate_mode, lru);

        /* Process each isolated page */
        while (!list_empty(&l_hold)) {
                struct page *page = lru_to_page(&l_hold);
                list_del(&page->lru);
                if (unlikely(!page_evictable(page))) {
                        putback_lru_page(page);
                        continue;
                }
                if (page_referenced(page, 0, sc->target_mem_cgroup,
                                    &vm_flags)) {
                        /* Recently used - keep it active */
                        list_add(&page->lru, &l_active);
                } else {
                        /* Not recently used - move to inactive */
                        ClearPageActive(page);
                        list_add(&page->lru, &l_inactive);
                }
        }

        /* Return pages to their proper LRU lists */
        move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
        move_active_pages_to_lru(lruvec, &l_inactive, &l_hold,
                                 lru - LRU_ACTIVE);

        /* Free any pages whose refcount dropped to zero */
        free_hot_cold_page_list(&l_hold, true);
}

5.4 Handling Inactive LRU Lists
shrink_inactive_list() isolates pages from an inactive list, attempts to unmap them, writes back dirty pages, and finally frees pages whose reference count reaches zero. It returns the number of pages reclaimed.
static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
                                          struct lruvec *lruvec,
                                          struct scan_control *sc,
                                          enum lru_list lru)
{
        LIST_HEAD(page_list);
        unsigned long nr_taken, nr_scanned, nr_reclaimed = 0;
        unsigned long nr_dirty, nr_unqueued_dirty, nr_congested;
        unsigned long nr_writeback, nr_immediate;
        isolate_mode_t isolate_mode = 0;
        struct zone *zone = lruvec_zone(lruvec);

        /* Isolate pages from the inactive list */
        nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
                                     &nr_scanned, sc, isolate_mode, lru);

        /* Try to reclaim each page */
        nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
                                        &nr_dirty, &nr_unqueued_dirty,
                                        &nr_congested, &nr_writeback,
                                        &nr_immediate, false);

        /* Return pages that could not be reclaimed back to the LRU */
        putback_inactive_pages(lruvec, &page_list);

        return nr_reclaimed;
}

5.5 Page-Level Reclamation
shrink_page_list() walks the list of isolated pages and, for each page:
Checks whether the page is under writeback (PG_writeback).
Determines whether the page was recently referenced (via the accessed bit in its PTEs or PG_referenced).
If the page is anonymous and not yet in the swap cache, adds it to the swap cache and marks it dirty.
Unmaps all user mappings (respecting may_unmap).
For dirty pages, invokes page->mapping->a_ops->writepage() to start asynchronous writeback (respecting may_writepage).
After successful writeback, removes the page from its address-space radix tree and frees it once its reference count reaches zero.
Pages that cannot be reclaimed are placed back onto their original LRU list (active or inactive) for future scans.
6. Impact on System Performance
6.1 Memory Reclamation and Responsiveness
Delayed reclamation leads to memory pressure, causing allocation failures and visible latency in user‑space applications (e.g., browsers, office suites, database queries). Timely kswapd activity keeps free pages above the low watermark, preventing such stalls.
6.2 Memory Reclamation and Stability
When free memory is exhausted, the kernel may repeatedly invoke direct reclaim and eventually the OOM killer, which can terminate critical services and crash the system. Proper tuning of watermarks and reclamation aggressiveness avoids these catastrophic scenarios.
7. Optimising the Reclamation Mechanism
7.1 Adjusting Zone Watermarks (min_free_kbytes)
Increasing min_free_kbytes raises WMARK_MIN, and with it the derived low and high watermarks, so kswapd starts reclaiming earlier and direct reclaim becomes less likely. However, setting it too high permanently reserves memory that could otherwise serve normal allocations.
7.2 Tuning Swappiness
The swappiness parameter (0-100 on most kernels) controls the preference for reclaiming anonymous pages versus file pages. Low values (e.g., 10) keep anonymous pages in RAM and reclaim the page cache first, favouring latency-sensitive workloads; high values (e.g., 90) swap anonymous pages out aggressively, which can help on systems with ample swap but may increase I/O latency.
7.3 Other Useful Parameters
/proc/sys/vm/zone_reclaim_mode: Enables per‑zone reclaim and writeback. /proc/sys/vm/laptop_mode: Disables writeback during direct reclaim when non‑zero. /proc/sys/vm/overcommit_memory and /proc/sys/vm/overcommit_ratio: Influence allocation behaviour under memory pressure.
By carefully balancing these knobs, administrators can achieve a responsive and stable Linux system even under heavy memory load.
