Memory Compaction in the Linux Kernel: Mechanisms, Strategies, and Implementation Details
Linux's memory compaction mitigates external fragmentation by moving movable pages to create larger contiguous free areas. The kernel employs four strategies: direct, passive (kcompactd), proactive, and active. Each invokes the compact_zone core with its own compact_control parameters, drives the migrate-page and free-page scanners, and has distinct trigger and exit conditions.
1. Introduction
The buddy allocator is the kernel's basic physical page allocator. It is efficient and simple, but it cannot completely eliminate external fragmentation: free memory may be plentiful while no contiguous block of the requested order exists. When fragmentation prevents such an allocation (for example, an order-2 request for 4 contiguous pages), the system must resort to memory compaction.
Compaction works by moving movable pages to create larger free areas. The kernel provides a generic migrate_pages interface; memory compaction is just one of its applications.
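As a toy illustration of the problem compaction solves, the check below (hypothetical code, not kernel code) shows how a zone can hold plenty of free pages yet have no aligned high-order block:

```c
#include <stdbool.h>

/*
 * Toy illustration (not kernel code): a zone with many free pages can
 * still fail an order-2 request if no naturally aligned run of 4 pages
 * is entirely free. page_free[i] is true when page i is free.
 */
static bool has_free_block(const bool *page_free, int nr_pages, int order)
{
    int run = 1 << order;

    for (int base = 0; base + run <= nr_pages; base += run) {
        bool all_free = true;
        for (int i = 0; i < run; i++) {
            if (!page_free[base + i]) {
                all_free = false;
                break;
            }
        }
        if (all_free)
            return true;
    }
    return false;
}
```

With the pattern {1,0,1,1, 1,0,1,1}, six of eight pages are free, yet no order-2 (4-page) block exists; migrating the two used pages next to each other would create one, which is exactly what compaction does.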
2. Memory Compaction Scenarios
The kernel defines four compaction strategies:
Direct compaction
Passive compaction (kcompactd)
Proactive compaction
Active compaction (user‑triggered)
Each strategy ultimately invokes compact_zone but differs in trigger conditions, scope, and intensity.
2.1 Direct Compaction
2.1.1 Trigger Conditions
When the buddy allocator fails to satisfy a request, the slow‑path __alloc_pages_slowpath is entered. If the allocation flag permits direct compaction, the kernel calls __alloc_pages_direct_compact to attempt a first compaction. If it still fails, the allocator may loop through kswapd wake‑up, further retries, and possibly OOM.
2.1.2 Logic Overview
The entry point __alloc_pages_direct_compact invokes try_to_compact_pages, which drives the compaction core: it iterates over the candidate zones, consults compaction_deferred to see whether compaction should be skipped for a zone, and then runs compact_zone_order.
2.1.2.1 Deferred Compaction
The function compaction_deferred decides whether to skip compaction based on two pieces of per-zone state: (A) compact_order_failed, which records the lowest order at which compaction has recently failed (smaller requests are still attempted), and (B) compact_considered together with compact_defer_shift, which implement an exponential backoff on repeated failures. Successful compaction resets this state; each failure widens the backoff window, making future compaction attempts more conservative.
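The backoff described above can be sketched as follows. This is a simplified model of compaction_deferred, defer_compaction, and compaction_defer_reset, with the per-zone fields pulled into a standalone struct; it is not the kernel implementation:

```c
#include <stdbool.h>

#define COMPACT_MAX_DEFER_SHIFT 6

/* Simplified stand-in for the per-zone deferral fields. */
struct zone_defer_state {
    unsigned int compact_considered;   /* attempts seen while deferred        */
    unsigned int compact_defer_shift;  /* backoff exponent, grows on failure  */
    int compact_order_failed;          /* lowest order that failed to compact */
};

/* Returns true when compaction should be skipped for this request. */
static bool compaction_deferred_sketch(struct zone_defer_state *z, int order)
{
    unsigned long defer_limit = 1UL << z->compact_defer_shift;

    /* Orders smaller than the recorded failure are still worth trying. */
    if (order < z->compact_order_failed)
        return false;

    if (++z->compact_considered >= defer_limit) {
        z->compact_considered = defer_limit;
        return false;          /* backoff window elapsed: try again */
    }
    return true;               /* still inside the backoff window */
}

/* Called when a compaction attempt fails: double the backoff window. */
static void defer_compaction_sketch(struct zone_defer_state *z, int order)
{
    z->compact_considered = 0;
    if (z->compact_defer_shift < COMPACT_MAX_DEFER_SHIFT)
        z->compact_defer_shift++;
    if (order < z->compact_order_failed)
        z->compact_order_failed = order;
}

/* Called on success: forget the backoff entirely. */
static void compaction_defer_reset_sketch(struct zone_defer_state *z, int order)
{
    if (order >= z->compact_order_failed)
        z->compact_order_failed = order + 1;
    z->compact_considered = 0;
    z->compact_defer_shift = 0;
}
```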
2.1.2.2 capture_control
During compaction, the kernel may capture freshly freed pages via compaction_capture. If a page freed during the pass matches the target order, it can be "snatched" and handed to the requester immediately, short-circuiting the rest of the compaction run; captures below the pageblock order are not allowed to come from MIGRATE_MOVABLE pageblocks, to avoid polluting them with lower-order allocations.
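A minimal sketch of the capture decision, loosely modeled on compaction_capture; the enum names and the pageblock order value here are simplified assumptions, not the kernel definitions:

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel's migratetypes. */
enum migratetype { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE, MT_CMA, MT_ISOLATE };

#define PAGEBLOCK_ORDER 9   /* typical value on x86-64 with 4 KiB pages */

/*
 * A page freed during compaction is grabbed only if it matches the
 * target order, and low-order captures may not carve up movable
 * pageblocks.
 */
static bool should_capture(int freed_order, int target_order, enum migratetype mt)
{
    if (freed_order != target_order)
        return false;
    if (mt == MT_CMA || mt == MT_ISOLATE)
        return false;          /* these pools are off-limits */
    if (freed_order < PAGEBLOCK_ORDER && mt == MT_MOVABLE)
        return false;          /* keep movable pageblocks intact */
    return true;
}
```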
2.1.3 Characteristics
Target order is fixed, limiting the compaction range.
Both migrate‑page and free‑page scanners use fast scanning.
Watermark checks are performed against the allocation's highest usable zone index (highest_zoneidx).
Priority starts as COMPACT_PRIO_ASYNC and may increase, turning async compaction into sync when retries occur.
2.2 Passive Compaction (kcompactd)
During boot, kcompactd_init creates a kernel thread per NUMA node. The thread sleeps until it is woken up by wakeup_kcompactd, which is triggered by several events, such as repeated kswapd failures, boosted watermarks, or explicit sysctl settings.
kcompactd evaluates each zone with the following checks before invoking compact_zone:
The zone contains managed pages to scan.
Deferral state does not postpone compaction: either the requested order is below the zone's recorded failed order (compact_order_failed), or the exponential backoff window has elapsed.
The current watermarks do not already satisfy the request; if they do, the zone is skipped because no compaction is needed.
For orders above 3, the fragmentation index is not below sysctl_extfrag_threshold; a low index indicates that the failure stems from a shortage of free memory rather than fragmentation, in which case compaction is skipped.
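The fragmentation-index check can be sketched as below, following the formula used by the kernel's __fragmentation_index; the scaling by 1000 matches sysctl_extfrag_threshold (default 500):

```c
/*
 * Sketch of the fragmentation index, modeled on __fragmentation_index().
 * Values are scaled by 1000. An index near 0 means an allocation failure
 * would be due to lack of memory; an index near 1000 means it would be
 * due to fragmentation, so compaction can help.
 */
static int fragmentation_index_sketch(unsigned int order,
                                      unsigned long free_pages,
                                      unsigned long free_blocks_total,
                                      unsigned long free_blocks_suitable)
{
    unsigned long requested = 1UL << order;

    if (!free_blocks_total)
        return 0;                    /* no free memory at all */
    if (free_blocks_suitable)
        return -1000;                /* a suitable block exists: no failure */

    return 1000 - (int)((1000 + (free_pages * 1000 / requested))
                        / free_blocks_total);
}
```

For example, 64 free pages scattered as 64 single pages yield a high index for an order-3 request (fragmentation), while the same pages in one block would report that no failure occurs at all.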
When kswapd finishes a reclamation cycle, it calls wakeup_kcompactd to give the system a chance to compact before further allocations.
2.3 Proactive Compaction
Proactive compaction aims to reduce large-page allocation latency by periodically evaluating node-wide fragmentation for the huge-page order (usually order 9). If the fragmentation score exceeds a configurable threshold (controlled via vm.compaction_proactiveness), the kernel runs a compaction pass.
The evaluation sums per‑zone scores weighted by each zone’s size. When the node score is high, the kernel performs compaction; otherwise it backs off and may increase the wake‑up interval.
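The weighted sum can be sketched as follows; the struct and field names are simplified stand-ins for the kernel's per-zone bookkeeping:

```c
/* Simplified per-zone inputs: size and a 0-100 fragmentation score. */
struct zone_score {
    unsigned long present_pages;
    unsigned int frag_score;
};

/*
 * Node-wide proactive score: each zone's fragmentation score is weighted
 * by the zone's share of the node's pages, loosely following the kernel's
 * fragmentation_score_node(). The result is compared against thresholds
 * derived from vm.compaction_proactiveness.
 */
static unsigned int node_fragmentation_score(const struct zone_score *zones,
                                             int nr_zones,
                                             unsigned long node_present_pages)
{
    unsigned int score = 0;

    for (int i = 0; i < nr_zones; i++)
        score += zones[i].frag_score * zones[i].present_pages
                 / node_present_pages;
    return score;
}
```

A small, heavily fragmented zone therefore contributes little to the node score, while fragmentation in the node's largest zone dominates it.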
2.4 Active Compaction
Users can trigger compaction manually by writing to /proc/sys/vm/compact_memory (the vm.compact_memory sysctl, global) or to the per-node compact node under /sys/devices/system/node/ on NUMA systems. This is the most heavyweight operation because it attempts to compact all zones, or a specific node, completely.
Active compaction uses a compact_control configuration with no target order, synchronous migration mode, and full‑zone scanning.
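Roughly, the configuration for such a manual run might look like the sketch below; the struct is a simplified mirror of a few compact_control fields, not the full kernel definition:

```c
#include <stdbool.h>

enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC };

/* Simplified mirror of the compact_control fields relevant here. */
struct compact_control_sketch {
    int order;                 /* -1 means "no target order": compact fully */
    enum migrate_mode mode;
    bool whole_zone;           /* scan the zone end to end */
    bool ignore_skip_hint;     /* do not honor pageblock skip bits */
};

/* Hypothetical sketch of how manual compaction configures itself. */
static struct compact_control_sketch active_compaction_config(void)
{
    struct compact_control_sketch cc = {
        .order            = -1,
        .mode             = MIGRATE_SYNC,
        .whole_zone       = true,
        .ignore_skip_hint = true,
    };
    return cc;
}
```

The contrast with direct compaction is visible in every field: no fixed order, fully synchronous migration, and no fast-path shortcuts.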
3. Memory Compaction Core
3.1 Parameter Structure (compact_control)
The structure controls many aspects of compaction, such as target order, zone limits, migration mode, and whether fast scanners are enabled. Different scenarios fill the fields differently, which explains the behavioral differences observed in the previous sections.
3.2 Migrate‑Page Scanner
The function isolate_migratepages walks a zone from low to high addresses, selecting a suitable pageblock. It may use a fast path (fast_find_migrateblock) that looks for a pageblock with many free pages of the desired order, setting a skip hint to avoid re-scanning.
For each candidate pageblock, suitable_migration_source checks whether the block may be used as a migration source, and isolate_migratepages_block then isolates the pages: LRU pages and non-LRU movable pages (those marked PG_movable). Pages that are pinned, huge pages (unless alloc_contig is set), and free pages are skipped.
3.3 Free‑Page Scanner
The counterpart isolate_freepages walks from high to low addresses, isolating free pages for use as migration destinations. It also has a fast path (fast_isolate_freepages) that searches the free list of the appropriate order for a suitable block, preferring candidates near the high end of the zone so that the two scanners do not meet prematurely.
When a suitable free page is found, __isolate_free_page removes it from the buddy allocator and adds it to the freepages list in compact_control.
3.4 Compaction Exit Conditions
The function compact_finished decides when a compaction pass ends. Main conditions are:
Scanner meeting (migrate and free scanners cross).
Proactive compaction: the fragmentation score falls below the low threshold derived from vm.compaction_proactiveness.
Direct compaction: a free page of the requested order is available in a pageblock whose migratetype matches the request.
Return codes include COMPACT_CONTINUE, COMPACT_COMPLETE, COMPACT_PARTIAL_SKIPPED, COMPACT_SUCCESS, and COMPACT_CONTENDED, each indicating a different outcome.
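A simplified model of the exit decision follows; the real compact_finished also handles deferral bookkeeping, proactive scores, and free-scanner restarts, so treat this as a sketch of the main branches only:

```c
#include <stdbool.h>

/* Simplified outcome codes mirroring the kernel's enum compact_result. */
enum compact_result_sketch {
    COMPACT_CONTINUE,        /* keep scanning */
    COMPACT_COMPLETE,        /* scanners met: whole zone scanned */
    COMPACT_PARTIAL_SKIPPED, /* the pass stopped before finishing */
    COMPACT_SUCCESS,         /* a suitable free page now exists */
    COMPACT_CONTENDED,       /* aborted due to contention or a signal */
};

/*
 * Sketch of the exit decision: contention aborts the pass, scanners
 * meeting completes it, and a satisfied target order is a success;
 * otherwise compaction continues.
 */
static enum compact_result_sketch
compact_finished_sketch(bool scanners_met, int order, bool order_satisfied,
                        bool contended)
{
    if (contended)
        return COMPACT_CONTENDED;
    if (scanners_met)
        return COMPACT_COMPLETE;
    if (order >= 0 && order_satisfied)
        return COMPACT_SUCCESS;   /* a page of the requested order is free */
    return COMPACT_CONTINUE;
}
```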
4. Summary and Statistics
Memory compaction is a heavy‑weight fragmentation mitigation technique. The kernel provides four entry points (direct, passive, proactive, active) that differ in trigger, scope, and aggressiveness. All of them share the same core logic driven by compact_control , migrate‑page scanner, free‑page scanner, and exit‑condition checks.
Statistics about compaction activity can be read from /proc/vmstat (the compact_* counters), and tunables are exposed under /proc/sys/vm/ and /sys/kernel/debug/extfrag/.
OPPO Kernel Craftsman
Sharing Linux kernel-related cutting-edge technology, technical articles, technical news, and curated tutorials