How Modernizing Linux Swap Boosts Performance and Cuts Memory Overhead
This article translates and consolidates Jonathan Corbet’s three-part “Modernizing swapping” series, detailing the introduction of swap tables, removal of swap maps, and virtual swap concepts that together improve Linux kernel swap performance by up to 20%, reduce metadata memory by up to 30%, and simplify the codebase.
Background
The Linux kernel swap subsystem has become notoriously complex over decades, accumulating technical debt that hampers performance and maintainability. Starting at the 2025 Linux Storage, Filesystem, Memory Management & BPF summit, TencentOS kernel engineer Kairui Song led a series of systematic refactorings that introduced a swap table, removed the swap map, and explored virtual swap concepts.
1. Introducing the Swap Table
Swap handles anonymous pages that lack a natural backing store by writing them to persistent storage. Historically, swap entries were indexed via struct swap_info_struct and stored in an XArray‑based swapper_spaces structure, leading to fragmented metadata and lock contention.
The new swap table replaces the XArray with a simple, cache‑friendly array (often fitting in a single physical page). Each entry encodes the state of a swap slot: empty, occupied by a resident folio, or holding a shadow entry for a swapped‑out page. This redesign eliminates the need for the separate swapper_spaces and reduces metadata memory by roughly 30% (≈256 MiB saved for a 1 TiB swap file).
typedef struct { unsigned long val; } swp_entry_t;
By consolidating swap metadata, the kernel gains 5‑20% throughput improvements in benchmarks and real‑world workloads, primarily due to the removal of costly XArray lookups and reduced lock contention.
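The three slot states described above (empty, resident folio, shadow entry) can be packed into a single machine word per slot. The following is a minimal sketch of that idea, not the kernel's actual encoding: the `swp_te_*` names, the choice of bit 0 as the shadow tag, and the 4 KiB page size are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical slot states; the real kernel encoding may differ. */
enum swp_te_state { SWP_TE_EMPTY, SWP_TE_FOLIO, SWP_TE_SHADOW };

/* One word per swap slot (hypothetical type name). */
typedef struct { uintptr_t val; } swp_te_t;

/* Assumption: bit 0 tags a shadow entry; aligned pointers leave it clear. */
#define SWP_TE_SHADOW_BIT ((uintptr_t)1)

/* Assumption: a table is sized to fill exactly one 4 KiB page. */
#define SWP_TABLE_SLOTS (4096u / sizeof(swp_te_t))

/* A resident folio is stored as its (aligned, non-NULL) pointer. */
static inline swp_te_t swp_te_from_folio(const void *folio)
{
    uintptr_t p = (uintptr_t)folio;
    assert(p != 0 && (p & SWP_TE_SHADOW_BIT) == 0);
    return (swp_te_t){ p };
}

/* A shadow value is shifted left by one and tagged with bit 0. */
static inline swp_te_t swp_te_from_shadow(uintptr_t shadow)
{
    assert(shadow <= (UINTPTR_MAX >> 1));
    return (swp_te_t){ (shadow << 1) | SWP_TE_SHADOW_BIT };
}

/* Classify an entry: zero is empty, tagged is shadow, otherwise folio. */
static inline enum swp_te_state swp_te_classify(swp_te_t e)
{
    if (e.val == 0)
        return SWP_TE_EMPTY;
    return (e.val & SWP_TE_SHADOW_BIT) ? SWP_TE_SHADOW : SWP_TE_FOLIO;
}

/* Recover the shadow payload from a tagged entry. */
static inline uintptr_t swp_te_shadow(swp_te_t e)
{
    assert(e.val & SWP_TE_SHADOW_BIT);
    return e.val >> 1;
}
```

Because a lookup is a plain array index followed by a bit test, there is no tree walk of the kind an XArray requires, which is one plausible reading of why the flat layout is described as cache‑friendly.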
2. Removing the Swap Map
The swap map is a byte‑array tracking reference counts for each swap slot. Special bits (e.g., SWAP_HAS_CACHE) are used for synchronization and to indicate whether a slot’s page still resides in RAM. This design introduces complexity, especially for fast devices like ZRAM, where bypassing the swap cache can be beneficial.
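To make the byte layout concrete, here is a small sketch of how one swap‑map byte can carry both a reference count and a cache flag. The constants are illustrative and modeled on the legacy layout; the exact values, the mask, and the helper names are assumptions rather than a copy of the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values (assumptions): one flag bit plus a small count field. */
#define SWAP_HAS_CACHE  0x40u  /* flag: the slot's page is in the swap cache */
#define SWAP_COUNT_MASK 0x3fu  /* hypothetical mask for the reference count */

/* Number of references (e.g. PTEs) recorded for this slot. */
static inline unsigned int swap_map_count(unsigned char ent)
{
    return ent & SWAP_COUNT_MASK;
}

/* True if the slot's page still has a copy in the swap cache. */
static inline bool swap_map_has_cache(unsigned char ent)
{
    return (ent & SWAP_HAS_CACHE) != 0;
}

/* A slot is free only when it has neither references nor a cached page. */
static inline bool swap_map_slot_free(unsigned char ent)
{
    return ent == 0;
}

/* Take one more reference; the caller must handle a full count field. */
static inline unsigned char swap_map_dup(unsigned char ent)
{
    assert(swap_map_count(ent) < SWAP_COUNT_MASK);
    return (unsigned char)(ent + 1);
}
```

Mixing a flag and a counter in one byte is compact, but every reader has to mask before comparing, which hints at the kind of complexity the article attributes to this design.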
By moving reference‑count tracking into the swap table entries themselves, the swap map can be eliminated. The new 64‑bit (or 32‑bit on 32‑bit arches) entry stores the count directly, and overflow handling allocates an auxiliary array per cluster. This change cuts swap‑map‑related metadata by about 30% and simplifies the code path for swap‑in and swap‑out operations.
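The overflow scheme can be sketched as follows: a few bits of each table entry hold the count inline, and a per‑cluster auxiliary array is allocated only once some slot's inline count saturates. The field width, the cluster size, and all names here are hypothetical; the sketch only illustrates the mechanism described above.

```c
#include <assert.h>
#include <stdlib.h>

#define CLUSTER_SLOTS 512u                    /* assumed slots per cluster */
#define COUNT_BITS    4u                      /* hypothetical inline width */
#define COUNT_MAX     ((1u << COUNT_BITS) - 1u) /* saturated: see overflow */

struct swap_cluster {
    unsigned long table[CLUSTER_SLOTS]; /* low COUNT_BITS hold the count */
    unsigned int *overflow;             /* lazily allocated, NULL at first */
};

/* Read the count; a saturated inline field defers to the overflow array. */
static unsigned int slot_count(const struct swap_cluster *c, unsigned int i)
{
    assert(i < CLUSTER_SLOTS);
    unsigned int inl = (unsigned int)(c->table[i] & COUNT_MAX);
    return inl < COUNT_MAX ? inl : c->overflow[i];
}

/* Take a reference. Returns 0 on success, -1 if allocation fails. */
static int slot_get(struct swap_cluster *c, unsigned int i)
{
    assert(i < CLUSTER_SLOTS);
    unsigned int inl = (unsigned int)(c->table[i] & COUNT_MAX);
    if (inl + 1 < COUNT_MAX) {          /* still fits inline, no carry */
        c->table[i]++;
        return 0;
    }
    if (!c->overflow) {
        c->overflow = calloc(CLUSTER_SLOTS, sizeof(*c->overflow));
        if (!c->overflow)
            return -1;
    }
    if (inl < COUNT_MAX) {              /* crossing into the overflow array */
        c->table[i] |= COUNT_MAX;
        c->overflow[i] = COUNT_MAX;
    } else {
        c->overflow[i]++;
    }
    return 0;
}

/* Drop a reference, moving the count back inline once it fits again. */
static void slot_put(struct swap_cluster *c, unsigned int i)
{
    assert(i < CLUSTER_SLOTS);
    unsigned int inl = (unsigned int)(c->table[i] & COUNT_MAX);
    assert(inl > 0);
    if (inl < COUNT_MAX) {
        c->table[i]--;
        return;
    }
    if (--c->overflow[i] < COUNT_MAX) {
        c->table[i] = (c->table[i] & ~(unsigned long)COUNT_MAX)
                      | (COUNT_MAX - 1u);
        c->overflow[i] = 0;
    }
}
```

The point of the design is that most slots have small counts, so the auxiliary array is rarely allocated and the common case costs no memory beyond the table entry itself.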
3. Virtual Swap Space
Both Meta and TencentOS have proposed virtual swap layers. Meta’s design introduces a swp_desc structure that abstracts the physical slot, allowing a unified virtual address space independent of underlying devices. TencentOS’s proposal builds on Google’s GhostSwap concept, using a similar descriptor to support dynamic, on‑demand swap allocation instead of pre‑allocating static slots.
struct swp_desc {
    union {
        swp_slot_t slot;
        struct zswap_entry *zswap_entry;
    };
    union {
        struct folio *swap_cache;
        void *shadow;
    };
    unsigned int swap_count;
    unsigned short memcgid:16;
    bool in_swapcache:1;
    enum swap_type type:2;
};
This virtual layer enables pages to migrate between devices without updating every PTE, reduces wasted space for ZRAM, and allows a single swp_desc to represent zero‑filled pages, real swap slots, or ZRAM entries. The trade‑off is increased memory usage (up to four‑fold per entry) and added complexity, which has shown mixed performance results in early benchmarks.
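The benefit of the indirection is easiest to see in a toy model: page table entries store only a virtual index, and a descriptor table maps that index to whatever currently backs the page. Everything in this sketch (the `vswap_*` names, the backend list, the fixed‑size table) is hypothetical and much simpler than the swp_desc structure shown above.

```c
#include <assert.h>

/* Hypothetical backend kinds for a virtual swap slot. */
enum vswap_backend {
    VSWAP_NONE,
    VSWAP_ZERO,   /* zero-filled page, no storage needed */
    VSWAP_ZSWAP,  /* compressed copy in memory */
    VSWAP_DISK    /* slot on a physical swap device */
};

/* Simplified descriptor: what backs the page, and where. */
struct vswap_desc {
    enum vswap_backend type;
    unsigned long location; /* physical slot or handle; unused for ZERO */
};

#define VSWAP_SLOTS 1024u   /* assumed size of the virtual swap space */
static struct vswap_desc vswap_table[VSWAP_SLOTS];

/* What a PTE would hold: only the virtual index, never a device slot. */
typedef struct { unsigned long vidx; } vswap_entry_t;

/* Resolve a PTE's entry to the descriptor that currently backs it. */
static const struct vswap_desc *vswap_resolve(vswap_entry_t e)
{
    assert(e.vidx < VSWAP_SLOTS);
    return &vswap_table[e.vidx];
}

/* Move the backing store; only the descriptor changes, PTEs stay as-is. */
static void vswap_migrate(vswap_entry_t e, enum vswap_backend type,
                          unsigned long location)
{
    assert(e.vidx < VSWAP_SLOTS);
    vswap_table[e.vidx].type = type;
    vswap_table[e.vidx].location = location;
}
```

If two processes share a swapped‑out page, both hold the same virtual index, so a single call to the hypothetical `vswap_migrate()` retargets both at once. The cost is the extra descriptor per entry, which matches the memory trade‑off noted above.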
4. Additional Reforms
Other notable changes include the removal of the static swap_cgroup array (saving ~512 MiB for a 1 TiB device) and the unification of swap‑cluster management, eliminating per‑CPU cluster locks in favor of a single, scalable structure.
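The size of such static per‑slot arrays follows directly from the slot count. As a rough check of the figure above, assuming 4 KiB pages and a 2‑byte record per slot (both assumptions, as are the helper names), the arithmetic can be sketched as:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ull /* assumed page size */

/* Number of page-sized swap slots in a swap area of the given size. */
static inline uint64_t swap_slots(uint64_t swap_bytes)
{
    return swap_bytes / PAGE_SIZE_BYTES;
}

/* Size in MiB of a static array holding bytes_per_slot for every slot. */
static inline uint64_t per_slot_array_mib(uint64_t swap_bytes,
                                          uint64_t bytes_per_slot)
{
    return (swap_slots(swap_bytes) * bytes_per_slot) >> 20;
}
```

For a 1 TiB device this gives 2^28 slots, so a 2‑byte‑per‑slot array occupies 512 MiB, and a 1‑byte‑per‑slot array occupies 256 MiB. Both costs are paid up front whether or not the slots are ever used.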
Community contributions such as Youngjun Park’s “swap tiers” patch set propose tiered swap devices, allowing high‑performance media to serve latency‑sensitive workloads while slower devices handle bulk swapping. This complements the swap‑table evolution and may converge in future kernel releases.
5. Outlook
Beyond swap modernization, the TencentOS team continues to improve memory management, including MGLRU enhancements that deliver up to 30% performance gains on HDD‑bound workloads and significantly reduce OOM incidents. These patches are pending review for inclusion in the mainline kernel.
Tencent Architect
We share insights on storage, computing, networking and explore leading industry technologies together.
