
How TencentOS Engineers Revamped Linux Swap for 5‑20% Performance Gains

This article translates and consolidates three LWN analyses of the Linux swap subsystem modernization led by TencentOS kernel engineer Kairui Song, covering the introduction of swap tables, the removal of the swap map, virtual swap concepts, the code changes involved, performance improvements of up to 20%, and the broader impact on the kernel community.

Tencent Technical Engineering

The Linux kernel’s swap subsystem has long been a complex and performance‑critical part of memory management. Over the past 18 months, TencentOS kernel engineer Kairui Song and collaborators have undertaken a systematic redesign that introduces a new swap table, removes the legacy swap map, and explores virtual swap concepts.

Background

Traditional swap handling relied on the XArray‑based swapper_spaces structure and a per‑slot swap_map byte array, leading to high lock contention and memory overhead. The first phase, merged into Linux 6.18, replaced the XArray with a swap table and introduced swap_cluster_info to improve locality and reduce metadata usage.

1. Introducing the Swap Table

Each swap slot is now represented by a 64‑bit (or 32‑bit on 32‑bit arches) entry in a dynamically allocated array:

typedef struct { unsigned long val; } swp_entry_t;

The array pointer is stored in the new table field:

atomic_long_t __rcu *table;

This design eliminates the XArray lookup, shrinks per-slot metadata to a few bytes (a reported reduction of roughly 30%), and lets the kernel allocate the array lazily, only when a cluster is actually used.
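The lazy-allocation idea can be sketched in a few lines. This is purely illustrative: the struct and function names (swap_cluster, cluster_table, CLUSTER_SLOTS) are assumptions for the example, not the kernel's actual identifiers, and the real code uses RCU-protected atomics rather than a plain pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative sketch: a per-cluster entry table allocated on first
 * use. Names are hypothetical, not the kernel's implementation. */
#define CLUSTER_SLOTS 512

struct swap_cluster {
	unsigned long *table;   /* NULL until the cluster is first used */
};

/* Return the cluster's entry array, allocating it on first access. */
unsigned long *cluster_table(struct swap_cluster *ci)
{
	if (!ci->table)
		ci->table = calloc(CLUSTER_SLOTS, sizeof(unsigned long));
	return ci->table;
}
```

An unused cluster thus costs only a NULL pointer; the table, with every slot in the "empty" state, appears only when the cluster is touched.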

Swap Entry Layout

Bits in the entry encode the slot state:

0 – empty slot (NULL)

1 – shadow entry for a swapped‑out folio (high bits store reference count)

2 – resident folio (high bits store PFN)

3 – unused pointer entry

4 – bad slot marker

By moving reference‑count tracking into the table, the separate swap map can be removed entirely.
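The tagged-entry scheme above can be illustrated with a toy encoding. The exact kernel bit layout differs; the state values, bit widths, and helper names here are assumptions for the sketch, showing only the general idea of low-bit tags with a payload (count or PFN) in the high bits.

```c
#include <assert.h>

/* Hypothetical tagged-entry sketch: low bits select the slot state
 * from the list above, high bits carry the payload (reference count
 * or PFN). Bit widths and values are illustrative assumptions. */
enum slot_state {
	SLOT_EMPTY   = 0,	/* empty slot (NULL)                    */
	SLOT_SHADOW  = 1,	/* shadow entry for a swapped-out folio */
	SLOT_FOLIO   = 2,	/* resident folio, payload is the PFN   */
	SLOT_POINTER = 3,	/* unused pointer entry                 */
	SLOT_BAD     = 4	/* bad slot marker                      */
};

#define STATE_BITS 3
#define STATE_MASK ((1UL << STATE_BITS) - 1)

static inline unsigned long entry_make(enum slot_state s, unsigned long payload)
{
	return (payload << STATE_BITS) | (unsigned long)s;
}

static inline enum slot_state entry_state(unsigned long e)
{
	return (enum slot_state)(e & STATE_MASK);
}

static inline unsigned long entry_payload(unsigned long e)
{
	return e >> STATE_BITS;
}
```

Because state and reference count share one word, updating both is a single atomic store, which is what allows the separate swap-map byte to go away.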

2. Removing the Swap Map

The legacy swap_map was an unsigned char * array storing per-slot usage counts and special bits such as SWAP_HAS_CACHE (0x40). Its removal eliminates the extra byte-per-slot overhead and the complex bit-lock logic used for swap-in synchronization.
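For context, here is a minimal model of the byte-per-slot scheme being removed: the 0x40 bit (SWAP_HAS_CACHE, quoted above) flags swap-cache residency, and the low bits hold the usage count. The helper names are invented for the illustration, and the real kernel logic additionally handled count overflow and per-slot bit-locking.

```c
#include <assert.h>

/* Model of the legacy swap_map byte: SWAP_HAS_CACHE (0x40) plus a
 * usage count in the low bits. Helper names are illustrative; the
 * real code also handled count overflow and bit-lock waiting. */
#define SWAP_HAS_CACHE 0x40
#define COUNT_MASK     0x3f

static inline int map_count(unsigned char m)     { return m & COUNT_MASK; }
static inline int map_has_cache(unsigned char m) { return !!(m & SWAP_HAS_CACHE); }

/* Take one more reference on a slot (overflow handling omitted). */
static inline unsigned char map_dup(unsigned char m)
{
	return (unsigned char)(m + 1);	/* count lives in the low bits */
}
```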

Performance measurements reported by the author show throughput and RPS improvements of roughly 5-20% after the first phase, mainly due to the elimination of XArray lookups and reduced lock contention.

3. Virtual Swap and GhostSwap

Beyond the swap table, the community is exploring virtual swap layers. Meta has proposed a virtual swap space that abstracts physical devices, while TencentOS has introduced a Virtual GhostSwap implementation based on Google's GhostSwap idea. Both approaches use a unified swp_desc structure:

struct swp_desc {
    union {
        swp_slot_t slot;
        struct zswap_entry *zswap_entry;
    };
    union {
        struct folio *swap_cache;
        void *shadow;
    };
    unsigned int swap_count;
    unsigned short memcgid:16;
    bool in_swapcache:1;
    enum swap_type type:2;
};

This structure can represent a real device slot, a zero‑filled page, a zswap entry, or a resident folio, enabling flexible migration between devices and eliminating the need to scan page tables when a swap device is removed.
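How code might dispatch on such a descriptor can be sketched with a simplified, self-contained stand-in. The enum values, the simplified union members, and the desc_backing() helper are assumptions for the example; they are not taken from the actual patches.

```c
#include <assert.h>
#include <string.h>

/* Self-contained sketch of dispatching on a virtual swap
 * descriptor's type. Simplified stand-ins for swp_slot_t and
 * struct zswap_entry; enum values and the helper are illustrative. */
enum swap_type { SWAP_TYPE_DEVICE, SWAP_TYPE_ZSWAP, SWAP_TYPE_ZERO };

struct swp_desc_sketch {
	union {
		unsigned long slot;	/* physical device slot      */
		void *zswap_entry;	/* compressed in-memory copy */
	};
	enum swap_type type;
};

/* Describe where the swapped data lives for a given descriptor. */
static const char *desc_backing(const struct swp_desc_sketch *d)
{
	switch (d->type) {
	case SWAP_TYPE_DEVICE:	return "device";
	case SWAP_TYPE_ZSWAP:	return "zswap";
	case SWAP_TYPE_ZERO:	return "zero-filled";
	}
	return "unknown";
}
```

Because callers consult only the descriptor, the backing can be changed (say, from a zswap entry to a device slot) without walking page tables, which is the property that simplifies device removal.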

Design Trade‑offs

The virtual-swap design increases per-entry memory from 8 bytes to as much as 32 bytes and adds complexity, but it simplifies device removal and supports tiered swap configurations (e.g., a fast NVMe tier plus a slower HDD tier) as proposed by Youngjun Park.
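A toy illustration of the tiering idea (the selection rule and names here are invented for the example; the actual proposal involves per-cgroup policy, not a single global rule): swap-outs prefer the fast tier while it has free slots, then overflow to the slow tier.

```c
#include <assert.h>
#include <stddef.h>

/* Toy tiered-swap selection: prefer the fast tier while it has
 * free slots, fall back to the slow tier. Entirely illustrative. */
struct swap_tier {
	const char *name;
	long free_slots;
};

/* Pick a tier for a new swap-out; NULL means both tiers are full. */
static struct swap_tier *pick_tier(struct swap_tier *fast,
				   struct swap_tier *slow)
{
	if (fast->free_slots > 0)
		return fast;
	if (slow->free_slots > 0)
		return slow;
	return NULL;
}
```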

4. Community Impact

The patches have been reviewed on the Linux‑MM mailing list, with contributions from Google’s Chris Li and Meta’s Pham. Discussions cover performance regressions, memory‑usage concerns, and the interaction with existing subsystems such as zswap and swap‑cgroup. The work also paves the way for future extensions, including integrating memory‑controller limits into the swap table.

5. Future Outlook

Stage 3 of the swap‑table project aims to eliminate the remaining swap‑map responsibilities entirely. Additional work on MGLRU page‑reclaim logic shows up to 30 % performance gains on HDD‑bound workloads and significant OOM reductions. All patches are awaiting final integration into the mainline kernel.

[Figure: Swap modernization diagram]

[Figure: Swap table data layout]

Tags: performance optimization, memory management, Linux kernel, swap subsystem, swap table, virtual swap
Written by

Tencent Technical Engineering

Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.
