From xarray to Swap Table: Streamlining Linux Swapcache Management
This article analyzes how the newly introduced swap table replaces the kernel's xarray-based swapcache, offering O(1) lookups, better cache locality, reduced lock contention, and lower memory overhead, with benchmarks showing notable gains in real workloads.
Background
The Linux kernel represents the file page cache with an xarray, a tree-like structure that provides O(log N) lookup over sparse offsets. For a large address space the tree avoids allocating metadata for unmapped regions, but when the managed range is small (e.g., 2 MiB) the tree's overhead outweighs its benefits.
The swapcache maps a swap offset to its swap slot and, when the page is resident in memory, to the corresponding folio. Historically, each 64 MiB region of a swap device was described by an xarray tree, which can reach three levels deep and requires a hierarchy of locks.
Swap table design
The swap table replaces the per-64 MiB xarray with a per-2 MiB cluster table. Each cluster contains 512 entries of type unsigned long, each entry holding a folio pointer, a shadow entry for refault detection, or NULL. A cluster table is allocated as follows:
static int swap_table_alloc_table(struct swap_cluster_info *ci)
{
        WARN_ON(ci->table);
        ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER,
                            GFP_KERNEL);
        if (!ci->table)
                return -ENOMEM;
        return 0;
}

On 64-bit architectures (arm64, x86_64) the 512-entry table occupies exactly one 4 KiB page; on 32-bit it occupies 2 KiB.
Cluster identification and slot selection are derived from the swap offset:
static inline struct swap_cluster_info *swp_cluster(swp_entry_t entry)
{
        return swp_offset_cluster(swp_info(entry), swp_offset(entry));
}

static inline unsigned int swp_cluster_offset(swp_entry_t entry)
{
        return swp_offset(entry) % SWAPFILE_CLUSTER;
}

Lookup performance
Because an entire cluster table resides in a single page, lookup is O(1): a direct array index, compared with the O(log N) tree traversal that the xarray requires. The reduced indirection also improves cache locality: intensive swap activity confined to a single cluster touches the same page repeatedly, lowering cache-miss rates.
Locking and contention
The original xarray spans 64 MiB and requires an address-space lock on top of the per-cluster locks. Because a swap table covers only a 2 MiB cluster, a single cluster-level lock suffices, eliminating the broader lock hierarchy and reducing lock contention.
Dynamic allocation and memory footprint
Patch 8/9 of the series introduces dynamic allocation and release: when all 512 slots of a cluster become unused, the whole table is freed. In the worst case each cluster holds exactly one used slot, so the overhead is one page per cluster. For a typical mobile configuration (e.g., 5 GiB of zRAM, i.e., 2,560 clusters) the maximum swap-table memory is about 10 MiB.
PFN and swap‑count optimization
Patch 24/28 proposes storing the page frame number (PFN) and swap count directly in the table entry, replacing the per-slot unsigned char counter. This change would further shrink memory usage and simplify the data layout.
Benchmarks
Server-oriented workloads show noticeable throughput gains, because reduced lock contention and better locality translate into higher swapcache query and modification rates. In a kernel-build benchmark the swap table yields a measurable speedup. On embedded devices such as smartphones the impact on user-visible latency is smaller, since the critical path is dominated by zRAM compression speed, lock latency, and UI scheduling, but the architectural improvements remain valuable for scalability.
Swapcache usage scenarios
Memory reclamation: after a page's PTE has been replaced by a swap entry, a subsequent page fault may still find the page resident via the swapcache.
Shared swap entries after fork: parent and child processes share the same swap entry; the first process to swap‑in populates swapcache, allowing the other to hit the cache immediately.
Swap‑in readahead: a small region around a faulting PTE is prefetched into swapcache, enabling fast hits when the actual fault occurs.
References
https://lore.kernel.org/linux-mm/[email protected]/
https://lore.kernel.org/linux-mm/[email protected]/