From xarray to Swap Table: Streamlining Linux Swapcache Management
This article analyzes how the newly introduced swap table replaces the kernel's xarray-based swapcache, offering O(1) lookups, better cache locality, reduced lock contention, and lower memory overhead, with benchmarks showing notable gains in real workloads.
Background
The Linux kernel represents the file page cache with an xarray, a tree-like structure that provides O(log N) lookup over sparse offsets. For a large address space the tree avoids allocating metadata for unmapped regions, but when the managed range is small (e.g., 2 MiB) the tree's overhead outweighs its benefits.
The swapcache maps a swap offset to its swap slot and, when the page is resident in memory, to the corresponding folio. Historically, each 64 MiB region of a swap device was described by an xarray tree, which can reach three levels deep and requires a hierarchy of locks.
Swap table design
The swap table replaces the per-64 MiB xarray with a per-2 MiB cluster table. Each cluster contains 512 entries of type unsigned long, each entry holding a folio pointer, a shadow entry for refault detection, or NULL. A cluster table is allocated as follows:
static int swap_table_alloc_table(struct swap_cluster_info *ci)
{
        WARN_ON(ci->table);
        ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER,
                            GFP_KERNEL);
        if (!ci->table)
                return -ENOMEM;
        return 0;
}

On 64-bit architectures (arm64, x86_64) the 512-entry table occupies exactly one 4 KiB page; on 32-bit it occupies 2 KiB.
Cluster identification and slot selection are derived from the swap offset:
static inline struct swap_cluster_info *swp_cluster(swp_entry_t entry)
{
        return swp_offset_cluster(swp_info(entry), swp_offset(entry));
}

static inline unsigned int swp_cluster_offset(swp_entry_t entry)
{
        return swp_offset(entry) % SWAPFILE_CLUSTER;
}

Lookup performance
Because an entire cluster table resides in a single page, lookup is O(1): a direct array index, compared with the O(log N) tree traversal that the xarray requires. The reduced indirection also improves cache locality: intensive swap activity confined to a single cluster touches the same page repeatedly, lowering cache-miss rates.
Locking and contention
The original xarray spans 64 MiB and requires an address-space lock on top of the per-cluster locks. Because a swap table covers only a 2 MiB cluster, a single cluster-level lock suffices, eliminating the broader lock hierarchy and reducing lock contention.
Dynamic allocation and memory footprint
Patch 8/9 of the series introduces dynamic allocation and release: when all 512 slots of a cluster become unused, the whole table is freed. In the worst case each cluster holds exactly one used slot, so the overhead is one page per cluster. For a typical mobile configuration (e.g., 5 GiB of zRAM, i.e., 2,560 clusters) the maximum swap-table memory is about 10 MiB.
PFN and swap‑count optimization
Patch 24/28 proposes storing the page frame number (PFN) and swap count directly in the table entry, replacing the per-slot unsigned char counter. This change would further shrink memory usage and simplify the data layout.
Benchmarks
Server-oriented workloads show noticeable throughput gains, because reduced lock contention and better locality translate into higher swapcache query and modification rates. In a kernel-build benchmark the swap table yields a measurable speedup. On embedded devices such as smartphones the impact on user-visible latency is smaller, since the critical path is dominated by zRAM compression speed, lock latency, and UI scheduling, but the architectural improvements remain valuable for scalability.
Swapcache usage scenarios
Memory reclamation: after a page's PTE has been replaced by a swap entry, a subsequent page fault may still find the page resident via the swapcache.
Shared swap entries after fork: parent and child processes share the same swap entry; the first process to swap‑in populates swapcache, allowing the other to hit the cache immediately.
Swap‑in readahead: a small region around a faulting PTE is prefetched into swapcache, enabling fast hits when the actual fault occurs.
References
https://lore.kernel.org/linux-mm/[email protected]/
https://lore.kernel.org/linux-mm/[email protected]/