From xarray to Swap Table: Redesigning Linux Swapcache for Speed and Simplicity
The article analyzes the new swap table implementation that replaces the kernel's xarray‑based swapcache with a per‑cluster array, offering O(1) lookups, reduced lock contention, lower memory overhead, and measurable performance gains demonstrated by real‑world benchmarks.
Recently, a swap table co‑developed by Chris Li (Google) and Kairui Song (Tencent) has generated significant interest in the Linux kernel community as a simpler, higher‑performance replacement for the existing xarray‑based swapcache.
In the kernel, the xarray indexes a file's page cache: it stores only the sparse set of cached pages in a tree-like structure, giving O(log N) lookup, much like a multi-level page table. While efficient for large sparse ranges, the tree overhead becomes wasteful when the managed range is as small as 2 MiB.
Swapcache mirrors the file page-cache mechanism for anonymous pages: given a swap entry's offset, it returns the in-memory page ("folio") occupying that swap slot, if one is present. The typical usage scenarios are listed in Appendix 1.
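For orientation, the old lookup path boiled down to a tree walk over the i_pages xarray of one of the per-range swapcache address_space instances. A minimal sketch of that idea (simplified, omitting RCU and error handling):

/* Sketch of the old xarray-based lookup: each address_space covered
 * a fixed range of swap space, and the folio for a swap offset was
 * found by walking the i_pages tree nodes. */
static struct folio *old_swapcache_lookup(struct address_space *as,
                                          pgoff_t offset)
{
        /* xa_load() walks tree nodes: O(log N) plus pointer chasing. */
        return xa_load(&as->i_pages, offset);
}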
The swap table redesign partitions a swap device (e.g., a 2 GiB partition) into 2 MiB clusters. Each cluster contains 512 slots, and each slot holds an unsigned long that can store a folio pointer, a shadow entry used for refault detection, or NULL. On 64-bit architectures a cluster's table occupies exactly 4 KiB (512 × 8 bytes); on 32-bit it occupies 2 KiB. The allocation routine is:
static int swap_table_alloc_table(struct swap_cluster_info *ci)
{
        /* A cluster must never end up with two tables. */
        WARN_ON(ci->table);
        /* One unsigned long per slot: 512 * 8 bytes = 4 KiB on 64-bit. */
        ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
        if (!ci->table)
                return -ENOMEM;
        return 0;
}

Lookup uses the high bits of the swap offset to select the cluster and the low bits to index one of the 512 slots:
/* Select the 2 MiB cluster from the high bits of the swap offset. */
static inline struct swap_cluster_info *swp_cluster(swp_entry_t entry)
{
        return swp_offset_cluster(swp_info(entry), swp_offset(entry));
}

/* Index one of the SWAPFILE_CLUSTER (512) slots from the low bits. */
static inline unsigned int swp_cluster_offset(swp_entry_t entry)
{
        return swp_offset(entry) % SWAPFILE_CLUSTER;
}

This yields O(1) access time, compared with the xarray's O(log N), and eliminates the tree-node traversal entirely. Because a cluster's entire table resides in a single page, intensive swap-out activity within one cluster enjoys strong locality, reducing cache misses relative to the slab-allocated node chains of the xarray implementation.
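Putting the two helpers together, a swapcache read becomes a single array access. A minimal sketch of what that looks like (the helper name and the READ_ONCE() detail are assumptions for illustration, not the exact patch code):

static inline unsigned long swap_table_load(swp_entry_t entry)
{
        struct swap_cluster_info *ci = swp_cluster(entry);

        /* The slot holds a folio pointer, a shadow value, or NULL. */
        return READ_ONCE(ci->table[swp_cluster_offset(entry)]);
}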
Lock granularity is also improved: the original xarray spans a 64 MiB address space per tree, requiring a global lock for the whole tree, whereas the swap table locks only the 2 MiB cluster being accessed.
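Since struct swap_cluster_info already carries a per-cluster spinlock, an update conceptually needs only that one lock. A minimal sketch, assuming the existing ci->lock field (the function name here is hypothetical):

static void swap_table_store(swp_entry_t entry, unsigned long value)
{
        struct swap_cluster_info *ci = swp_cluster(entry);

        /* Only this 2 MiB cluster is locked; threads working in
         * other clusters proceed without contention. */
        spin_lock(&ci->lock);
        ci->table[swp_cluster_offset(entry)] = value;
        spin_unlock(&ci->lock);
}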
Memory overhead is modest: one 4 KiB table per 2 MiB cluster, roughly 0.2% of the swap space. A 5 GiB zRAM device, for instance, needs only about 10 MiB of swap-table memory (5 GiB / 2 MiB = 2560 clusters × 4 KiB each). The worst case is a cluster whose 4 KiB table stays allocated for a single occupied slot; a dynamic-allocation patch (see PATCH 8/9) releases a cluster's table when all its slots are free, further shrinking the footprint.
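The freeing side would be symmetrical. A minimal sketch, assuming a per-cluster usage count and ignoring the locking and RCU details the real patch must handle (names here are hypothetical):

static void swap_table_free_table(struct swap_cluster_info *ci)
{
        /* Hypothetical: only tear down a table once no slot is in use. */
        if (ci->count)
                return;
        kfree(ci->table);
        ci->table = NULL;
}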
Future work includes replacing the folio pointer with a PFN and packing a swap-count field directly into each table entry (PATCH 24/28), which could reduce the per-slot metadata kept outside the table.
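To make that concrete, one possible packing scheme might look like the following (the field widths are purely illustrative assumptions, not the layout the patch will use):

/* Illustrative only: PFN in the high bits, swap count in the low bits. */
#define SWP_TB_COUNT_BITS       6
#define SWP_TB_COUNT_MASK       ((1UL << SWP_TB_COUNT_BITS) - 1)

static inline unsigned long swp_tb_pack(unsigned long pfn, unsigned int count)
{
        return (pfn << SWP_TB_COUNT_BITS) | (count & SWP_TB_COUNT_MASK);
}

static inline unsigned long swp_tb_pfn(unsigned long tb)
{
        return tb >> SWP_TB_COUNT_BITS;
}

static inline unsigned int swp_tb_count(unsigned long tb)
{
        return tb & SWP_TB_COUNT_MASK;
}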
Benchmarks show noticeable speedups: in a kernel-build workload the swap table reduced both swapcache query and modification latency. [Chart: swapcache query/update latency, xarray vs. swap table, kernel-build workload]
For embedded systems such as smartphones, the impact on UI latency or zRAM compression ratio is limited, but the infrastructure improvement is valuable for server-side workloads, where reduced lock contention and better cache locality translate into higher throughput for critical services.
Appendix 1 – Swapcache usage scenarios
During memory reclamation, a page that has already been written to swap but not yet freed remains in the swapcache; a page fault in this window can retrieve the page from the cache and map it back into the process's PTE without any device I/O.
When multiple processes share the same swap entry (e.g., after fork), the first process to swap‑in populates the swapcache, allowing subsequent processes to map the same page without additional I/O.
Swap-in readahead can preload a small region around a faulting PTE into the swapcache, so that subsequent nearby faults hit the cache; a simplified sketch of the common flow behind all three scenarios follows below.
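A minimal model of that flow, checking the swapcache before falling back to device I/O (all helper names here are hypothetical, not the kernel's actual API):

static struct folio *swap_in(swp_entry_t entry)
{
        struct folio *folio;

        /* Hit: the reclaim window, a shared entry, or readahead has
         * already populated the cache, so no device I/O is needed. */
        folio = swapcache_lookup(entry);
        if (folio)
                return folio;

        /* Miss: read from the swap device, then populate the cache
         * so that later faults on the same entry hit. */
        folio = swap_read_folio_from_device(entry);
        swapcache_insert(entry, folio);
        return folio;
}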
References
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/