Deep Dive into Linux Kernel Page Cache Xarray: Problem, Analysis, and Optimizations
The article examines a long‑standing hidden bug in the Linux kernel’s page‑cache Xarray that caused occasional data loss with Large Folio support, details its discovery and fix by the TencentOS team, and shows how consolidating multiple tree walks into a single walk in Linux 6.10 reduced insertion latency and improved performance by about ten percent.
The bug lurked in the Page Cache’s use of the Xarray data structure and caused occasional data loss when Large Folio support was enabled in XFS and other filesystems. The TencentOS kernel team discovered and fixed the issue months before the wider community became aware of it, and their patches were later accepted into the mainline kernel and LTS releases.
The discussion begins with an overview of Xarray, a flexible, radix‑tree‑based array that the kernel uses to manage Page Cache entries. An Xarray stores entries indexed by an unsigned 64‑bit index. Basic APIs such as xa_init, xa_store, and xa_load are introduced, followed by a description of the internal node layout (struct xa_node) and how an index is split into 6‑bit offsets, one per level, to traverse the tree.
Key concepts covered include:
Multi‑index entries that allow a single entry to cover a contiguous, naturally aligned range of indices, reducing tree depth and memory usage.
How Page Cache uses Xarray to map file offsets to either a cached folio (pointer) or a shadow entry (value) that records eviction information.
The article then presents a simplified version of the original Page Cache insertion function ( __filemap_add_folio ) from Linux 6.9 and earlier, highlighting five separate tree walks that occur during a single insertion. The code is shown below:
int __filemap_add_folio(struct address_space *mapping,
        struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
{
    XA_STATE(xas, &mapping->i_pages, index);

    xas_set_order(&xas, index, folio_order(folio));

    do {
        /* Walks 1 and 2: probe the order and load the entry, unlocked. */
        unsigned int order = xa_get_order(xas.xa, xas.xa_index);
        void *entry, *old = NULL;

        if (order > folio_order(folio))
            xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
                    order, gfp);
        xas_lock_irq(&xas);
        /* Walk 3: scan the range for conflicting entries. */
        xas_for_each_conflict(&xas, entry) {
            old = entry;
            if (!xa_is_value(entry)) {
                xas_set_err(&xas, -EEXIST);
                goto unlock;
            }
        }

        if (old) {
            if (shadowp)
                *shadowp = old;
            /* Walk 4: re-check the order now that the lock is held. */
            order = xa_get_order(xas.xa, xas.xa_index);
            if (order > folio_order(folio)) {
                xas_split(&xas, old, order);
                xas_reset(&xas);
            }
        }

        /* Walk 5: store the folio. */
        xas_store(&xas, folio);
        if (xas_error(&xas))
            goto unlock;
unlock:
        xas_unlock_irq(&xas);
    } while (xas_nomem(&xas, gfp));

    if (xas_error(&xas))
        goto error;
    return 0;
error:
    return xas_error(&xas);
}

The redundant walks increase latency, especially for large files. The new implementation, merged in Linux 6.10, consolidates the walks, performs the split allocation outside the lock only when needed, and validates that any previously allocated node still matches the tree state before reusing it. The optimized function is shown below:
int __filemap_add_folio(struct address_space *mapping,
        struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
{
    XA_STATE(xas, &mapping->i_pages, index);
    void *alloced_shadow = NULL;
    int alloced_order = 0;

    xas_set_order(&xas, index, folio_order(folio));

    for (;;) {
        int order = -1, split_order = 0;
        void *entry, *old = NULL;

        xas_lock_irq(&xas);
        /* Single walk: find conflicts and the entry order together. */
        xas_for_each_conflict(&xas, entry) {
            old = entry;
            if (!xa_is_value(entry)) {
                xas_set_err(&xas, -EEXIST);
                goto unlock;
            }
            if (order == -1)
                order = xas_get_order(&xas);
        }

        /* The entry may have changed before the lock was retaken. */
        if (alloced_order &&
            (old != alloced_shadow || order != alloced_order)) {
            xas_destroy(&xas);
            alloced_order = 0;
        }

        if (old) {
            if (order > 0 && order > folio_order(folio)) {
                /* No valid pre-allocation: go allocate outside the lock. */
                if (!alloced_order) {
                    split_order = order;
                    goto unlock;
                }
                xas_split(&xas, old, order);
                xas_reset(&xas);
            }
            if (shadowp)
                *shadowp = old;
        }

        xas_store(&xas, folio);
        if (xas_error(&xas))
            goto unlock;
unlock:
        xas_unlock_irq(&xas);

        /* A split is needed: allocate here, then retry under the lock. */
        if (split_order) {
            xas_split_alloc(&xas, old, split_order, gfp);
            if (xas_error(&xas))
                goto error;
            alloced_shadow = old;
            alloced_order = split_order;
            xas_reset(&xas);
            continue;
        }

        if (!xas_nomem(&xas, gfp))
            break;
    }

    if (xas_error(&xas))
        goto error;
    return 0;
error:
    return xas_error(&xas);
}

The new logic reduces the number of tree walks from up to five to a single walk in the common case, yielding roughly a 10 % performance improvement in fio benchmarks. The root cause of the original bug was the call to xas_split_alloc outside the lock based on state read without it: if the entry changed before the lock was taken, the pre-allocated nodes carried stale contents that later interfered with xas_store, corrupting the Xarray.
Community discussion, Linus Torvalds’ endorsement, and the subsequent inclusion of the patches into 6.10 and LTS kernels demonstrate the collaborative nature of kernel development. The article concludes with a brief note on the broader impact of the fix and invites readers to explore more kernel internals.
Tencent Cloud Developer