How Linux Kernel Readahead Boosts File I/O Performance: Design & Implementation Explained
Through detailed code examples and step‑by‑step analysis, this article explains how the Linux 3.12 kernel implements file system readahead, showing how synchronous and asynchronous prefetch windows are initialized, adjusted, and used to improve sequential read performance.
Overview
This article explains the design and implementation of the Linux kernel (3.12) file system readahead mechanism.
Readahead means the file system reads more data than requested and caches it in the page cache, so subsequent reads can be served faster. The article analyzes readahead logic through several scenarios (sequential read, random read, multithreaded interleaved read).
Scenario 1: Sequential Read
// example code
{
...
f = open("file", ...);
ret = read(f, buf, 4096);
ret = read(f, buf, 2 * 4096);
ret = read(f, buf, 4 * 4096);
...
}This scenario is simple: open a file and perform three sequential reads; we examine how the OS performs readahead.
Read 1
On the first read, the kernel looks for the page in the page cache; a cache miss triggers a synchronous readahead.
static void do_generic_file_read(struct file *filp, loff_t *ppos,
read_descriptor_t *desc, read_actor_t actor)
{
...
for (;;) {
...
cond_resched();
find_page:
// if not found, start synchronous readahead
page = find_get_page(mapping, index);
if (!page) {
page_cache_sync_readahead(mapping, ra, filp,
index, last_index - index);
}
...
}
...
}The synchronous readahead logic eventually calls the following function:
// Note: offset and req_size are page counts
static unsigned long ondemand_readahead(
struct address_space *mapping,
struct file_ra_state *ra,
struct file *filp,
bool hit_readahead_marker,
pgoff_t offset,
unsigned long req_size)
{
unsigned long max = max_sane_readahead(ra->ra_pages);
// First read: initialize readahead window
if (!offset)
goto initial_readahead;
...
initial_readahead:
ra->start = offset;
ra->size = get_init_ra_size(req_size, max);
// ra->size >= req_size guaranteed
ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
...
readit:
/* Will this read hit the readahead marker made by itself?
* If so, trigger the marker now and merge the next window.
*/
if (offset == ra->start && ra->size == ra->async_size) {
ra->async_size = get_next_ra_size(ra, max);
ra->size += ra->async_size;
}
return ra_submit(ra, mapping, filp);
}The read logic initializes a readahead window (ra->start, ra->size, ra->async_size). In this example the window is (0,4,3). After submitting the read request, the kernel reads pages 0‑3, where pages 1‑3 are pre‑read and page 1 is marked PAGE_READAHEAD for asynchronous readahead.
(ra->start, ra->size, ra->async_size) = (0,4,3)
When the application reads page 0, the kernel has already fetched pages 0‑3; the application copies data from page 0.
Read 2
The second read requests offset=4096, size=8192, which corresponds to pages 1 and 2. Because of the previous readahead, pages 1 and 2 are already in memory, but page 1 carries the PAGE_READAHEAD flag, triggering an asynchronous readahead.
find_page:
...
page = find_get_page(mapping, index);
if (!page) {
page_cache_sync_readahead(mapping, ra, filp,
index, last_index - index);
page = find_get_page(mapping, index);
if (unlikely(page == NULL))
goto no_cached_page;
}
if (PageReadahead(page)) {
page_cache_async_readahead(mapping, ra, filp,
page, index, last_index - index);
}
...
static unsigned long ondemand_readahead(
struct address_space *mapping,
struct file_ra_state *ra,
struct file *filp,
bool hit_readahead_marker,
pgoff_t offset,
unsigned long req_size)
{
unsigned long max = max_sane_readahead(ra->ra_pages);
/* If:
* 1. sequential read (current offset = previous ra->start + ra->size - ra->async_size)
* 2. offset == (ra->start + ra->size)
*/
if ((offset == (ra->start + ra->size - ra->async_size) ||
offset == (ra->start + ra->size))) {
// advance window
ra->start += ra->size;
ra->size = get_next_ra_size(ra, max);
ra->async_size = ra->size;
goto readit;
}
...
}After this read, the readahead window becomes (0,4,3) → (4,8,8). The asynchronous readahead adds page 4 with the PAGE_READAHEAD flag.
At this point the page cache contains pages 0‑7 as shown.
Read 3
The third sequential read requests pages 3‑6. Because page 4 is still marked PAGE_READAHEAD, accessing it triggers another asynchronous readahead, expanding the window to (12,16,16).
After the three sequential reads and the kernel's readahead behavior, the page cache state is illustrated below.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
