Why Page Cache Is the Hidden Engine Behind Linux I/O Performance
The article explains how Linux’s page cache bridges memory and disk, detailing its read/write mechanisms, dirty page handling, read‑ahead optimization, kernel parameters, and practical tuning tips for static file serving, databases, and logging, showing why mastering it is essential for performance.
Page Cache Overview
Page cache is the kernel‑managed memory region that mirrors file data on disk. Because RAM access (on the order of nanoseconds) is orders of magnitude faster than block devices (tens of microseconds for SSDs, milliseconds for spinning disks), the kernel checks page cache first; a hit is served from memory without touching the device, a miss triggers a disk read.
Core functions
Accelerate reads – hot files (static web assets, configuration files) are served from memory with near‑zero latency.
Merge writes – multiple small write() calls are coalesced into larger block writes, reducing I/O calls and extending SSD lifespan.
Zero‑copy – system calls such as sendfile() can transfer data directly from page cache to a socket without copying through user space, a technique used by high‑throughput services like Kafka.
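The zero‑copy path can be sketched as follows. This is a minimal illustration, not how any particular server implements it: send_whole_file() is a hypothetical helper name, and it assumes a Linux system where sendfile() accepts any writable descriptor as the destination (a connected socket in practice).

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: stream the whole file at `path` to `out_fd`
 * without copying the data through a user-space buffer.
 * Returns the number of bytes sent, or -1 on error. */
ssize_t send_whole_file(int out_fd, const char *path)
{
    int in_fd = open(path, O_RDONLY);
    if (in_fd < 0)
        return -1;

    struct stat st;
    if (fstat(in_fd, &st) < 0) {
        close(in_fd);
        return -1;
    }

    off_t offset = 0;               /* sendfile() advances this for us */
    while (offset < st.st_size) {
        ssize_t sent = sendfile(out_fd, in_fd, &offset,
                                (size_t)(st.st_size - offset));
        if (sent < 0) {
            close(in_fd);
            return -1;
        }
        if (sent == 0)              /* file shrank while we were sending */
            break;
    }
    close(in_fd);
    return (ssize_t)offset;
}
```

The loop matters because sendfile() may transfer fewer bytes than requested, for example when a socket buffer fills up.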
Read Path
Cache hit
When a process calls read(), the kernel looks up the file offset in the inode's page‑cache index (a radix tree, reimplemented as an XArray since Linux 4.20). If the page is present, the kernel copies it to user space via copy_to_user() and returns, bypassing the block device entirely.
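Whether a read will hit the cache can be estimated from user space with mincore(). The sketch below is an illustration under assumptions: cached_pages() is a hypothetical helper name, and on recent kernels mincore() reports real residency only for files the caller owns or can write to, so the result should be treated as a hint.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count how many pages of `path` are resident in
 * the page cache. Stores the file's total page count in *total_pages.
 * Returns the resident count, or -1 on error (including empty files). */
long cached_pages(const char *path, long *total_pages)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    long page = sysconf(_SC_PAGESIZE);
    long npages = (long)((st.st_size + page - 1) / page);

    /* Mapping the file does not read it; it only gives mincore() a range. */
    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    unsigned char *vec = malloc((size_t)npages);
    long resident = -1;
    if (vec && mincore(map, (size_t)st.st_size, vec) == 0) {
        resident = 0;
        for (long i = 0; i < npages; i++)
            resident += vec[i] & 1;   /* low bit set means "in memory" */
    }
    free(vec);
    munmap(map, (size_t)st.st_size);
    *total_pages = npages;
    return resident;
}
```

Calling it before and after reading a large file typically shows the resident count rising toward the total.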
Cache miss
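If the page is absent, the kernel must fetch it from storage. (A page fault is raised only when the miss happens through a memory‑mapped access; a read() miss is handled inside the system call.) The VFS invokes the filesystem’s readpage() or readpages() (replaced by read_folio() and readahead() in recent kernels), which translate the request into block‑device I/O. The I/O scheduler (e.g., CFQ or Deadline on older kernels; mq‑deadline, BFQ, or Kyber since Linux 5.0) orders the request, the block driver issues a DMA transfer, and the data lands in a newly allocated page (4 KB on most architectures). The page is then inserted into the page‑cache index for fast future lookup. The kernel also performs read‑ahead: when it detects sequential access it pre‑fetches the following pages, starting with a small window (on the order of 16 KB–32 KB) and growing it up to the per‑device read_ahead_kb limit, which is 128 KB by default on most systems.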
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(void) {
    char buf[1024];
    int fd = open("example.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    /* Served from page cache on a hit, from disk on a miss. */
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0) perror("read");
    else printf("Read %zd bytes\n", n);
    close(fd);
    return 0;
}

Write Path & Dirty‑Page Lifecycle
Write system call
write() copies data from user buffers into page cache via copy_from_user() and returns as soon as the copy is done; the disk write happens later. The affected pages are marked PG_dirty because their contents differ from the on‑disk image.
#include <stdio.h>
#include <string.h>   /* strlen */
#include <unistd.h>
#include <fcntl.h>

int main(void) {
    const char *msg = "Hello, page cache!";
    int fd = open("test.txt", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    /* Success here means the data is in page cache, not yet on disk. */
    if (write(fd, msg, strlen(msg)) < 0) perror("write");
    close(fd);
    return 0;
}

Dirty‑page write‑back mechanisms
Timed flush – a kernel thread wakes every vm.dirty_writeback_centisecs (default 500, i.e. 5 s) and writes back dirty pages older than vm.dirty_expire_centisecs (default 3000, i.e. 30 s).
Ratio‑based flush – when dirty pages exceed vm.dirty_background_ratio (default 10 % of available memory) a background thread starts flushing; if they exceed vm.dirty_ratio (default 20 %) the writing processes are throttled and must wait for write‑back until the ratio drops.
Explicit sync – applications call fsync() or fdatasync() to force immediate write‑back.
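The ratio thresholds are easier to reason about as byte counts. The sketch below is a worked calculation under assumptions: both function names are hypothetical, available_bytes stands for free plus reclaimable memory (which is what the kernel measures the ratios against, not total RAM), and it ignores vm.dirty_bytes and vm.dirty_background_bytes, which take precedence when set to a nonzero value.

```c
#include <stdint.h>
#include <stdio.h>

/* Read a single integer from a procfs-style file such as
 * /proc/sys/vm/dirty_ratio. Returns -1 if it cannot be read. */
long read_tunable(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    long value = -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Bytes of dirty data allowed for a given percentage threshold.
 * Dividing first avoids overflow at the cost of negligible rounding. */
uint64_t dirty_limit_bytes(uint64_t available_bytes, unsigned ratio_percent)
{
    return available_bytes / 100 * ratio_percent;
}
```

With 8 GB available and the default ratios, this gives roughly 800 MB of dirty data before background flushing starts and 1.6 GB before writers are throttled. A live value can be obtained with read_tunable("/proc/sys/vm/dirty_background_ratio").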
// Force write‑back of a single file (needs <fcntl.h> and <unistd.h>)
int fd = open("log.txt", O_WRONLY | O_CREAT, 0644);
if (fd >= 0) {
    write(fd, "data", 4);   // creates a dirty page in the page cache
    fsync(fd);              // blocks until the data reaches the device
    close(fd);
}

Key Tunable Parameters
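vm.pagecache_limit_mb – hard limit for page‑cache memory (MB). This knob is not part of the mainline kernel; it exists only in some vendor kernels (for example SUSE Linux Enterprise), so confirm it is present before relying on it.
vm.swappiness – 0‑100 (0‑200 since Linux 5.8); it sets how willing the kernel is to swap out anonymous memory instead of reclaiming page cache. Low values keep application memory in RAM and reclaim cache first; high values favor swapping and preserve more cache.
vm.dirty_ratio and vm.dirty_background_ratio – thresholds that trigger blocking or background write‑back, respectively.
vm.dirty_writeback_centisecs – interval for the write‑back thread (centiseconds).
/sys/block/<dev>/queue/read_ahead_kb – per‑device read‑ahead size (KB).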
# View and adjust page‑cache limit (only on kernels that provide this knob)
sysctl vm.pagecache_limit_mb
sysctl -w vm.pagecache_limit_mb=4096
# Reduce swap pressure for I/O‑heavy workloads
sysctl -w vm.swappiness=10
# Tune dirty‑page thresholds
sysctl -w vm.dirty_background_ratio=10
sysctl -w vm.dirty_ratio=20
# Change read‑ahead for /dev/sda (large sequential files)
echo 512 > /sys/block/sda/queue/read_ahead_kb

Cache Management Operations
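For testing or emergency memory reclamation, the kernel exposes /proc/sys/vm/drop_caches:
echo 1 > /proc/sys/vm/drop_caches – drop only page cache.
echo 2 > /proc/sys/vm/drop_caches – drop dentries and inodes.
echo 3 > /proc/sys/vm/drop_caches – drop all three.
Only clean pages are dropped; dirty pages stay until they are written back, so run sync first if the goal is to free as much as possible.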
These commands free RAM instantly but cause a burst of disk I/O as subsequent reads repopulate the cache; they should never be scheduled automatically in production.
Typical Workloads
Static file servers (Nginx, Apache) – an image file read once loads into page cache; millions of later requests hit the cache, keeping latency near zero.
Databases (MySQL, PostgreSQL) – table and index files are cached by the kernel. Even if the DB’s own buffer pool misses, a page‑cache hit avoids a disk read.
Logging – log lines are first written to page cache and flushed asynchronously. Critical logs should be followed by fsync() to guarantee durability.
Performance Trade‑offs & Tuning Guidance
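Page‑cache size vs. application memory – on a 16 GB server, capping the cache at 4 GB (vm.pagecache_limit_mb=4096, on kernels that provide this knob) leaves room for Java or DB processes; hit rates stay high only if the hot working set fits in that 4 GB. On mainline kernels, which lack the knob, cgroup memory limits are the usual way to bound a workload's cache.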
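Swappiness – set to 10–20 for services whose heap must stay in RAM (databases, JVMs), accepting that page cache is reclaimed first; keep the default of 60 or go higher when file‑cache hit rate matters more than keeping idle anonymous pages resident (static content, log aggregation).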
Read‑ahead – large sequential workloads (video, backup) benefit from 256 KB–1 MB read‑ahead; small random files (configs, short logs) should use 64 KB or less to avoid cache pollution.
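Direct I/O (O_DIRECT) – bypasses page cache, useful for bulk backups or DB bulk loads where cache contamination would hurt other workloads. The buffer address, file offset, and transfer length generally must be aligned to the device's logical block size; aligning to the page size is a safe choice.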
Explicit sync – use fsync() for transaction logs or any data that must survive a crash; use fdatasync() when only the data, plus the metadata needed to read it back (such as file size), must be durable, which avoids flushing timestamps and reduces overhead.
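The explicit‑sync guidance can be sketched as a small helper for critical log records. This is an illustration under assumptions: append_durably() is a hypothetical name, and for a newly created file the directory entry itself is only durable after an additional fsync() on the parent directory, which this sketch omits.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: append `len` bytes to `path` and return only after
 * the kernel reports the data as written to the device.
 * Returns 0 on success, -1 on failure. */
int append_durably(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* may write fewer bytes than asked */
        if (n < 0) {
            if (errno == EINTR)
                continue;                /* interrupted by a signal: retry */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Flush the data and the metadata needed to read it back (file size),
     * but not timestamps. Swap in fsync() to flush all metadata too. */
    int rc = fdatasync(fd);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

Each call pays the latency of a device flush, so high‑volume loggers usually batch several records per call rather than syncing every line.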
Summary of Critical Commands
# Check current tunables
sysctl vm.swappiness
sysctl vm.dirty_ratio
cat /sys/block/sda/queue/read_ahead_kb
# Adjust for a write‑intensive DB
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_writeback_centisecs=1000 # 10 s interval
# Drop caches (root only, for testing)
echo 3 > /proc/sys/vm/drop_caches
Deepin Linux
Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
