Unveiling JDK NIO File IO: How Linux Page Cache and Kernel IO Accelerate Reads

This article deeply explores JDK NIO's file reading and writing mechanisms, revealing how Buffered and Direct IO interact with Linux's page cache, the kernel's radix tree, prefetch algorithms, and dirty page management, while providing practical code examples and performance insights for developers.

Bin's Tech Cabin

1. Introduction

The article builds on previous posts about Linux kernel IO models and JDK NIO ByteBuffer design, linking socket file structures in the kernel with Java's NIO APIs. It aims to connect kernel‑level file operations with high‑level JDK NIO usage.

2. JDK NIO Reading an Ordinary File

Using FileChannel and a heap ByteBuffer, JDK reads a file as follows:

FileChannel fileChannel = new RandomAccessFile(new File("file-read-write.txt"), "rw").getChannel();
ByteBuffer heapByteBuffer = ByteBuffer.allocate(4096);
fileChannel.read(heapByteBuffer);

The FileChannelImpl class holds a FileDescriptor and a NativeDispatcher. Its read method ultimately calls IOUtil.read, which creates a temporary DirectByteBuffer and invokes the native method read0 in FileDispatcherImpl.c:

JNIEXPORT jint JNICALL Java_sun_nio_ch_FileDispatcherImpl_read0(JNIEnv *env, jclass clazz,
    jobject fdo, jlong address, jint len) {
    jint fd = fdval(env, fdo);
    void *buf = (void *)jlong_to_ptr(address);
    return convertReturnVal(env, read(fd, buf, len), JNI_TRUE);
}
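The staging copy through a temporary DirectByteBuffer only applies to heap buffers. As a minimal sketch, assuming a scratch file created just for the demonstration, the following reads through a direct buffer so that the native read can use the buffer's address as is. The class name DirectReadSketch and the helper readAll are hypothetical, and the sketch still copies into a heap array at the end simply to build a String.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class DirectReadSketch {

    // Reads a whole file through a direct buffer. Because the buffer already lives
    // in native memory, FileChannel.read can pass its address to the native read
    // without staging the data in a temporary DirectByteBuffer first.
    static String readAll(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer direct = ByteBuffer.allocateDirect(4096);
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            byte[] chunk = new byte[direct.capacity()];
            while (channel.read(direct) != -1) {   // -1 signals end of file
                direct.flip();                      // switch from filling to draining
                int n = direct.remaining();
                direct.get(chunk, 0, n);
                bytes.write(chunk, 0, n);
                direct.clear();                     // make the buffer writable again
            }
            return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path scratch = Files.createTempFile("nio-read", ".txt"); // hypothetical scratch file
        try {
            Files.write(scratch, "hello page cache".getBytes(StandardCharsets.UTF_8));
            System.out.println(readAll(scratch));
        } finally {
            Files.deleteIfExists(scratch);
        }
    }
}
```

Whether the saved copy matters depends on the workload; for small, infrequent reads the heap buffer version is usually simpler and fast enough.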

In the kernel the system call is defined as:

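SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count) {
    struct fd f = fdget_pos(fd);
    ssize_t ret = -EBADF;                 // returned if fd does not refer to an open file
    if (f.file) {
        loff_t pos = file_pos_read(f.file);   // current file offset
        ret = vfs_read(f.file, buf, count, &pos);
        if (ret >= 0)
            file_pos_write(f.file, pos);      // advance the offset past the bytes read
        fdput_pos(f);
    }
    return ret;
}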

The vfs_read function dispatches to the file system via file_operations. For ext4 the relevant pointer is ext4_file_read_iter, which eventually calls generic_file_read_iter. This function decides between Buffered IO and Direct IO based on the IOCB_DIRECT flag.

3. Page Cache Fundamentals

Linux caches file data in the page cache, represented by struct address_space. Each file has a single address_space shared among all processes that open it. The cache stores pages in a radix tree (struct radix_tree_root), enabling fast lookup by page index.

struct address_space {
    struct inode *host;
    struct radix_tree_root page_tree;
    unsigned long nrpages;
    const struct address_space_operations *a_ops;
};

Pages are described by struct page, which contains an index (the page number) and flags indicating state such as PG_dirty or PG_writeback.

struct page {
    unsigned long flags;
    struct address_space *mapping;
    unsigned long index;
};

The radix tree nodes (struct radix_tree_node) hold 64 pointers (slots) to child nodes or leaf pages. Tags in the node (tags[][]) allow fast queries for pages with specific flags (e.g., dirty pages).
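To make the 64-slot layout concrete, here is a small sketch of the index arithmetic, assuming 6 bits of the page index are consumed per tree level. The class and method names (RadixSlotPath, height, slotPath) are illustrative and are not kernel identifiers.

```java
import java.util.Arrays;

final class RadixSlotPath {
    static final int MAP_SHIFT = 6;                    // 2^6 = 64 slots per node
    static final int MAP_MASK = (1 << MAP_SHIFT) - 1;  // 0x3f

    // Number of tree levels needed to address every page index up to maxIndex.
    static int height(long maxIndex) {
        int levels = 1;
        while ((maxIndex >>>= MAP_SHIFT) != 0) {
            levels++;
        }
        return levels;
    }

    // Slot chosen at each level, listed from the root node down to the leaf node.
    static int[] slotPath(long index, int height) {
        int[] path = new int[height];
        for (int level = 0; level < height; level++) {
            int shift = (height - 1 - level) * MAP_SHIFT;
            path[level] = (int) ((index >>> shift) & MAP_MASK);
        }
        return path;
    }

    public static void main(String[] args) {
        long index = 4097;                 // page index 64*64 + 1
        int h = height(index);
        System.out.println("height=" + h + " path=" + Arrays.toString(slotPath(index, h)));
    }
}
```

With 4 KB pages, a three-level tree in this model covers 64^3 pages, i.e. files up to 1 GB, which is why lookups stay shallow for most files.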

4. Page Cache Lookup

To locate a page, the kernel uses the page index (a pgoff_t value). The function find_get_page is a thin wrapper around pagecache_get_page, which walks the radix tree by consuming the index six bits per level, starting from the most significant group at the root.

static inline struct page *find_get_page(struct address_space *mapping, pgoff_t offset) {
    return pagecache_get_page(mapping, offset, 0, 0);
}

If the page is not present, find_get_page returns NULL because it passes no flags. Only when the caller sets FGP_CREAT does pagecache_get_page allocate a new page, insert it into the radix tree, and return it.
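The lookup can be modeled in a few lines. The following is a toy sketch, not kernel code: it assumes a fixed three-level tree with 64 slots per node, and it uses a boolean create parameter as a stand-in for a create flag such as FGP_CREAT. All names (ToyPageCache, find, findOrCreate) are hypothetical.

```java
final class ToyPageCache {
    private static final int SHIFT = 6;            // 6 index bits per level
    private static final int SLOTS = 1 << SHIFT;   // 64 slots per node
    private static final int HEIGHT = 3;           // fixed height: indexes 0 .. 2^18 - 1

    static final class Page {
        final long index;                          // page number within the file
        final byte[] data = new byte[4096];
        Page(long index) { this.index = index; }
    }

    private final Object[] root = new Object[SLOTS];
    private long nrpages;                          // number of cached pages

    // Pure lookup: returns null on a miss.
    Page find(long index) { return walk(index, false); }

    // Lookup that allocates and inserts the page on a miss.
    Page findOrCreate(long index) { return walk(index, true); }

    long nrpages() { return nrpages; }

    private Page walk(long index, boolean create) {
        if (index < 0 || index >= (1L << (SHIFT * HEIGHT))) {
            throw new IndexOutOfBoundsException("index out of range: " + index);
        }
        Object[] node = root;
        // Descend through the interior levels, most significant bits first.
        for (int level = HEIGHT - 1; level > 0; level--) {
            int slot = (int) ((index >>> (level * SHIFT)) & (SLOTS - 1));
            Object[] child = (Object[]) node[slot];
            if (child == null) {
                if (!create) return null;          // miss on the way down
                child = new Object[SLOTS];
                node[slot] = child;
            }
            node = child;
        }
        int leafSlot = (int) (index & (SLOTS - 1));
        Page page = (Page) node[leafSlot];
        if (page == null && create) {
            page = new Page(index);
            node[leafSlot] = page;
            nrpages++;
        }
        return page;
    }

    public static void main(String[] args) {
        ToyPageCache cache = new ToyPageCache();
        System.out.println("miss: " + cache.find(4097));
        Page page = cache.findOrCreate(4097);
        System.out.println("hit: " + (cache.find(4097) == page) + ", nrpages=" + cache.nrpages());
    }
}
```

The real structure also grows and shrinks its height, tracks tags per slot, and handles locking and reference counts, none of which this sketch attempts.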

5. Buffered IO vs Direct IO

Buffered IO reads/writes go through the page cache. A read copies data from the cache into a user‑space buffer (one CPU copy). A write copies data from the user‑space buffer into the cache (again one CPU copy) and marks the page dirty; the kernel later writes dirty pages back to disk.

Direct IO bypasses the page cache. Data moves by DMA directly between the user‑space buffer and the storage device, which eliminates the CPU copy through the cache but requires the buffer address, file offset, and I/O size to be aligned, typically to the block size of the underlying device.

6. Direct IO in Java

Since JDK 10, Direct IO can be requested with ExtendedOpenOption.DIRECT:

Path p = Paths.get("file-read-write.txt");
FileChannel fc = FileChannel.open(p, StandardOpenOption.WRITE, ExtendedOpenOption.DIRECT);

When O_DIRECT is set, the kernel’s ext4_direct_IO path is taken, which calls the block device’s __blockdev_direct_IO to perform DMA transfers.
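Direct IO places alignment requirements on the buffer handed to the channel. A minimal sketch of preparing such a buffer with ByteBuffer.alignedSlice follows; it assumes a block size of 4096 bytes purely for illustration (on JDK 10 and later the real value can typically be queried with Files.getFileStore(path).getBlockSize()). The class name AlignedBufferSketch and the helper allocateAligned are hypothetical.

```java
import java.nio.ByteBuffer;

final class AlignedBufferSketch {

    // Returns a direct buffer whose start address and capacity are both
    // multiples of blockSize (blockSize must be a power of two).
    static ByteBuffer allocateAligned(int size, int blockSize) {
        if (size <= 0 || size % blockSize != 0) {
            throw new IllegalArgumentException("size must be a positive multiple of blockSize");
        }
        // Over-allocate so that an aligned window of `size` bytes always fits,
        // wherever the allocator happens to place the raw buffer.
        ByteBuffer raw = ByteBuffer.allocateDirect(size + blockSize - 1);
        ByteBuffer aligned = raw.alignedSlice(blockSize);
        aligned.limit(size);
        return aligned;
    }

    public static void main(String[] args) {
        int blockSize = 4096;   // assumed block size, not queried from the file system
        ByteBuffer buffer = allocateAligned(2 * blockSize, blockSize);
        System.out.println("capacity=" + buffer.capacity()
                + " alignmentOffset=" + buffer.alignmentOffset(0, blockSize));
    }
}
```

A buffer prepared this way would then be passed to a channel opened with ExtendedOpenOption.DIRECT, with the channel position and the transfer length kept at multiples of the same block size.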

7. File Prefetch (Read‑Ahead)

The kernel predicts sequential reads and prefetches pages into the cache. It maintains two windows in struct file_ra_state:

Current window : pages already cached and ready for immediate read.

Ahead window : pages that will be fetched asynchronously.

The algorithm adjusts window sizes based on access patterns (sequential vs random) and the posix_fadvise advice flags (POSIX_FADV_NORMAL, POSIX_FADV_SEQUENTIAL, POSIX_FADV_RANDOM).

int posix_fadvise(int fd, off_t offset, off_t len, int advice);

Prefetch is triggered in several ways: normal reads, explicit readahead(), posix_fadvise(POSIX_FADV_WILLNEED), and memory‑mapped file accesses.

8. Prefetch Implementation

The core functions are:

page_cache_sync_readahead : entry point for synchronous read‑ahead.

ondemand_readahead : decides window sizes and when to start async prefetch.

__do_page_cache_readahead : allocates pages and initiates the actual I/O.

static void page_cache_sync_readahead(struct address_space *mapping,
    struct file_ra_state *ra, struct file *filp,
    pgoff_t offset, unsigned long req_size) {
    if (!ra->ra_pages)
        return; // prefetch disabled
    if (filp && (filp->f_mode & FMODE_RANDOM)) {
        force_page_cache_readahead(mapping, filp, offset, req_size);
        return;
    }
    ondemand_readahead(mapping, ra, filp, false, offset, req_size);
}
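The effect of access patterns on the window can be illustrated with a deliberately simplified model. This is a toy, not the kernel's algorithm: it assumes the window starts at 4 pages, doubles on every sequential read up to a cap, and collapses to zero on a random read. The names ReadAheadModel, INITIAL_WINDOW, and onRead are hypothetical.

```java
final class ReadAheadModel {
    private static final int INITIAL_WINDOW = 4;   // assumed starting size in pages

    private final int maxPages;                    // cap, playing the role of ra_pages
    private long lastIndex = -1;                   // page index of the previous read
    private int window;                            // current prefetch size in pages

    ReadAheadModel(int maxPages) {
        this.maxPages = maxPages;
    }

    // Returns how many pages this model would prefetch for a read of page `index`.
    int onRead(long index) {
        if (maxPages == 0) {
            return 0;                              // prefetch disabled, like the !ra->ra_pages check
        }
        boolean sequential = index == lastIndex + 1;
        lastIndex = index;
        if (!sequential) {
            window = 0;                            // random access: stop prefetching
        } else if (window == 0) {
            window = Math.min(INITIAL_WINDOW, maxPages);
        } else {
            window = Math.min(window * 2, maxPages);
        }
        return window;
    }

    public static void main(String[] args) {
        ReadAheadModel model = new ReadAheadModel(32);
        long[] reads = {0, 1, 2, 3, 4, 100, 101};
        for (long index : reads) {
            System.out.println("read page " + index + " -> prefetch " + model.onRead(index) + " pages");
        }
    }
}
```

The kernel makes its decision when a read reaches a marked page inside the window rather than on every read, and its growth rules are more nuanced, so the numbers here only show the general shape of the behavior.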

9. JDK NIO Writing an Ordinary File

Writing with a heap ByteBuffer follows these steps:

Copy data from the JVM heap buffer to a temporary DirectByteBuffer (first copy).

Invoke the native method write0, which issues the write system call (first context switch).

Kernel copies data from the user buffer into the page cache (iov_iter_copy_from_user_atomic, second copy).

The page is marked dirty; later the kernel may write it back to disk (optional third copy via DMA).

FileChannel fileChannel = new RandomAccessFile(new File("file-read-write.txt"), "rw").getChannel();
ByteBuffer heapByteBuffer = ByteBuffer.allocate(4096);
fileChannel.write(heapByteBuffer);

In the JDK, IOUtil.write creates a temporary DirectByteBuffer, copies the heap buffer into it, and calls FileDispatcher.write, which maps to the native write0 implementation:

JNIEXPORT jint JNICALL Java_sun_nio_ch_FileDispatcherImpl_write0(JNIEnv *env, jclass clazz,
    jobject fdo, jlong address, jint len) {
    jint fd = fdval(env, fdo);
    void *buf = (void *)jlong_to_ptr(address);
    return convertReturnVal(env, write(fd, buf, len), JNI_FALSE);
}

The kernel’s vfs_write eventually calls generic_perform_write, which invokes the file system’s write_begin and write_end operations (e.g., ext4_write_begin, ext4_write_end) to allocate a page, copy data, and mark it dirty.
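Because a buffered write returns once the data sits in a dirty page, an application that needs the data on disk before continuing has to ask for it. The sketch below writes through a heap buffer and then calls FileChannel.force, which on Linux is expected to map to fsync or fdatasync. The class name DurableWriteSketch and the helper writeAndSync are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class DurableWriteSketch {

    // Writes the payload and blocks until the kernel reports it flushed.
    // Returns the resulting file size in bytes.
    static long writeAndSync(Path path, byte[] payload) throws IOException {
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer heap = ByteBuffer.wrap(payload);   // heap buffer, ready to drain
            while (heap.hasRemaining()) {
                channel.write(heap);                      // data lands in the page cache
            }
            channel.force(true);                          // flush file data and metadata
            return channel.size();
        }
    }

    public static void main(String[] args) throws IOException {
        Path scratch = Files.createTempFile("nio-write", ".txt"); // hypothetical scratch file
        try {
            long size = writeAndSync(scratch, "hello dirty page".getBytes(StandardCharsets.UTF_8));
            System.out.println("bytes on file: " + size);
        } finally {
            Files.deleteIfExists(scratch);
        }
    }
}
```

Calling force on every write gives up most of the benefit of the page cache, so it is usually reserved for points where durability actually matters, such as a commit record.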

10. Dirty Page Write‑Back

Dirty pages are flushed to disk based on several conditions:

A periodic timer wakes the flusher thread (controlled by dirty_writeback_centisecs).

If the amount of dirty memory exceeds dirty_background_ratio or dirty_background_bytes, the kernel wakes the flusher asynchronously.

If dirty memory exceeds dirty_ratio or dirty_bytes, the writing process blocks and forces a synchronous write‑back.

Pages older than dirty_expire_centisecs are considered stale and are written back on the next flusher wake‑up.

The flusher thread runs on the workqueue bdi_wq. Its main loop (wb_workfn) calls wb_do_writeback, which uses wb_writeback to select pages older than the expiration interval:

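static long wb_writeback(struct bdi_writeback *wb, struct wb_writeback_work *work) {
    unsigned long oldest_jif = jiffies;   // cutoff: only inodes dirtied before this time qualify
    work->older_than_this = &oldest_jif;
    if (work->for_kupdate)                // periodic write-back honors dirty_expire_centisecs
        oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
    // ... iterate over inodes and write back pages older than this jiffies value ...
}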

11. Kernel Parameters Controlling Write‑Back

The relevant /proc/sys/vm knobs are:

dirty_background_ratio (default 10 % of available memory) : async wake‑up threshold.

dirty_background_bytes : absolute byte threshold for async wake‑up (takes precedence over the ratio).

dirty_ratio (default 20 %) : synchronous write‑back threshold for the writing process.

dirty_bytes : absolute byte threshold for synchronous write‑back (takes precedence over the ratio).

dirty_expire_centisecs (default 3000 → 30 s) : age after which a dirty page is considered stale.

dirty_writeback_centisecs (default 500 → 5 s) : interval at which the flusher timer runs.
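How the ratio and byte settings combine can be sketched as simple arithmetic. The model below assumes that a non-zero byte setting wins over the ratio, as described above, and that ratios are percentages of available memory. The names DirtyThresholds, threshold, classify, and Action are hypothetical and do not correspond to kernel symbols.

```java
final class DirtyThresholds {

    enum Action { NONE, BACKGROUND_FLUSH, THROTTLE_WRITER }

    // Effective threshold in bytes: a non-zero byte setting takes precedence,
    // otherwise the ratio is applied to the available memory.
    static long threshold(long bytesSetting, int ratioPercent, long availableBytes) {
        return bytesSetting > 0 ? bytesSetting : availableBytes * ratioPercent / 100;
    }

    // What this model expects to happen for a given amount of dirty memory.
    static Action classify(long dirtyBytes, long backgroundThreshold, long hardThreshold) {
        if (dirtyBytes > hardThreshold) {
            return Action.THROTTLE_WRITER;    // writer blocks and helps write back
        }
        if (dirtyBytes > backgroundThreshold) {
            return Action.BACKGROUND_FLUSH;   // flusher is woken asynchronously
        }
        return Action.NONE;
    }

    public static void main(String[] args) {
        long available = 8L << 30;            // assume 8 GiB of available memory
        long background = threshold(0, 10, available);   // dirty_background_ratio = 10
        long hard = threshold(0, 20, available);         // dirty_ratio = 20
        long dirty = 1L << 30;                           // assume 1 GiB currently dirty
        System.out.println("background=" + background + " hard=" + hard
                + " action=" + classify(dirty, background, hard));
    }
}
```

Working the numbers through like this is a quick way to see how much dirty data a given machine may accumulate before writers start to stall.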

These parameters can be changed temporarily by writing to the corresponding files in /proc/sys/vm, via the sysctl command, or permanently by adding entries to /etc/sysctl.conf (or a file under /etc/sysctl.d/) and reloading with sysctl -p.

# Example permanent change
vm.dirty_background_ratio = 15
vm.dirty_ratio = 30
vm.dirty_writeback_centisecs = 300

12. Summary

The article walks through the complete path from JDK NIO APIs down to Linux kernel structures, explaining how Buffered IO leverages the page cache and radix tree, how Direct IO bypasses the cache, how the kernel prefetches data, and how dirty pages are managed and written back. Understanding these mechanisms helps developers choose the right IO mode and tune kernel parameters for optimal performance and data safety.
