How Linux ext4 Manages Files, Inodes, and Caching Internally
This article explains the design of Linux file systems, focusing on ext4's inode layout, block allocation, extents, directory storage, journaling modes, and the kernel's cached and direct I/O paths, with code snippets taken from the kernel sources.
Linux File System Overview
A Linux file system needs a strict on-disk organization and a set of supporting kernel structures. Its main requirements are:
Files are stored as blocks.
An index region speeds up block location.
Hot files benefit from a caching layer.
Files are grouped into a hierarchical directory tree for easy management.
The kernel maintains in‑memory structures linking files to processes.
ext Series Format – Inode and Block Storage
An ext file system divides the disk into equal‑sized blocks (4 KB by default). Each file has an inode that stores its metadata together with an array, i_block, of EXT4_N_BLOCKS (15) block pointers.
struct ext4_inode {
    __le16 i_mode;                  /* File mode */
    __le16 i_uid;                   /* Low 16 bits of Owner UID */
    __le32 i_size_lo;               /* Size in bytes */
    __le32 i_atime;                 /* Access time */
    __le32 i_ctime;                 /* Inode change time */
    __le32 i_mtime;                 /* Modification time */
    __le32 i_dtime;                 /* Deletion time */
    __le16 i_gid;                   /* Low 16 bits of Group ID */
    __le16 i_links_count;           /* Links count */
    __le32 i_blocks_lo;             /* Blocks count */
    __le32 i_flags;                 /* File flags */
    __le32 i_block[EXT4_N_BLOCKS];  /* Pointers to blocks */
    /* ... other fields ... */
};
The inode records permissions (i_mode), owner UID/GID, size, timestamps, and the block pointers that locate the file's data.
In the classic ext2/ext3 block‑mapping scheme, i_block[0] through i_block[11] are direct pointers that hold up to 12 block addresses. When a file needs more blocks than that, i_block[12] points to an indirect block, i_block[13] to a doubly‑indirect block, and i_block[14] to a triply‑indirect block, forming a multi‑level lookup tree.
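The lookup tree can be made concrete with a small sketch that reports how many levels of indirection a given logical block needs. It assumes 4 KB blocks and 4‑byte block pointers (so 1024 pointers per index block); the function and macro names are hypothetical and are not kernel identifiers.

```c
#include <stdint.h>

/* Assumptions for this sketch: 4 KB blocks, 4-byte block pointers. */
#define SKETCH_BLOCK_SIZE     4096u
#define SKETCH_PTRS_PER_BLOCK (SKETCH_BLOCK_SIZE / 4u)   /* 1024 */
#define SKETCH_DIRECT_BLOCKS  12u

/* Returns 0 for a direct pointer, 1/2/3 for single/double/triple
 * indirection, or -1 if the block lies beyond what the scheme can map. */
int indirection_level(uint64_t logical_block)
{
    uint64_t single = SKETCH_PTRS_PER_BLOCK;
    uint64_t dbl    = single * SKETCH_PTRS_PER_BLOCK;
    uint64_t triple = dbl * SKETCH_PTRS_PER_BLOCK;

    if (logical_block < SKETCH_DIRECT_BLOCKS)
        return 0;
    logical_block -= SKETCH_DIRECT_BLOCKS;
    if (logical_block < single)
        return 1;
    logical_block -= single;
    if (logical_block < dbl)
        return 2;
    logical_block -= dbl;
    if (logical_block < triple)
        return 3;
    return -1;
}
```

Under these assumptions the first 48 KB of a file (12 blocks of 4 KB) is reachable through direct pointers alone, and the byte at offset 48 KB is the first one that requires reading an indirect block.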
Extents – Reducing Fragmentation
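Mapping a large file block by block requires many pointer blocks and extra disk seeks. To avoid this, ext4 introduces extents, each of which maps a contiguous range of logical blocks to a contiguous range of physical blocks. Because ee_len is a 16‑bit field, a single extent can describe up to 32,768 blocks, or 128 MB with 4 KB blocks.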
struct ext4_extent_header {
    __le16 eh_magic;      /* Magic number */
    __le16 eh_entries;    /* Number of valid entries */
    __le16 eh_max;        /* Capacity of entries */
    __le16 eh_depth;      /* Tree depth */
    __le32 eh_generation;
};
struct ext4_extent {
    __le32 ee_block;      /* First logical block covered */
    __le16 ee_len;        /* Number of blocks covered */
    __le16 ee_start_hi;   /* High 16 bits of physical block */
    __le32 ee_start_lo;   /* Low 32 bits of physical block */
};
struct ext4_extent_idx {
    __le32 ei_block;      /* Index covers logical blocks from this */
    __le32 ei_leaf_lo;    /* Low 32 bits of physical block of next level */
    __le16 ei_leaf_hi;    /* High 16 bits of physical block */
    __u16  ei_unused;
};
The 60‑byte i_block area of the inode holds one ext4_extent_header followed by up to four 12‑byte entries. As long as a file needs no more than four extents, they are stored directly in the inode and the tree depth (eh_depth) is zero, meaning the node is a leaf. Files with more extents spill into separate tree blocks: the inode then holds ext4_extent_idx entries that point to the next level, and eh_depth increases.
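As a minimal sketch of how a depth‑zero extent list is used, the code below joins the split physical start block and maps a logical block to a physical one. The struct is a host‑order copy of the ext4_extent fields (the on‑disk values are little‑endian, and byte swapping is left out). The function names and the "return 0 for a hole" convention are assumptions made for illustration, and unwritten extents are ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-order copy of the ext4_extent fields shown in the article. */
struct extent {
    uint32_t ee_block;     /* first logical block covered */
    uint16_t ee_len;       /* number of blocks covered */
    uint16_t ee_start_hi;  /* high 16 bits of physical block */
    uint32_t ee_start_lo;  /* low 32 bits of physical block */
};

/* Join the two halves of the 48-bit physical start block. */
uint64_t extent_start(const struct extent *e)
{
    return ((uint64_t)e->ee_start_hi << 32) | e->ee_start_lo;
}

/* Map a logical block through an array of leaf extents.
 * Returns the physical block, or 0 if no extent covers lblk
 * (hypothetical convention for a hole). */
uint64_t map_block(const struct extent *ext, size_t n, uint32_t lblk)
{
    for (size_t i = 0; i < n; i++) {
        if (lblk >= ext[i].ee_block &&
            lblk - ext[i].ee_block < ext[i].ee_len)
            return extent_start(&ext[i]) + (lblk - ext[i].ee_block);
    }
    return 0;
}
```

A linear scan is enough for the four entries that fit in an inode. For deeper trees a binary search over the sorted entries at each level would be the natural choice.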
Inode and Block Bitmaps
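Each block group has one inode bitmap and one block bitmap, and each bitmap occupies a single block (4 KB by default). Every bit records the allocation state of one inode or block (1 = used, 0 = free), so a 4 KB block bitmap covers 32,768 blocks, or 128 MB of data. When a file is created (for example via open(..., O_CREAT)), the kernel searches an inode bitmap for a free entry, and it allocates data blocks from a block bitmap in the same way.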
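The bitmap search can be sketched as a plain first‑fit scan over one 4 KB bitmap block. This is an illustration only: the names are hypothetical, and the kernel's real allocators choose goal positions for locality instead of always taking the first free bit.

```c
#include <stdint.h>

#define BITMAP_BYTES 4096u                /* one 4 KB bitmap block */
#define BITMAP_BITS  (BITMAP_BYTES * 8u)  /* 32768 inodes or blocks */

/* Find the first clear bit, mark it used, and return its index.
 * Returns -1 when every entry is already allocated. */
long bitmap_alloc(uint8_t *bitmap)
{
    for (uint32_t byte = 0; byte < BITMAP_BYTES; byte++) {
        if (bitmap[byte] == 0xFF)
            continue;                     /* all eight entries in use */
        for (uint32_t bit = 0; bit < 8; bit++) {
            if (!(bitmap[byte] & (1u << bit))) {
                bitmap[byte] |= (uint8_t)(1u << bit);
                return (long)(byte * 8 + bit);
            }
        }
    }
    return -1;
}

/* Clear a bit so the entry can be reused. */
void bitmap_free(uint8_t *bitmap, uint32_t idx)
{
    bitmap[idx / 8] &= (uint8_t)~(1u << (idx % 8));
}
```

Starting from an all‑zero bitmap, successive calls to bitmap_alloc return 0, 1, 2, and so on; after bitmap_free(bitmap, 0) the next allocation returns 0 again.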
File System Layout
The superblock (ext4_super_block) stores global values such as the total number of inodes, the total number of blocks, inodes per group, and blocks per group. Each block group has a descriptor (ext4_group_desc) that records the block numbers of its inode bitmap, block bitmap, and inode table.
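These superblock values are what make it possible to find an inode on disk from its number alone. The sketch below shows the arithmetic: inode numbers start at 1, so the block group is (ino - 1) / inodes_per_group and the slot within that group's inode table is the remainder. The struct and function names are hypothetical, and the example figures (8192 inodes per group, 256‑byte inodes) are assumed values.

```c
#include <stdint.h>

struct inode_location {
    uint32_t group;   /* block group that holds the inode */
    uint32_t index;   /* slot within that group's inode table */
    uint64_t offset;  /* byte offset of the slot inside the table */
};

/* Locate an inode from its number; inode numbering starts at 1. */
struct inode_location locate_inode(uint32_t ino,
                                   uint32_t inodes_per_group,
                                   uint16_t inode_size)
{
    struct inode_location loc;

    loc.group  = (ino - 1) / inodes_per_group;
    loc.index  = (ino - 1) % inodes_per_group;
    loc.offset = (uint64_t)loc.index * inode_size;
    return loc;
}
```

With the assumed values, inode 2 (the root directory) sits in group 0 at slot 1, and inode 8193 is the first inode of group 1. The group descriptor for that group then gives the starting block of the inode table to which the offset applies.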
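To avoid a single point of failure, backup copies of the superblock and group descriptor table are kept in other block groups: in every group in the original layout, or only in groups 0, 1 and powers of 3, 5 and 7 when the sparse_super feature is enabled. ext4 further reduces metadata overhead with Meta Block Groups (the meta_bg feature), where block groups are clustered into meta‑groups (64 groups per meta‑group with 4 KB blocks and 64‑byte descriptors) and each meta‑group stores only its own descriptors.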
Directory Storage Format
Directories are regular files whose data blocks contain ext4_dir_entry records. The first two entries are “.” (current directory) and “..” (parent directory). When the EXT4_INDEX_FL flag is set, the directory uses a hash‑based index tree to speed up lookups.
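A linear lookup over one directory block can be sketched as follows. It assumes the file‑type variant of the record (ext4_dir_entry_2: a 32‑bit inode number, 16‑bit rec_len, 8‑bit name_len, 8‑bit file type, then the name), decodes the little‑endian fields by hand, and ignores the hash index entirely. dir_put is a hypothetical helper that exists only to build sample blocks.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t rd32(const uint8_t *p)
{
    return p[0] | (p[1] << 8) | (p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Walk the records in one directory block, hopping by rec_len.
 * Returns the inode number for `name`, or 0 if it is not present. */
uint32_t dir_lookup(const uint8_t *block, size_t block_size, const char *name)
{
    size_t want = strlen(name);
    size_t off = 0;

    while (off + 8 <= block_size) {
        uint32_t ino      = rd32(block + off);
        uint16_t rec_len  = rd16(block + off + 4);
        uint8_t  name_len = block[off + 6];

        if (rec_len < 8 || off + rec_len > block_size)
            break;                          /* corrupt record: stop */
        if (ino != 0 && name_len == want && 8u + name_len <= rec_len &&
            memcmp(block + off + 8, name, want) == 0)
            return ino;
        off += rec_len;                     /* jump to the next record */
    }
    return 0;
}

/* Hypothetical helper: write one record at `off`, return the next offset. */
size_t dir_put(uint8_t *block, size_t off, uint32_t ino,
               uint16_t rec_len, const char *name)
{
    size_t name_len = strlen(name);
    uint8_t *p = block + off;

    p[0] = (uint8_t)ino;         p[1] = (uint8_t)(ino >> 8);
    p[2] = (uint8_t)(ino >> 16); p[3] = (uint8_t)(ino >> 24);
    p[4] = (uint8_t)rec_len;     p[5] = (uint8_t)(rec_len >> 8);
    p[6] = (uint8_t)name_len;
    p[7] = 0;                               /* file type: unknown */
    memcpy(p + 8, name, name_len);
    return off + rec_len;
}
```

Because every record carries its own rec_len, the last record in a block can simply claim all remaining space. The cost of this scan grows with the number of entries, which is the motivation for the hash‑based index.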
Linux File Caching Layer (ext4)
ext4 defines ext4_file_operations, whose read_iter and write_iter members point to ext4_file_read_iter and ext4_file_write_iter. These wrappers call the generic kernel helpers generic_file_read_iter and __generic_file_write_iter.
const struct file_operations ext4_file_operations = {
    .read_iter  = ext4_file_read_iter,
    .write_iter = ext4_file_write_iter,
    /* ... */
};
Two I/O paths exist:
Cached I/O: Reads are served from the page cache, and writes land in the page cache first. A write is considered complete once the data reaches the cache; the kernel flushes the dirty pages to disk later.
Direct I/O: Applications bypass the page cache and read or write the underlying storage directly, which avoids the extra copy through the cache.
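Because a cached write returns before the data is on disk, an application that needs durability has to ask for the flush itself. The sketch below shows that pattern from user space with the standard open, write, fsync, and close calls. The function name is hypothetical, and error handling is reduced to a single -1 return.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write a buffer through the page cache, then force it to storage.
 * Returns 0 on success and -1 on any error. */
int write_and_sync(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;
    while (len > 0) {
        /* write() returns as soon as the data sits in the page cache. */
        ssize_t n = write(fd, p, len);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    /* fsync() blocks until the file's dirty pages have been written back. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Without the fsync call the function would usually return sooner, but a crash shortly afterwards could lose data that the application already considers written.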
Cached Write Path
The function generic_perform_write performs the following steps for each page:
Call address_space->write_begin to prepare the page.
Copy data from user space to the page with iov_iter_copy_from_user_atomic.
Call address_space->write_end to finish the write.
Invoke balance_dirty_pages_ratelimited to decide whether to start write‑back of dirty pages.
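ssize_t generic_perform_write(struct file *file,
                              struct iov_iter *i, loff_t pos)
{
    struct address_space *mapping = file->f_mapping;
    const struct address_space_operations *a_ops = mapping->a_ops;
    long status = 0;
    ssize_t written = 0;
    unsigned int flags = 0;
    do {
        struct page *page;
        unsigned long offset, bytes;  /* position and length within this page */
        size_t copied;                /* bytes actually copied from user space */
        void *fsdata;
        offset = pos & (PAGE_SIZE - 1);
        bytes = min_t(unsigned long, PAGE_SIZE - offset, iov_iter_count(i));
        /* ... fault-in and short-copy retry logic omitted ... */
        status = a_ops->write_begin(file, mapping, pos, bytes, flags,
                                    &page, &fsdata);
        if (unlikely(status < 0))
            break;
        copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
        flush_dcache_page(page);
        status = a_ops->write_end(file, mapping, pos, bytes, copied,
                                  page, fsdata);
        if (unlikely(status < 0))
            break;
        copied = status;
        iov_iter_advance(i, copied);
        pos += copied;
        written += copied;
        balance_dirty_pages_ratelimited(mapping);
    } while (iov_iter_count(i));
    return written ? written : status;
}
During write_begin, ext4 starts a journal transaction. In journal mode both data and metadata go through the journal. In ordered mode (the default) only metadata is journaled, and the data blocks are written to disk before the metadata that references them is committed.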
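Since generic_perform_write handles one page per loop iteration, a write that crosses page boundaries is split into several pieces. The user‑space sketch below reproduces only that arithmetic, assuming 4 KB pages; the struct, the function, and the SKETCH_PAGE_SIZE macro are hypothetical names and not kernel code.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */

struct chunk {
    uint64_t page_index;  /* which page-cache page is touched */
    uint32_t offset;      /* starting offset inside that page */
    uint32_t bytes;       /* number of bytes placed in that page */
};

/* Split a write of `len` bytes at file position `pos` into per-page
 * chunks. Stores at most `max` chunks in `out` and returns how many
 * chunks the write needs in total. */
size_t split_write(uint64_t pos, size_t len, struct chunk *out, size_t max)
{
    size_t n = 0;

    while (len > 0) {
        uint32_t offset = (uint32_t)(pos & (SKETCH_PAGE_SIZE - 1));
        size_t bytes = SKETCH_PAGE_SIZE - offset;

        if (bytes > len)
            bytes = len;            /* last chunk may be partial */
        if (n < max) {
            out[n].page_index = pos / SKETCH_PAGE_SIZE;
            out[n].offset = offset;
            out[n].bytes = (uint32_t)bytes;
        }
        n++;
        pos += bytes;
        len -= bytes;
    }
    return n;
}
```

For example, a 5000‑byte write at position 4000 touches three pages: 96 bytes at the end of page 0, all 4096 bytes of page 1, and the first 808 bytes of page 2. Each of those pieces corresponds to one write_begin/write_end pair in the kernel loop.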
Cached Read Path
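Reading uses generic_file_buffered_read, which first looks for the page in the cache. On a miss it triggers synchronous readahead and looks the page up again. If the page it finds carries the readahead marker, it also starts asynchronous readahead for the pages that follow. Finally it copies the page contents to user space with copy_page_to_iter.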
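static ssize_t generic_file_buffered_read(struct kiocb *iocb,
                                          struct iov_iter *iter,
                                          ssize_t written)
{
    struct file *filp = iocb->ki_filp;
    struct address_space *mapping = filp->f_mapping;
    struct file_ra_state *ra = &filp->f_ra;      /* per-file readahead state */
    pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;  /* first page to read */
    /* ... last_index, offset, nr and ret are set up here ... */
    for (;;) {
        struct page *page = find_get_page(mapping, index);
        if (!page) {
            if (iocb->ki_flags & IOCB_NOWAIT)
                goto would_block;
            /* Cache miss: read this page and its neighbours synchronously. */
            page_cache_sync_readahead(mapping, ra, filp, index, last_index - index);
            page = find_get_page(mapping, index);
            if (!page)
                goto no_cached_page;
        }
        /* Readahead marker hit: start the next batch in the background. */
        if (PageReadahead(page))
            page_cache_async_readahead(mapping, ra, filp, page, index, last_index - index);
        /* ... wait until the page is up to date, compute offset and nr ... */
        ret = copy_page_to_iter(page, offset, nr, iter);
        /* ... advance index and offset; leave the loop once iter is drained ... */
    }
    /* ... would_block and no_cached_page handling omitted ... */
}
When the number of dirty pages exceeds a threshold, balance_dirty_pages_ratelimited triggers background write‑back and may throttle the writing process. Explicit sync calls and memory pressure also cause dirty pages to be flushed.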
void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
    struct inode *inode = mapping->host;
    struct backing_dev_info *bdi = inode_to_bdi(inode);
    struct bdi_writeback *wb = NULL;
    int ratelimit;
    /* ... wb is resolved and ratelimit is derived from current->nr_dirtied_pause ... */
    /* Only do the expensive check once this task has dirtied enough pages. */
    if (unlikely(current->nr_dirtied >= ratelimit))
        balance_dirty_pages(mapping, wb, current->nr_dirtied);
}
In summary, ext4 combines a rich on‑disk layout (inodes, extents, meta block groups) with kernel mechanisms for cached and direct I/O, journaling, and rate‑limited write‑back to provide reliable and performant file storage on Linux.
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
