Understanding Linux ext4: Inodes, Extents, and File Caching Explained
This article explains the core design of Linux file systems, covering strict organization, block allocation, inode structures, ext4 extent trees, block groups, superblock metadata, directory storage formats, and the kernel's cached and direct I/O paths for reading and writing files.
Linux File System Characteristics
Files are organized in strict block‑based structures to enable efficient storage and retrieval.
An index area allows quick location of all blocks belonging to a file.
Hot files that are frequently read or written are served through a cache layer.
Files are grouped into directories for easier management and lookup.
The kernel maintains in‑memory data structures tracking which processes have opened which files.
Overall, the main functions of a file system are summarized in the diagram below.
ext Series File System Layout
Inode and Block Storage
Disks are divided into equal‑sized units called blocks (default 4 KB). Files are stored in these blocks, allowing non‑contiguous allocation for flexibility.
Each file and directory has an inode that stores metadata and pointers to its data blocks.
The inode structure in ext4 is defined as:
struct ext4_inode {
    __le16 i_mode;        /* File mode */
    __le16 i_uid;         /* Low 16 bits of Owner Uid */
    __le32 i_size_lo;     /* Size in bytes */
    __le32 i_atime;       /* Access time */
    __le32 i_ctime;       /* Inode change time */
    __le32 i_mtime;       /* Modification time */
    __le32 i_dtime;       /* Deletion time */
    __le16 i_gid;         /* Low 16 bits of Group Id */
    __le16 i_links_count; /* Links count */
    __le32 i_blocks_lo;   /* Blocks count */
    __le32 i_flags;       /* File flags */
    ...
    __le32 i_block[EXT4_N_BLOCKS]; /* Pointers to blocks */
    __le32 i_generation;  /* File version (for NFS) */
    __le32 i_file_acl_lo; /* File ACL */
    __le32 i_size_high;
    ...
};
The inode records permissions (i_mode), owner (i_uid), group (i_gid), size (i_size_lo and i_size_high), timestamps, and an array of block pointers (i_block). The number of block pointers is defined by:
#define EXT4_NDIR_BLOCKS 12
#define EXT4_IND_BLOCK EXT4_NDIR_BLOCKS
#define EXT4_DIND_BLOCK (EXT4_IND_BLOCK + 1)
#define EXT4_TIND_BLOCK (EXT4_DIND_BLOCK + 1)
#define EXT4_N_BLOCKS (EXT4_TIND_BLOCK + 1)
In ext2/3 the first 12 entries of i_block hold direct block addresses. For larger files, i_block[12] points to an indirect block, i_block[13] to a doubly‑indirect block, and i_block[14] to a triply‑indirect block, forming a tree of block references.
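As a rough illustration of the ext2/3 scheme just described, the sketch below computes how many levels of indirection are needed to reach a given logical block of a file. The function name indirection_depth and the constant NDIR_BLOCKS are made up for this example (the constant mirrors EXT4_NDIR_BLOCKS), and the code assumes 32‑bit block pointers, so a 4 KB block holds 1024 of them.

```c
#include <assert.h>
#include <stdint.h>

#define NDIR_BLOCKS 12 /* hypothetical name; same value as EXT4_NDIR_BLOCKS */

/*
 * Return how many indirect blocks must be read to reach logical block
 * `lblk`: 0 = direct pointer, 1 = indirect, 2 = doubly-indirect,
 * 3 = triply-indirect, -1 = not addressable by this scheme.
 */
int indirection_depth(uint64_t lblk, uint32_t block_size)
{
    /* Number of 32-bit block pointers that fit in one block. */
    uint64_t ptrs = block_size / sizeof(uint32_t);

    assert(ptrs > 0);
    if (lblk < NDIR_BLOCKS)
        return 0;
    lblk -= NDIR_BLOCKS;
    if (lblk < ptrs)
        return 1;
    lblk -= ptrs;
    if (lblk < ptrs * ptrs)
        return 2;
    lblk -= ptrs * ptrs;
    if (lblk < ptrs * ptrs * ptrs)
        return 3;
    return -1;
}
```

With 4 KB blocks, logical blocks 0 to 11 are direct, block 12 is the first one reached through the indirect block, and block 1036 (12 + 1024) is the first one that needs the doubly‑indirect block. Each extra level is an extra block read, which is the cost that extents are meant to reduce.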
To avoid many disk seeks for large files, ext4 introduces extents, which map a range of contiguous blocks with a single descriptor, dramatically improving performance and reducing fragmentation.
Each extent node is described by ext4_extent_header:
struct ext4_extent_header {
    __le16 eh_magic;      /* Magic number */
    __le16 eh_entries;    /* Number of valid entries */
    __le16 eh_max;        /* Maximum entries */
    __le16 eh_depth;      /* Tree depth */
    __le32 eh_generation; /* Generation of the tree */
};
Leaf entries (ext4_extent) point directly to physical blocks, while index entries (ext4_extent_idx) point to lower‑level nodes:
struct ext4_extent {
    __le32 ee_block;    /* First logical block covered */
    __le16 ee_len;      /* Number of blocks covered */
    __le16 ee_start_hi; /* High 16 bits of physical block */
    __le32 ee_start_lo; /* Low 32 bits of physical block */
};
struct ext4_extent_idx {
    __le32 ei_block;    /* Logical block covered by this index */
    __le32 ei_leaf_lo;  /* Low 32 bits of leaf block address */
    __le16 ei_leaf_hi;  /* High 16 bits of leaf block address */
    __u16 ei_unused;
};
If a file is small enough, the inode itself can hold an ext4_extent_header and up to four extents (tree depth 0) in the 60 bytes of i_block. Larger files cause the extent tree to grow, with the deeper nodes stored in separate 4 KB blocks. Each such block holds a 12‑byte header plus up to 340 entries of 12 bytes each, and each extent covers up to 32,768 blocks (128 MB with 4 KB blocks), so a single leaf block can describe up to about 42.5 GB of file data.
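To make the mapping concrete, here is a minimal sketch of how a logical block number could be resolved within one leaf node, assuming the extents are sorted by their first logical block. The names toy_extent, toy_extent_lookup, and toy_entries_per_block are hypothetical, and the struct is a simplified host‑endian stand‑in for ext4_extent with the two halves of the physical address already combined; the real kernel code is more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, host-endian stand-in for struct ext4_extent (hypothetical). */
struct toy_extent {
    uint32_t first_lblk; /* corresponds to ee_block */
    uint16_t len;        /* corresponds to ee_len */
    uint64_t start_pblk; /* ee_start_hi and ee_start_lo combined */
};

/*
 * Binary-search `n` extents sorted by first_lblk for logical block `lblk`.
 * On success store the physical block in *pblk and return 1.
 * Return 0 if the block falls in a hole (no extent covers it).
 */
int toy_extent_lookup(const struct toy_extent *ex, size_t n,
                      uint32_t lblk, uint64_t *pblk)
{
    size_t lo = 0, hi = n;

    assert(pblk != NULL);
    /* Find the number of extents whose first_lblk is <= lblk. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ex[mid].first_lblk <= lblk)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return 0; /* lblk lies before the first extent */

    const struct toy_extent *e = &ex[lo - 1];
    if (lblk - e->first_lblk >= e->len)
        return 0; /* lblk lies past the end of the candidate extent */
    *pblk = e->start_pblk + (lblk - e->first_lblk);
    return 1;
}

/* Entries per tree block: a 12-byte header followed by 12-byte entries. */
unsigned toy_entries_per_block(unsigned block_size)
{
    return (block_size - 12) / 12;
}
```

For a 4096‑byte block, toy_entries_per_block returns 340, matching the figure quoted above. One lookup costs a handful of comparisons regardless of how long each extent is, whereas the indirect scheme needs one pointer per block.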
Inode and Block Bitmaps
The inode bitmap and the block bitmap each occupy one block (4 KB by default), where each bit indicates allocation (1 = used, 0 = free). When creating a new file (e.g., via open(..., O_CREAT)), the kernel reads the inode bitmap to find a free inode and later uses the block bitmap to allocate data blocks.
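The bitmap bookkeeping can be sketched in a few lines. This is only an illustration: the names toy_bitmap_alloc and toy_bitmap_free are invented here, the code assumes bit i lives in byte i / 8 at bit position i % 8, and it always takes the first free bit, whereas the real allocator applies placement heuristics before settling on an inode or block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Find the first clear bit among the first `nbits` bits, set it, and
 * return its index. Return -1 if every bit is already set.
 */
long toy_bitmap_alloc(uint8_t *bitmap, size_t nbits)
{
    assert(bitmap != NULL);
    for (size_t i = 0; i < nbits; i++) {
        uint8_t mask = (uint8_t)(1u << (i % 8));
        if ((bitmap[i / 8] & mask) == 0) {
            bitmap[i / 8] |= mask; /* mark as used */
            return (long)i;
        }
    }
    return -1;
}

/* Clear bit `bit`, marking the corresponding inode or block as free. */
void toy_bitmap_free(uint8_t *bitmap, size_t bit)
{
    assert(bitmap != NULL);
    bitmap[bit / 8] &= (uint8_t)~(1u << (bit % 8));
}
```

A 4 KB bitmap contains 32,768 bits, so one block bitmap can track 32,768 blocks, which is 128 MB of data with 4 KB blocks.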
Filesystem Metadata Structures
Data block bitmaps, inode tables, and group descriptors are organized into block groups. Each block group has its own descriptor (ext4_group_desc) containing pointers to the inode bitmap, block bitmap, and inode table. A superblock (ext4_super_block) holds global counts such as total inodes, total blocks, inodes per group, and blocks per group.
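Given the inodes‑per‑group count from the superblock, locating an inode is simple arithmetic. The sketch below assumes inode numbers start at 1 and that inodes are stored back to back in each group's inode table; the names toy_inode_loc and toy_locate_inode are hypothetical, and the inode_size parameter (for example 256 bytes) is an assumption not stated in the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Where an inode lives (hypothetical helper type for this example). */
struct toy_inode_loc {
    uint32_t group;       /* block group that holds the inode */
    uint32_t index;       /* position within that group's inode table */
    uint64_t byte_offset; /* offset from the start of the inode table */
};

/* Map an inode number to its block group and its slot in the table. */
struct toy_inode_loc toy_locate_inode(uint32_t ino,
                                      uint32_t inodes_per_group,
                                      uint32_t inode_size)
{
    struct toy_inode_loc loc;

    assert(ino >= 1 && inodes_per_group > 0);
    loc.group = (ino - 1) / inodes_per_group;
    loc.index = (ino - 1) % inodes_per_group;
    loc.byte_offset = (uint64_t)loc.index * inode_size;
    return loc;
}
```

For instance, with 8192 inodes per group and 256‑byte inodes, inode 2 sits in group 0 at index 1 (byte offset 256), while inode 8193 is the first inode of group 1. The group descriptor for that group then supplies the block address of the inode table itself.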
To protect against loss, the superblock and group descriptor tables are replicated in each block group. However, storing full copies in every group wastes space, so the Meta Block Group feature clusters block groups into meta‑groups (e.g., 64 groups per meta‑group) and keeps only that meta‑group's descriptors inside it, reducing overhead while still providing redundancy.
Directory Storage Format
Directories are regular files with their own inode. Their data blocks contain ext4_dir_entry records, each storing a child file name and its inode number. The first two entries are "." (current directory) and ".." (parent directory).
If the inode has the EXT4_INDEX_FL flag, the directory uses a hashed index tree, allowing fast lookup by name.
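Without the hashed index, a lookup has to walk the entries of each directory block in order. The following sketch shows such a linear scan over one block, assuming the common on‑disk entry layout of a 4‑byte little‑endian inode number, a 2‑byte record length, a 1‑byte name length, a 1‑byte file type, and then the name. The function names are hypothetical, toy_dirent_put exists only to build sample data, and special cases such as very large block sizes are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint16_t get_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | p[1] << 8);
}

/* Write one entry at `off`; returns the offset of the next entry. */
size_t toy_dirent_put(uint8_t *blk, size_t off, uint32_t ino,
                      uint16_t rec_len, uint8_t type, const char *name)
{
    size_t len = strlen(name);

    assert(len <= 255 && 8 + len <= rec_len);
    blk[off + 0] = (uint8_t)ino;
    blk[off + 1] = (uint8_t)(ino >> 8);
    blk[off + 2] = (uint8_t)(ino >> 16);
    blk[off + 3] = (uint8_t)(ino >> 24);
    blk[off + 4] = (uint8_t)rec_len;
    blk[off + 5] = (uint8_t)(rec_len >> 8);
    blk[off + 6] = (uint8_t)len;
    blk[off + 7] = type;
    memcpy(blk + off + 8, name, len);
    return off + rec_len;
}

/* Scan one directory block; return the inode number, or 0 if not found. */
uint32_t toy_dir_block_lookup(const uint8_t *blk, size_t blk_size,
                              const char *name)
{
    size_t want = strlen(name);
    size_t off = 0;

    assert(blk != NULL);
    while (off + 8 <= blk_size) {
        uint32_t ino = get_le32(blk + off);
        uint16_t rec_len = get_le16(blk + off + 4);
        uint8_t name_len = blk[off + 6];

        /* Stop on a record that cannot be valid. */
        if (rec_len < 8 || off + rec_len > blk_size ||
            (size_t)8 + name_len > rec_len)
            break;
        /* Inode 0 marks an unused entry. */
        if (ino != 0 && name_len == want &&
            memcmp(blk + off + 8, name, want) == 0)
            return ino;
        off += rec_len; /* rec_len chains to the next entry */
    }
    return 0;
}
```

The cost grows with the number of entries, which is why large directories benefit from the hashed index selected by EXT4_INDEX_FL.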
Linux File Caching Layer (ext4)
ext4 File Operations
const struct file_operations ext4_file_operations = {
    ...
    .read_iter  = ext4_file_read_iter,
    .write_iter = ext4_file_write_iter,
    ...
};
ext4_file_read_iter calls generic_file_read_iter, and ext4_file_write_iter calls __generic_file_write_iter. These functions decide whether to use cached I/O or direct I/O based on the IOCB_DIRECT flag.
Cached Write Path
ssize_t generic_perform_write(struct file *file,
                              struct iov_iter *i, loff_t pos)
{
    struct address_space *mapping = file->f_mapping;
    const struct address_space_operations *a_ops = mapping->a_ops;
    long status = 0;
    ssize_t written = 0;
    unsigned int flags = 0;

    do {
        struct page *page;
        unsigned long offset; /* Offset into pagecache page */
        unsigned long bytes;  /* Bytes to write to page */
        size_t copied;        /* Bytes copied from user */
        void *fsdata;

        offset = (pos & (PAGE_SIZE - 1));
        bytes = min_t(unsigned long, PAGE_SIZE - offset,
                      iov_iter_count(i));
        /* Error handling, iov_iter_advance() and retry logic omitted. */
        status = a_ops->write_begin(file, mapping, pos, bytes, flags,
                                    &page, &fsdata);
        copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
        flush_dcache_page(page);
        status = a_ops->write_end(file, mapping, pos, bytes, copied,
                                  page, fsdata);
        pos += copied;
        written += copied;
        balance_dirty_pages_ratelimited(mapping);
    } while (iov_iter_count(i));

    return written ? written : status;
}
The steps are:
Prepare the page via write_begin (including journaling work).
Copy data from user space to the kernel page with iov_iter_copy_from_user_atomic.
Finalize the write with write_end, marking the page dirty.
Trigger write‑back if too many dirty pages exist via balance_dirty_pages_ratelimited.
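The steps above can be modelled in user space to show how one write is split along page boundaries. This is a toy under stated assumptions, not kernel code: toy_cache and toy_buffered_write are invented names, the "page cache" is a fixed array, memcpy stands in for the copy from user space, and a per‑page flag stands in for marking the page dirty. Journaling and write‑back throttling are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u

/* A fixed-size stand-in for a file's page cache (hypothetical). */
struct toy_cache {
    uint8_t *pages;  /* npages * TOY_PAGE_SIZE bytes of page data */
    uint8_t *dirty;  /* one flag per page */
    size_t   npages;
};

/* Copy `count` bytes into the cache at file offset `pos`, page by page. */
size_t toy_buffered_write(struct toy_cache *c, const void *buf,
                          size_t count, uint64_t pos)
{
    const uint8_t *src = buf;
    size_t written = 0;

    assert(c != NULL && buf != NULL);
    while (count > 0) {
        size_t index  = (size_t)(pos / TOY_PAGE_SIZE); /* which page */
        size_t offset = (size_t)(pos % TOY_PAGE_SIZE); /* offset in page */
        size_t bytes  = TOY_PAGE_SIZE - offset;        /* room left in page */

        if (bytes > count)
            bytes = count;
        if (index >= c->npages)
            break; /* the toy cache cannot grow */

        /* Steps 1 and 2: "prepare" the page and copy the data into it. */
        memcpy(c->pages + index * TOY_PAGE_SIZE + offset, src, bytes);
        /* Step 3: mark the page dirty so write-back will pick it up. */
        c->dirty[index] = 1;

        src += bytes;
        pos += bytes;
        written += bytes;
        count -= bytes;
    }
    return written;
}
```

Writing 5000 bytes at offset 4000, for example, touches three pages: 96 bytes at the end of page 0, all 4096 bytes of page 1, and the first 808 bytes of page 2. None of this reaches the disk until write‑back runs.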
Journaling modes:
Journal: logs both metadata and data (safest, slowest).
Ordered (default): logs only metadata, ensuring data is on disk before metadata is committed.
Writeback: logs only metadata, without ordering guarantees (fastest, least safe).
Cached Read Path
static ssize_t generic_file_buffered_read(struct kiocb *iocb,
                                          struct iov_iter *iter,
                                          ssize_t written)
{
    struct file *filp = iocb->ki_filp;
    struct address_space *mapping = filp->f_mapping;
    struct inode *inode = mapping->host;

    for (;;) {
        struct page *page;
        pgoff_t index = ...;

        page = find_get_page(mapping, index);
        if (!page) {
            if (iocb->ki_flags & IOCB_NOWAIT)
                goto would_block;
            page_cache_sync_readahead(mapping, ...);
            page = find_get_page(mapping, index);
            if (unlikely(page == NULL))
                goto no_cached_page;
        }
        if (PageReadahead(page))
            page_cache_async_readahead(mapping, ...);
        ret = copy_page_to_iter(page, offset, nr, iter);
    }
}
The function first looks for a cached page; if missing, it performs synchronous readahead, then possibly asynchronous readahead, and finally copies the page data to user space.
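A small user‑space model can show why readahead pays off: a miss loads several pages at once, so later reads hit the cache. Everything here is an assumption made for illustration. The names toy_cache, toy_readahead, and toy_buffered_read are invented, the readahead window is fixed at four pages (the kernel sizes it adaptively), a byte array plays the role of the disk, and asynchronous readahead is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_NPAGES    8u
#define TOY_RA_PAGES  4u /* hypothetical fixed readahead window */

struct toy_cache {
    uint8_t  data[TOY_NPAGES][TOY_PAGE_SIZE];
    uint8_t  present[TOY_NPAGES]; /* 1 if the page is cached */
    unsigned fills;               /* number of "disk" read batches */
};

/* Stand-in for synchronous readahead: load a window of pages at once. */
static void toy_readahead(struct toy_cache *c, const uint8_t *disk,
                          size_t index)
{
    for (size_t i = index; i < index + TOY_RA_PAGES && i < TOY_NPAGES; i++) {
        memcpy(c->data[i], disk + i * TOY_PAGE_SIZE, TOY_PAGE_SIZE);
        c->present[i] = 1;
    }
    c->fills++;
}

/* Read `count` bytes at `pos` through the cache; returns bytes copied. */
size_t toy_buffered_read(struct toy_cache *c, const uint8_t *disk,
                         void *buf, size_t count, size_t pos)
{
    uint8_t *dst = buf;
    size_t done = 0;

    assert(c != NULL && disk != NULL && buf != NULL);
    while (count > 0 && pos < TOY_NPAGES * TOY_PAGE_SIZE) {
        size_t index  = pos / TOY_PAGE_SIZE;
        size_t offset = pos % TOY_PAGE_SIZE;
        size_t nr     = TOY_PAGE_SIZE - offset;

        if (nr > count)
            nr = count;
        if (!c->present[index])            /* cache miss */
            toy_readahead(c, disk, index); /* fill this page and its neighbours */
        memcpy(dst + done, c->data[index] + offset, nr);
        done += nr;
        pos += nr;
        count -= nr;
    }
    return done;
}
```

In this model, reading 100 bytes at offset 0 fills pages 0 to 3 in one batch, so a following read that spans pages 2 and 3 is served entirely from memory and the fills counter stays at 1.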
Write‑back of dirty pages occurs when the kernel decides (e.g., via balance_dirty_pages_ratelimited), when the user explicitly calls sync, when memory pressure forces reclamation, or when a dirty page has been dirty for too long.
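From user space, the explicit trigger is typically the sync family of system calls. As a minimal sketch, the helper below writes a buffer and then calls fsync(2) so that the file's dirty pages are flushed before it returns. The name write_durably is made up for this example, and the error handling is deliberately simple.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write `len` bytes to `path` and force them to stable storage.
 * Returns 0 on success and -1 on failure.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        /* write() only copies into the page cache on the buffered path. */
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted before writing anything: retry */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Ask the kernel to write this file's dirty pages back now. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Without the fsync call the data would still reach the disk eventually through the other triggers listed above, but the caller would have no guarantee about when.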
Author: luozhiyun Source: https://www.cnblogs.com/luozhiyun/p/13061199.html
