Understanding Linux ext4 File System: Inodes, Extents, and Caching Mechanisms
This article explains the core design of Linux file systems, covering strict organization, block and inode structures, ext4 formatting details such as extents and meta block groups, directory storage, journaling modes, and the kernel's cached and direct I/O paths for reading and writing files.
Linux File System Characteristics
Files must follow a strict on-disk organization and are stored in units of blocks.
An index area is required to locate the blocks belonging to a file.
Hot files that are frequently read or written should benefit from a cache layer.
Files are arranged in directories for easy management and lookup.
The Linux kernel maintains in‑memory data structures tracking which files are opened by which processes.
Ext4 File System Layout
Disk space is divided into equal‑sized units called blocks (default 4 KB, a multiple of the sector size). When formatting, the block size can be chosen.
Files are stored as a collection of blocks, allowing non‑contiguous allocation for flexibility.
Each file and directory has an inode; a directory itself is a special file with its own inode.
struct ext4_inode {
    __le16 i_mode;          /* File mode */
    __le16 i_uid;           /* Low 16 bits of Owner UID */
    __le32 i_size_lo;       /* Size in bytes */
    __le32 i_atime;         /* Access time */
    __le32 i_ctime;         /* Inode change time */
    __le32 i_mtime;         /* Modification time */
    __le32 i_dtime;         /* Deletion time */
    __le16 i_gid;           /* Low 16 bits of Group ID */
    __le16 i_links_count;   /* Links count */
    __le32 i_blocks_lo;     /* Blocks count */
    __le32 i_flags;         /* File flags */
    __le32 i_block[EXT4_N_BLOCKS]; /* Pointers to blocks */
    __le32 i_generation;    /* File version (for NFS) */
    __le32 i_file_acl_lo;   /* File ACL */
    __le32 i_size_high;
    ...
};
The inode stores permissions (i_mode), owner UID/GID, size, timestamps, and an array i_block that points to data blocks.
#define EXT4_NDIR_BLOCKS 12
#define EXT4_IND_BLOCK   EXT4_NDIR_BLOCKS
#define EXT4_DIND_BLOCK  (EXT4_IND_BLOCK + 1)
#define EXT4_TIND_BLOCK  (EXT4_DIND_BLOCK + 1)
#define EXT4_N_BLOCKS    (EXT4_TIND_BLOCK + 1)
In ext2/ext3 the first 12 entries of i_block (indices 0 to 11) hold direct block addresses. Entry 12 points to a single-indirect block, entry 13 to a double-indirect block, and entry 14 to a triple-indirect block, which lets the same 15-entry array address very large files.
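To get a feel for how far the indirect scheme reaches, the following sketch counts the data blocks that the 15-entry array can address. It assumes 4-byte block pointers, so an indirect block holds block_size / 4 of them; the function name max_mapped_blocks is made up for this illustration. The result is only the addressing limit of the pointer scheme, and other fields in the on-disk format may cap the real maximum file size lower.

```c
#include <assert.h>
#include <stdint.h>

#define NDIR_BLOCKS 12  /* direct pointers, mirroring EXT4_NDIR_BLOCKS */

/*
 * Number of data blocks addressable through 12 direct pointers plus one
 * single-, one double-, and one triple-indirect pointer.
 * Assumption: every block pointer is 4 bytes wide.
 */
uint64_t max_mapped_blocks(uint32_t block_size)
{
    assert(block_size % sizeof(uint32_t) == 0);

    uint64_t ptrs = block_size / sizeof(uint32_t); /* pointers per indirect block */

    return NDIR_BLOCKS          /* direct blocks */
         + ptrs                 /* via the single-indirect block */
         + ptrs * ptrs          /* via the double-indirect block */
         + ptrs * ptrs * ptrs;  /* via the triple-indirect block */
}
```

With 4 KB blocks each indirect block holds 1024 pointers, so the triple-indirect level alone contributes 1024^3 blocks. Reaching one of those blocks costs up to three extra block reads, which is the overhead extents are meant to remove.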
To improve large-file performance, ext4 replaces the indirect-block chain with extents, a tree-like structure in which each entry records a run of contiguous blocks.
struct ext4_extent_header {
    __le16 eh_magic;      /* Magic number */
    __le16 eh_entries;    /* Number of valid entries */
    __le16 eh_max;        /* Capacity of entries */
    __le16 eh_depth;      /* 0 = leaf node, >0 = internal node */
    __le32 eh_generation; /* Generation of the tree */
};

struct ext4_extent {
    __le32 ee_block;      /* First logical block covered */
    __le16 ee_len;        /* Number of blocks covered */
    __le16 ee_start_hi;   /* High 16 bits of physical block */
    __le32 ee_start_lo;   /* Low 32 bits of physical block */
};

struct ext4_extent_idx {
    __le32 ei_block;      /* First logical block covered by this index */
    __le32 ei_leaf_lo;    /* Low 32 bits of leaf block address */
    __le16 ei_leaf_hi;    /* High 16 bits of leaf block address */
    __u16  ei_unused;
};
If the file is small enough, the whole extent tree fits inside the inode: the 60-byte i_block area holds one 12-byte ext4_extent_header followed by up to four 12-byte ext4_extent entries (eh_depth = 0). Larger files cause the tree to grow: nodes with eh_depth > 0 are internal nodes whose ext4_extent_idx entries point to child nodes, and the extents themselves live in the leaf nodes.
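As a rough illustration of how a leaf node is used, the sketch below maps a logical block number to a physical block by scanning an array of extents. It is a simplified user-space model, not kernel code: struct extent is a host-endian stand-in for ext4_extent, the names extent_pblock and map_block are hypothetical, a linear scan stands in for the search the kernel performs, and special ee_len encodings are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-endian stand-in for struct ext4_extent (little-endian on disk). */
struct extent {
    uint32_t ee_block;    /* first logical block covered */
    uint16_t ee_len;      /* number of blocks covered */
    uint16_t ee_start_hi; /* high 16 bits of physical block */
    uint32_t ee_start_lo; /* low 32 bits of physical block */
};

/* Combine the split start fields into one 48-bit physical block number. */
uint64_t extent_pblock(const struct extent *ex)
{
    return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start_lo;
}

/*
 * Map logical block `lblock` to a physical block using the `n` extents
 * in `ext`. Returns 0 when no extent covers the block (a hole).
 */
uint64_t map_block(const struct extent *ext, size_t n, uint32_t lblock)
{
    assert(ext != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        uint32_t first = ext[i].ee_block;

        if (lblock >= first && lblock - first < ext[i].ee_len)
            return extent_pblock(&ext[i]) + (lblock - first);
    }
    return 0;
}
```

One 12-byte entry describes an entire contiguous run, so a file laid out in a few large runs needs only a handful of entries, where the indirect scheme would need one pointer per block.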
Bitmap Management
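Inode and block allocation are each tracked by a bitmap that occupies one block per block group: each bit represents one inode or one data block, and a set bit means the entry is in use. With 4 KB blocks, one block bitmap has 32,768 bits, so it can track up to 128 MB of data blocks.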
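The bitmap search described above can be sketched in a few lines. This is a user-space model under simple assumptions: bit i lives in byte i / 8 at bit position i % 8, and the helper names bitmap_alloc and bits_per_bitmap_block are invented for the example. The kernel uses optimized find-first-zero primitives and allocation heuristics that are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Find the first clear bit among the first `nbits` bits, mark it as used,
 * and return its index. Returns -1 when every bit is already set.
 */
long bitmap_alloc(uint8_t *bitmap, size_t nbits)
{
    assert(bitmap != NULL);

    for (size_t i = 0; i < nbits; i++) {
        uint8_t mask = (uint8_t)(1u << (i % 8));

        if (!(bitmap[i / 8] & mask)) {
            bitmap[i / 8] |= mask;   /* claim the inode or block */
            return (long)i;
        }
    }
    return -1;
}

/* Number of entries that a bitmap occupying one block can track. */
size_t bits_per_bitmap_block(size_t block_size)
{
    return block_size * 8;
}
```

Freeing an entry is the mirror image: clear the bit again so a later search can reuse it.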
File Creation Process
When a program calls open(..., O_CREAT), the kernel searches the directory path, creates a new inode by finding a zero bit in the inode bitmap, and allocates data blocks using the block bitmap.
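From user space, the result of this process can be observed with open and fstat. The sketch below creates a file if it does not exist and reports the inode number the kernel assigned to it; the function name create_and_get_inode is hypothetical, and the example works on any POSIX filesystem, not only ext4.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create `path` if needed and return its inode number, or 0 on failure.
 * The inode itself is allocated inside the kernel during open(O_CREAT);
 * this function only observes the outcome.
 */
unsigned long long create_and_get_inode(const char *path)
{
    struct stat st;
    int fd;

    assert(path != NULL);

    fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return 0;

    if (fstat(fd, &st) != 0) {
        close(fd);
        return 0;
    }
    close(fd);
    return (unsigned long long)st.st_ino;
}
```

Calling the function twice on the same path returns the same number, because the second open finds the existing directory entry instead of allocating a new inode.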
Superblock and Group Descriptors
The superblock (ext4_super_block) holds global filesystem statistics such as total inode count, total block count, inodes per group, and blocks per group.
Data blocks are organized into block groups. Each group has a descriptor (ext4_group_desc) that records the locations of the group's inode bitmap, block bitmap, and inode table.
Storing a full copy of the group-descriptor table in block groups wastes space, so ext4 introduces meta block groups. A meta block group is a run of block groups whose descriptors fit in a single block: 64 groups with 4 KB blocks and 64-byte descriptors. Each meta block group stores only the descriptors for its own groups, which reduces metadata size and improves resilience.
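The arithmetic behind the figure of 64 is simple enough to write down. The sketch below assumes the descriptor size is passed in by the caller (64 bytes here, which is an assumption about the filesystem's configuration) and uses invented helper names; it only models the grouping, not where backup copies are placed.

```c
#include <assert.h>
#include <stdint.h>

/* Number of group descriptors that fit into one block. */
uint32_t desc_per_block(uint32_t block_size, uint32_t desc_size)
{
    assert(desc_size != 0);
    return block_size / desc_size;
}

/* Index of the meta block group that block group `group` belongs to. */
uint32_t meta_group_of(uint32_t group, uint32_t per_block)
{
    assert(per_block != 0);
    return group / per_block;
}

/* First block group of the meta block group containing `group`. */
uint32_t meta_group_first(uint32_t group, uint32_t per_block)
{
    assert(per_block != 0);
    return (group / per_block) * per_block;
}
```

For example, with 4 KB blocks and 64-byte descriptors, block group 130 falls into meta block group 2, which starts at block group 128.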
Directory Storage Format
A directory is itself a file with an inode. Its data blocks contain ext4_dir_entry records, each holding a filename and the corresponding inode number. The first two entries are “.” (current directory) and “..” (parent directory).
If the inode has the EXT4_INDEX_FL flag, the directory uses an indexed tree: leaf nodes store ext4_dir_entry lists, while internal nodes store hash‑to‑block mappings, enabling fast look‑ups even in directories with many entries.
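The linear (non-indexed) case can be sketched as a walk over one directory block. The record layout assumed here follows ext4_dir_entry_2: a 4-byte inode number, a 2-byte record length, a 1-byte name length, a 1-byte file type, then the name, all little-endian. The function name dir_lookup is hypothetical, and the sketch assumes rec_len stores the record length directly, which holds for block sizes below 64 KB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIRENT_HDR 8  /* inode (4) + rec_len (2) + name_len (1) + file_type (1) */

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Scan one directory block for `name`.
 * Returns the inode number, or 0 if the name is absent or the block is corrupt.
 */
uint32_t dir_lookup(const uint8_t *block, size_t block_size, const char *name)
{
    assert(block != NULL && name != NULL);

    size_t want = strlen(name);
    size_t off = 0;

    while (off + DIRENT_HDR <= block_size) {
        uint32_t ino      = le32(block + off);
        uint16_t rec_len  = le16(block + off + 4);
        uint8_t  name_len = block[off + 6];

        if (rec_len < DIRENT_HDR || off + rec_len > block_size)
            break;                       /* malformed record: stop scanning */

        if (ino != 0 && name_len == want &&
            DIRENT_HDR + (size_t)name_len <= rec_len &&
            memcmp(block + off + DIRENT_HDR, name, want) == 0)
            return ino;

        off += rec_len;                  /* rec_len links to the next record */
    }
    return 0;
}
```

The cost of this scan grows with the number of entries, which is the motivation for the hashed index: it narrows the search to a single leaf block before a scan like this one runs.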
Ext4 I/O Paths
Ext4 defines an ext4_file_operations structure. Read operations go through ext4_file_read_iter → generic_file_read_iter. Write operations go through ext4_file_write_iter → __generic_file_write_iter.
Two I/O models exist:
Cached I/O: Data passes through the page cache. Reads check the cache first and go to disk only on a miss; writes copy data into the cache and mark the pages dirty, and the kernel flushes dirty pages to disk later.
Direct I/O: Data moves directly between the user-space buffer and the storage device, bypassing the page cache and avoiding the extra copy.
Write Path Implementation
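ssize_t generic_perform_write(struct file *file,
                struct iov_iter *i, loff_t pos)
{
    struct address_space *mapping = file->f_mapping;
    const struct address_space_operations *a_ops = mapping->a_ops;
    long status = 0;
    ssize_t written = 0;
    unsigned int flags = 0;

    do {
        struct page *page;
        unsigned long offset;   /* offset into the page-cache page */
        unsigned long bytes;    /* bytes to write to this page */
        size_t copied;          /* bytes actually copied from user space */
        void *fsdata;

        offset = pos & (PAGE_SIZE - 1);
        bytes = min_t(unsigned long, PAGE_SIZE - offset,
                      iov_iter_count(i));

        /* write_begin prepares the page */
        status = a_ops->write_begin(file, mapping, pos, bytes, flags,
                                    &page, &fsdata);
        if (unlikely(status < 0))
            break;

        /* copy data from user space into the page */
        copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
        flush_dcache_page(page);

        /* write_end finalises the write */
        status = a_ops->write_end(file, mapping, pos, bytes, copied,
                                  page, fsdata);
        if (unlikely(status < 0))
            break;
        copied = status;

        /* advance the iterator so the loop makes progress */
        iov_iter_advance(i, copied);
        pos += copied;
        written += copied;
        balance_dirty_pages_ratelimited(mapping);
    } while (iov_iter_count(i));

    return written ? written : status;
}
Key steps: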
Call address_space->write_begin to prepare the target page.
Copy user data into the page with iov_iter_copy_from_user_atomic.
Finalize the write via address_space->write_end, which may journal the operation.
Invoke balance_dirty_pages_ratelimited to decide if dirty pages should be flushed.
Ext4 is a journaling filesystem. Three journal modes exist:
Journal (data=journal): Both data and metadata are journaled (safest, slowest).
Ordered (data=ordered, the default): Only metadata is journaled; data must be on disk before its metadata is committed.
Writeback (data=writeback): Only metadata is journaled; data may be written after its metadata (fastest, least safe).
struct page *grab_cache_page_write_begin(struct address_space *mapping,
                pgoff_t index, unsigned flags)
{
    struct page *page;
    int fgp_flags = FGP_LOCK|FGP_WRITE|FGP_CREAT;

    page = pagecache_get_page(mapping, index, fgp_flags,
                              mapping_gfp_mask(mapping));
    if (page)
        wait_for_stable_page(page);
    return page;
}

size_t iov_iter_copy_from_user_atomic(struct page *page,
                struct iov_iter *i, unsigned long offset, size_t bytes)
{
    char *kaddr = kmap_atomic(page), *p = kaddr + offset;

    iterate_all_kinds(i, bytes, v,
        copyin((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
        memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
                         v.bv_offset, v.bv_len),
        memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len));
    kunmap_atomic(kaddr);
    return bytes;
}

void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
    struct inode *inode = mapping->host;
    struct backing_dev_info *bdi = inode_to_bdi(inode);
    struct bdi_writeback *wb = NULL;
    int ratelimit;
    ...
    if (unlikely(current->nr_dirtied >= ratelimit))
        balance_dirty_pages(mapping, wb, current->nr_dirtied);
    ...
}
Dirty pages are flushed when the count exceeds a threshold, when the user calls sync, when memory pressure forces reclamation, or when pages have been dirty for too long.
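An application that cannot wait for one of these triggers can request the flush itself. The sketch below writes a buffer through the page cache and then calls fsync, which blocks until the file's dirty pages have been handed to the storage device; the function name write_durably is made up, and the error handling is deliberately minimal.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write `len` bytes of `buf` to `path` using cached I/O, then force the
 * dirty pages to stable storage. Returns 0 on success, -1 on any error.
 */
int write_durably(const char *path, const char *buf, size_t len)
{
    size_t done = 0;
    int fd;

    assert(path != NULL && buf != NULL);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    while (done < len) {                 /* write() may be partial */
        ssize_t n = write(fd, buf + done, len - done);

        if (n < 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }

    if (fsync(fd) != 0) {                /* flush this file's dirty pages */
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Without the fsync call, write returns as soon as the data sits in the page cache, and it reaches the disk only when one of the conditions listed above fires.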
Read Path Implementation
static ssize_t generic_file_buffered_read(struct kiocb *iocb,
                struct iov_iter *iter, ssize_t written)
{
    struct file *filp = iocb->ki_filp;
    struct address_space *mapping = filp->f_mapping;
    struct inode *inode = mapping->host;
    struct file_ra_state *ra = &filp->f_ra;
    ...
    for (;;) {
        struct page *page;
        pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;

        /* look the page up in the page cache */
        page = find_get_page(mapping, index);
        if (!page) {
            if (iocb->ki_flags & IOCB_NOWAIT)
                goto would_block;
            /* cache miss: read ahead synchronously, then retry */
            page_cache_sync_readahead(mapping, ra, filp,
                                      index, last_index - index);
            page = find_get_page(mapping, index);
            if (unlikely(page == NULL))
                goto no_cached_page;
        }
        /* page is marked for readahead: prefetch the following pages */
        if (PageReadahead(page))
            page_cache_async_readahead(mapping, ra, filp, page,
                                       index, last_index - index);
        ...
        /* copy the page contents to the user buffer */
        ret = copy_page_to_iter(page, offset, nr, iter);
        ...
    }
    ...
    return ret;
}
The read path first looks for the page in the page cache. If missing, it triggers synchronous readahead, then retries. Once a page is found, it may start asynchronous readahead for subsequent pages and finally copies the page data to user space.
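The index arithmetic in this loop is worth spelling out. The sketch below shows how a byte position splits into a page-cache index and an offset within that page, and how much a single iteration can copy. It assumes 4 KB pages, and the DEMO_ constants and function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12                        /* assumption: 4 KB pages */
#define DEMO_PAGE_SIZE  (1ULL << DEMO_PAGE_SHIFT)

/* Page-cache index of the page that contains byte position `pos`. */
uint64_t page_index(uint64_t pos)
{
    return pos >> DEMO_PAGE_SHIFT;
}

/* Offset of `pos` inside its page. */
uint64_t offset_in_page(uint64_t pos)
{
    return pos & (DEMO_PAGE_SIZE - 1);
}

/*
 * Bytes one loop iteration can copy: the rest of the current page,
 * capped by the number of bytes the caller still wants.
 */
uint64_t chunk_len(uint64_t pos, uint64_t remaining)
{
    uint64_t room = DEMO_PAGE_SIZE - offset_in_page(pos);

    assert(room > 0 && room <= DEMO_PAGE_SIZE);
    return remaining < room ? remaining : room;
}
```

A read of 5000 bytes starting at position 10000 therefore begins in page 2 at offset 1808, copies 2288 bytes from that page, and continues with page 3 on the next iteration.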
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
