Understanding Linux ext4: Inodes, Extents, and File Caching Explained

This article explains the core design of Linux file systems, covering strict organization, block allocation, inode structures, ext4 extent trees, block groups, superblock metadata, directory storage formats, and the kernel's cached and direct I/O paths for reading and writing files.

Liangxu Linux

Nov 2, 2020

Linux File System Characteristics

Files are organized in strict block‑based structures to enable efficient storage and retrieval.

An index area allows quick location of all blocks belonging to a file.

Hot files that are frequently read or written are served through a cache layer.

Files are grouped into directories for easier management and lookup.

The kernel maintains in‑memory data structures tracking which processes have opened which files.

Overall, the main functions of a file system are summarized in the diagram below.

ext Series File System Layout

Inode and Block Storage

The file system divides the disk into equal‑sized units called blocks (4 KB by default). A file's data is stored in these blocks, which do not have to be contiguous, so a file can grow without being moved.

Each file and directory has an inode that stores metadata and pointers to its data blocks.

The inode structure in ext4 is defined as:

struct ext4_inode {
    __le16  i_mode;      /* File mode */
    __le16  i_uid;       /* Low 16 bits of Owner Uid */
    __le32  i_size_lo;  /* Size in bytes */
    __le32  i_atime;    /* Access time */
    __le32  i_ctime;    /* Inode change time */
    __le32  i_mtime;    /* Modification time */
    __le32  i_dtime;    /* Deletion time */
    __le16  i_gid;      /* Low 16 bits of Group Id */
    __le16  i_links_count; /* Links count */
    __le32  i_blocks_lo;   /* Blocks count */
    __le32  i_flags;      /* File flags */
    ...
    __le32  i_block[EXT4_N_BLOCKS]; /* Pointers to blocks */
    __le32  i_generation;  /* File version (for NFS) */
    __le32  i_file_acl_lo; /* File ACL */
    __le32  i_size_high;
    ...
};

The inode records permissions (i_mode), owner (i_uid), group (i_gid), size (i_size_lo/high), timestamps, and an array of block pointers (i_block). The number of block pointers is defined by:

#define EXT4_NDIR_BLOCKS        12
#define EXT4_IND_BLOCK          EXT4_NDIR_BLOCKS
#define EXT4_DIND_BLOCK         (EXT4_IND_BLOCK + 1)
#define EXT4_TIND_BLOCK         (EXT4_DIND_BLOCK + 1)
#define EXT4_N_BLOCKS           (EXT4_TIND_BLOCK + 1)

In ext2/3 the first 12 entries of i_block hold direct block addresses. For larger files, i_block[12] points to an indirect block, i_block[13] to a doubly‑indirect block, and i_block[14] to a triply‑indirect block, forming a tree of block references.
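As a minimal sketch of the scheme just described, the function below reports how many levels of indirection are needed to reach a given logical block. It assumes 4-byte block pointers, so a 4 KB indirect block holds 1024 of them; the names classify_block and map_level are illustrative and do not come from the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define NDIR_BLOCKS 12u   /* direct pointers, mirroring EXT4_NDIR_BLOCKS */

/* Illustrative levels: 0 = direct, 1..3 = single/double/triple indirect. */
enum map_level {
    MAP_TOO_BIG = -1,
    MAP_DIRECT  = 0,
    MAP_IND     = 1,
    MAP_DIND    = 2,
    MAP_TIND    = 3
};

/*
 * Return the indirection level needed to reach logical block `lblk`,
 * given `ptrs` block pointers per indirect block (block size / 4).
 */
enum map_level classify_block(uint64_t lblk, uint64_t ptrs)
{
    assert(ptrs > 0);

    if (lblk < NDIR_BLOCKS)
        return MAP_DIRECT;
    lblk -= NDIR_BLOCKS;

    if (lblk < ptrs)
        return MAP_IND;
    lblk -= ptrs;

    if (lblk < ptrs * ptrs)
        return MAP_DIND;
    lblk -= ptrs * ptrs;

    if (lblk < ptrs * ptrs * ptrs)
        return MAP_TIND;
    return MAP_TOO_BIG;
}
```

With 1024 pointers per block, logical block 11 is still direct, block 12 is the first one behind the indirect block, and block 1036 (12 + 1024) is the first one behind the doubly‑indirect block. Each extra level means one more block to read before the data itself is reached.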

To avoid many disk seeks for large files, ext4 introduces extents, which describe a run of contiguous blocks with a single descriptor. A large file then needs far fewer mapping entries, which improves performance and reduces fragmentation.

Each extent node is described by ext4_extent_header:

struct ext4_extent_header {
    __le16  eh_magic;   /* Magic number */
    __le16  eh_entries; /* Number of valid entries */
    __le16  eh_max;     /* Maximum entries */
    __le16  eh_depth;   /* Tree depth */
    __le32  eh_generation; /* Generation of the tree */
};

Leaf entries (ext4_extent) point directly to physical blocks, while index entries (ext4_extent_idx) point to lower‑level nodes:

struct ext4_extent {
    __le32  ee_block;   /* First logical block covered */
    __le16  ee_len;     /* Number of blocks covered */
    __le16  ee_start_hi;/* High 16 bits of physical block */
    __le32  ee_start_lo;/* Low 32 bits of physical block */
};

struct ext4_extent_idx {
    __le32  ei_block;   /* Logical block covered by this index */
    __le32  ei_leaf_lo;/* Low 32 bits of leaf block address */
    __le16  ei_leaf_hi;/* High 16 bits of leaf block address */
    __u16   ei_unused;
};
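To make the leaf layout concrete, here is a minimal sketch of resolving a logical block through the extents of one leaf node. It assumes a host-endian copy of the entries (struct extent is an illustrative mirror of ext4_extent), entries sorted by ee_block, and it ignores the special encoding used for uninitialized extents; extent_start and extent_lookup are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-endian mirror of struct ext4_extent, for illustration only. */
struct extent {
    uint32_t ee_block;     /* first logical block covered */
    uint16_t ee_len;       /* number of blocks covered */
    uint16_t ee_start_hi;  /* high 16 bits of physical block */
    uint32_t ee_start_lo;  /* low 32 bits of physical block */
};

/* Combine the split start address into one 48-bit physical block number. */
uint64_t extent_start(const struct extent *ex)
{
    return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start_lo;
}

/*
 * Binary-search the extents of a leaf (sorted by ee_block) for logical
 * block `lblk`. Returns 1 and stores the physical block in *pblk on a
 * hit, or 0 if `lblk` is not covered by any extent (a hole).
 */
int extent_lookup(const struct extent *ext, size_t n,
                  uint32_t lblk, uint64_t *pblk)
{
    size_t lo = 0, hi = n;

    assert(pblk != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ext[mid].ee_block <= lblk)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* lo is now the first extent starting after lblk; check the one before. */
    if (lo == 0)
        return 0;

    const struct extent *ex = &ext[lo - 1];
    if (lblk - ex->ee_block >= ex->ee_len)
        return 0;

    *pblk = extent_start(ex) + (lblk - ex->ee_block);
    return 1;
}
```

For a leaf holding the extents {0, 8, 0, 1000} and {16, 4, 1, 5}, logical block 3 resolves to physical block 1003, while logical block 8 falls into the gap between the two extents and is reported as a hole.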

If a file is small enough, the inode itself can hold an ext4_extent_header and up to four extents (tree depth 0) in the 60 bytes of i_block. Larger files cause the extent tree to grow, with the nodes below the root stored in separate 4 KB blocks. Each such block can hold 340 extents, and each extent covers up to 128 MB, so a single leaf block can describe about 42.5 GB of file data. Files larger than that are handled by adding more leaf blocks under index nodes.
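The figures in the paragraph above follow from simple arithmetic, sketched here. The sketch assumes that the extent header and each extent entry are 12 bytes (as the struct definitions imply) and that an initialized extent covers at most 32768 blocks; the function and macro names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define EXTENT_HEADER_SIZE 12u     /* bytes in struct ext4_extent_header */
#define EXTENT_ENTRY_SIZE  12u     /* bytes in struct ext4_extent */
#define MAX_EXTENT_BLOCKS  32768u  /* assumed cap on one initialized extent */

/* Number of extent entries that fit in a node of `node_bytes` bytes. */
uint32_t extents_per_node(uint32_t node_bytes)
{
    assert(node_bytes >= EXTENT_HEADER_SIZE);
    return (node_bytes - EXTENT_HEADER_SIZE) / EXTENT_ENTRY_SIZE;
}

/* Bytes of file data one full leaf node can map. */
uint64_t bytes_per_leaf(uint32_t node_bytes, uint32_t block_size)
{
    return (uint64_t)extents_per_node(node_bytes)
           * MAX_EXTENT_BLOCKS * block_size;
}
```

extents_per_node(60) gives 4 for the i_block area inside the inode, extents_per_node(4096) gives 340 for a 4 KB node, and bytes_per_leaf(4096, 4096) gives 340 × 128 MB, which is the roughly 42.5 GB quoted above.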

Inode and Block Bitmaps

The inode bitmap and the block bitmap each occupy 4 KB (one block). Each bit records whether the corresponding inode or block is allocated (1 = used, 0 = free). When creating a new file (e.g., via open(..., O_CREAT)), the kernel reads the inode bitmap to find a free inode and later uses the block bitmap to allocate data blocks.
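The core of bitmap allocation can be sketched as a scan for the first clear bit. This is a toy linear search under the assumption that bit i lives in byte i / 8 at position i % 8; the real allocator is considerably more elaborate, and bitmap_alloc is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Find the first clear bit in `bitmap` (nbits bits long), set it, and
 * return its index. Returns -1 if every bit is already set.
 */
long bitmap_alloc(uint8_t *bitmap, size_t nbits)
{
    assert(bitmap != NULL);

    for (size_t i = 0; i < nbits; i++) {
        uint8_t mask = (uint8_t)(1u << (i % 8));

        if (!(bitmap[i / 8] & mask)) {
            bitmap[i / 8] |= mask;   /* mark the inode or block as used */
            return (long)i;
        }
    }
    return -1;
}
```

A 4 KB bitmap holds 32768 bits, so one block bitmap can track 32768 blocks, which is 128 MB of data at a 4 KB block size.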

Filesystem Metadata Structures

Data block bitmaps, inode tables, and group descriptors are organized into block groups. Each block group has its own descriptor (ext4_group_desc) containing pointers to the inode bitmap, block bitmap, and inode table. A superblock (ext4_super_block) holds global counts such as total inodes, total blocks, inodes per group, and blocks per group.
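Given the inodes-per-group count from the superblock, finding an inode on disk starts with a division. The sketch below shows that step, relying on the fact that inode numbers start at 1; struct inode_location and locate_inode are illustrative names, not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

struct inode_location {
    uint32_t group;   /* block group whose inode table holds the inode */
    uint32_t index;   /* slot within that group's inode table */
};

/* Inode numbers start at 1, so subtract 1 before dividing. */
struct inode_location locate_inode(uint32_t ino, uint32_t inodes_per_group)
{
    struct inode_location loc;

    assert(ino >= 1 && inodes_per_group > 0);
    loc.group = (ino - 1) / inodes_per_group;
    loc.index = (ino - 1) % inodes_per_group;
    return loc;
}
```

With 8192 inodes per group, inode 8192 is the last slot of group 0 and inode 8193 is the first slot of group 1. The group number then selects the group descriptor, which points to the inode table to read.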

To protect against loss, backup copies of the superblock and the group descriptor table are kept in other block groups. Keeping a full copy of the descriptor table in every group wastes space, so the Meta Block Group feature divides the block groups into meta‑groups (e.g., 64 groups per meta‑group) and stores only that meta‑group's descriptors inside it, reducing overhead while still providing redundancy.

Directory Storage Format

Directories are themselves files, each with its own inode. Their data blocks contain ext4_dir_entry records, each storing a child's file name and its inode number. The first two entries are "." (the current directory) and ".." (the parent directory).
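A linear lookup over one directory block can be sketched as follows. The record layout used here is a simplified, host-endian assumption (4-byte inode number, 2-byte record length, 1-byte name length, 1-byte file type, then the name); dir_lookup and DIRENT_HEADER are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Simplified record layout assumed for this sketch:
 *   bytes 0-3  inode number
 *   bytes 4-5  rec_len (distance to the next record)
 *   byte  6    name_len
 *   byte  7    file type
 *   bytes 8..  name (not NUL-terminated)
 */
#define DIRENT_HEADER 8u

/*
 * Scan one directory block for `name`. Returns the inode number of the
 * matching entry, or 0 if there is none (0 marks an unused entry).
 */
uint32_t dir_lookup(const uint8_t *block, size_t block_size, const char *name)
{
    size_t want, off = 0;

    assert(block != NULL && name != NULL);
    want = strlen(name);

    while (off + DIRENT_HEADER <= block_size) {
        uint32_t ino;
        uint16_t rec_len;
        size_t name_len = block[off + 6];

        memcpy(&ino, block + off, sizeof ino);
        memcpy(&rec_len, block + off + 4, sizeof rec_len);
        if (rec_len < DIRENT_HEADER || off + rec_len > block_size)
            break;                      /* corrupt record: stop scanning */

        if (ino != 0 && name_len == want &&
            DIRENT_HEADER + name_len <= rec_len &&
            memcmp(block + off + DIRENT_HEADER, name, want) == 0)
            return ino;

        off += rec_len;                 /* rec_len chains to the next record */
    }
    return 0;
}
```

Every lookup walks the records in order, so the cost grows with the number of entries in the directory.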

If the inode has the EXT4_INDEX_FL flag, the directory uses a hashed index tree, allowing fast lookup by name.

Linux File Caching Layer (ext4)

ext4 File Operations

const struct file_operations ext4_file_operations = {
    ...
    .read_iter  = ext4_file_read_iter,
    .write_iter = ext4_file_write_iter,
    ...
};

ext4_file_read_iter calls generic_file_read_iter, and ext4_file_write_iter calls __generic_file_write_iter. These functions decide whether to use cached I/O or direct I/O based on the IOCB_DIRECT flag.

Cached Write Path

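/* Abridged from mm/filemap.c; retry and fault handling are omitted. */
ssize_t generic_perform_write(struct file *file,
                               struct iov_iter *i, loff_t pos)
{
    struct address_space *mapping = file->f_mapping;
    const struct address_space_operations *a_ops = mapping->a_ops;
    long status = 0;
    ssize_t written = 0;
    unsigned int flags = 0;
    do {
        struct page *page;
        unsigned long offset;   /* Offset into pagecache page */
        unsigned long bytes;    /* Bytes to write to page */
        size_t copied;          /* Bytes copied from user */
        void *fsdata;
        offset = (pos & (PAGE_SIZE - 1));
        bytes = min_t(unsigned long, PAGE_SIZE - offset,
                      iov_iter_count(i));
        /* Step 1: locate or allocate the page and prepare it. */
        status = a_ops->write_begin(file, mapping, pos, bytes, flags,
                                    &page, &fsdata);
        if (unlikely(status < 0))
            break;
        /* Step 2: copy the user data into the page cache page. */
        copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
        flush_dcache_page(page);
        /* Step 3: finish the write; the page ends up marked dirty. */
        status = a_ops->write_end(file, mapping, pos, bytes, copied,
                                  page, fsdata);
        if (unlikely(status < 0))
            break;
        iov_iter_advance(i, copied);
        pos += copied;
        written += copied;
        /* Step 4: throttle the writer if too many pages are dirty. */
        balance_dirty_pages_ratelimited(mapping);
    } while (iov_iter_count(i));
    return written ? written : status;
}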

The steps are:

Prepare the page via write_begin (including journaling work).

Copy data from user space to the kernel page with iov_iter_copy_from_user_atomic.

Finalize the write with write_end, marking the page dirty.

Trigger write‑back if too many dirty pages exist via balance_dirty_pages_ratelimited.
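The steps above can be modeled in user space with a toy page cache. Everything in this sketch is hypothetical (struct toy_cache, toy_buffered_write, the 8-page limit); it only mirrors the shape of the loop: split the write at page boundaries, copy into the page, mark it dirty, and keep a count that a throttle could inspect.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_NR_PAGES  8u

/* Toy page cache for one file: hypothetical, not a kernel structure. */
struct toy_cache {
    unsigned char data[TOY_NR_PAGES][TOY_PAGE_SIZE];
    int dirty[TOY_NR_PAGES];   /* 1 = modified in memory, not yet on disk */
};

/*
 * Copy `len` bytes from `buf` into the cache starting at file offset
 * `pos`, one page at a time, marking each touched page dirty.
 * Returns the bytes written (short if the toy cache runs out of pages).
 */
size_t toy_buffered_write(struct toy_cache *c, const void *buf,
                          size_t len, size_t pos)
{
    const unsigned char *src = buf;
    size_t written = 0;

    assert(c != NULL && buf != NULL);
    while (written < len) {
        size_t index  = pos / TOY_PAGE_SIZE;      /* which page */
        size_t offset = pos % TOY_PAGE_SIZE;      /* offset into that page */
        size_t bytes  = TOY_PAGE_SIZE - offset;   /* room left in the page */

        if (index >= TOY_NR_PAGES)
            break;
        if (bytes > len - written)
            bytes = len - written;

        /* Stand-in for write_begin plus the copy from user space. */
        memcpy(&c->data[index][offset], src + written, bytes);
        /* Stand-in for write_end: the page now awaits write-back. */
        c->dirty[index] = 1;

        pos += bytes;
        written += bytes;
    }
    return written;
}

/* Count dirty pages, the quantity a write-back throttle would watch. */
size_t toy_dirty_pages(const struct toy_cache *c)
{
    size_t n = 0;

    for (size_t i = 0; i < TOY_NR_PAGES; i++)
        n += (size_t)(c->dirty[i] != 0);
    return n;
}
```

Writing 5000 bytes at offset 4000 touches three pages (96 bytes in page 0, all of page 1, and 808 bytes in page 2), so three pages become dirty even though the caller issued a single write.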

Journaling modes:

Journal: logs both metadata and data (safest, slowest).

Ordered (default): logs only metadata, ensuring data is on disk before metadata is committed.

Writeback: logs only metadata, without ordering guarantees (fastest, least safe).

Cached Read Path

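/* Abridged excerpt; some declarations and the error paths are elided. */
static ssize_t generic_file_buffered_read(struct kiocb *iocb,
                                          struct iov_iter *iter,
                                          ssize_t written)
{
    struct file *filp = iocb->ki_filp;
    struct address_space *mapping = filp->f_mapping;
    struct inode *inode = mapping->host;
    for (;;) {
        struct page *page;
        pgoff_t index = ...;
        /* Look the page up in the page cache first. */
        page = find_get_page(mapping, index);
        if (!page) {
            /* Cache miss: a non-blocking caller gives up here. */
            if (iocb->ki_flags & IOCB_NOWAIT)
                goto would_block;
            /* Otherwise read the page and its neighbors synchronously. */
            page_cache_sync_readahead(mapping, ...);
            page = find_get_page(mapping, index);
            if (unlikely(page == NULL))
                goto no_cached_page;
        }
        /* A page flagged for readahead starts the next asynchronous batch. */
        if (PageReadahead(page))
            page_cache_async_readahead(mapping, ...);
        /* Copy the cached data out to the caller's buffer. */
        ret = copy_page_to_iter(page, offset, nr, iter);
    }
}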

The function first looks for a cached page; if missing, it performs synchronous readahead, then possibly asynchronous readahead, and finally copies the page data to user space.
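The benefit of synchronous readahead for sequential reads can be seen in a toy model. The sketch assumes a fixed window of 4 pages per miss, which is an arbitrary value chosen for illustration, and it does not model the asynchronous variant; struct ra_cache and ra_read_page are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define RA_NR_PAGES 64u
#define RA_WINDOW   4u    /* pages fetched per miss; arbitrary toy value */

/* Toy per-file cache state: hypothetical, for illustration only. */
struct ra_cache {
    int cached[RA_NR_PAGES];  /* 1 = page is in memory */
    size_t disk_reads;        /* simulated disk requests issued so far */
};

/*
 * Read page `index`. On a miss, issue one simulated disk request that
 * brings in the page plus the following RA_WINDOW - 1 pages.
 * Returns 1 on a cache hit, 0 on a miss.
 */
int ra_read_page(struct ra_cache *c, size_t index)
{
    assert(c != NULL && index < RA_NR_PAGES);

    if (c->cached[index])
        return 1;                          /* served from memory */

    c->disk_reads++;                       /* synchronous readahead */
    for (size_t i = index; i < index + RA_WINDOW && i < RA_NR_PAGES; i++)
        c->cached[i] = 1;
    return 0;
}
```

Reading pages 0 through 15 in order costs 4 simulated disk requests instead of 16, because each miss also brings in the next three pages.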

Write‑back of dirty pages occurs in four situations: when a writer pushes the amount of dirty memory past the kernel's threshold (checked in balance_dirty_pages_ratelimited), when the user explicitly calls sync, when memory pressure forces page reclamation, or when a page has stayed dirty for too long.
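From user space, the explicit trigger looks roughly like the sketch below, which uses fsync(), the per-file counterpart of sync. The helper name write_durably and the temporary file path are assumptions made for the example.

```c
#define _XOPEN_SOURCE 700   /* for mkstemp() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Write `text` to a new temporary file and force it to stable storage.
 * Returns 0 on success, -1 on any failure.
 */
int write_durably(const char *text)
{
    char path[] = "/tmp/wb-demo-XXXXXX";   /* hypothetical location */
    size_t len;
    int fd, rc = -1;

    assert(text != NULL);
    len = strlen(text);

    fd = mkstemp(path);
    if (fd < 0)
        return -1;

    /* write() normally returns once the data sits in the page cache; */
    /* fsync() then blocks until this file's dirty pages are written out. */
    if (write(fd, text, len) == (ssize_t)len && fsync(fd) == 0)
        rc = 0;

    close(fd);
    unlink(path);                          /* clean up the demo file */
    return rc;
}
```

Without the fsync() call the data would still reach the disk eventually, through one of the other write‑back triggers listed above.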

Author: luozhiyun Source: https://www.cnblogs.com/luozhiyun/p/13061199.html
