Understanding the Linux Kernel Buffer Cache: Structure, Initialization, and Management
This article explains the role of the Linux kernel buffer cache in file I/O, describes its data structures such as buffer_head, details how buffers are organized with hash tables and free lists, and walks through initialization, allocation, release, synchronization, and optimization techniques with code examples.
1. Buffer Cache Overview
1.1 What is a buffer cache
In the Linux kernel, a buffer cache is a region of memory that temporarily stores file data, acting as a bridge between the file system and block devices to improve I/O efficiency.
1.2 Why a buffer cache is needed
Because block device access is much slower than memory access, the kernel allocates a pool of buffer blocks (buffer cache) to reduce the number of direct device reads/writes, thereby boosting overall system performance.
1.3 How the buffer cache works
When reading, the kernel first checks if the required data is already in a buffer (cache hit). If not (cache miss), it reads the block from the device into a buffer. When writing, data is first placed in a buffer and later flushed to the device either when the buffer is full or when explicit sync operations are invoked.
2. Buffer Cache Structure
2.1 Key data structures
The core structure is struct buffer_head , defined in the Linux 0.11 source as:
struct buffer_head {
char *b_data; /* pointer to 1024‑byte data block */
unsigned long b_blocknr;/* block number */
unsigned short b_dev; /* device number (0 = free) */
unsigned char b_uptodate;/* data up‑to‑date flag */
unsigned char b_dirt; /* dirty flag */
unsigned char b_count; /* users of this buffer */
unsigned char b_lock; /* lock flag */
struct task_struct *b_wait;/* tasks waiting for unlock */
struct buffer_head *b_prev, *b_next; /* hash queue links */
struct buffer_head *b_prev_free, *b_next_free;/* free list links */
};Each member is explained in the article (data pointer, block number, device number, up‑to‑date flag, dirty flag, reference count, lock, wait queue, hash links, and free‑list links).
2.2 Organization of buffers
Buffers are organized using a double‑linked free list and a hash table. The free list provides quick allocation of unused buffers, while the hash table (indexed by a hash of device number and block number) enables fast lookup of already cached blocks.
3. Buffer Cache Initialization and Management
3.1 Initialization process
During kernel boot, buffer_init sets up the buffer pool. The function determines the high address of the buffer area, divides it into 1 KB blocks, creates a buffer_head for each block, links them into the free list, and clears the hash table.
void buffer_init(long buffer_end)
{
struct buffer_head *h = start_buffer;
void *b;
int i;
if (buffer_end == 1 << 20)
b = (void *)(640 * 1024);
else
b = (void *)buffer_end;
while ((b -= BLOCK_SIZE) >= ((void *)(h + 1))) {
h->b_dev = 0;
h->b_dirt = 0;
h->b_count = 0;
h->b_lock = 0;
h->b_uptodate = 0;
h->b_wait = NULL;
h->b_next = NULL;
h->b_prev = NULL;
h->b_data = (char *)b;
h->b_prev_free = h - 1;
h->b_next_free = h + 1;
h++;
NR_BUFFERS++;
if (b == (void *)0x100000)
b = (void *)0xA0000;
}
h--;
free_list = start_buffer;
free_list->b_prev_free = h;
h->b_next_free = free_list;
for (i = 0; i < NR_HASH; i++)
hash_table[i] = NULL;
}3.2 Allocation and release of buffers
Buffers are allocated with getblk and released with brelse . getblk first searches the hash table; if the buffer is not present, it picks a free buffer, updates its fields, and returns it. brelse decrements the reference count and, when it reaches zero, puts the buffer back on the free list (writing it out first if it is dirty).
struct buffer_head *getblk(int dev, int blocknr)
{
struct buffer_head *bh;
int hash = (blocknr ^ dev) % NR_HASH;
bh = hash_table[hash];
while (bh) {
if (bh->b_dev == dev && bh->b_blocknr == blocknr) {
if (!bh->b_lock) {
bh->b_count++;
return bh;
} else {
sleep_on(&bh->b_wait);
}
}
bh = bh->b_next;
}
bh = free_list;
while (bh) {
if (!bh->b_count && !bh->b_dirt) {
remove_from_queues(bh);
bh->b_dev = dev;
bh->b_blocknr = blocknr;
bh->b_count = 1;
insert_into_queues(bh);
return bh;
}
bh = bh->b_next_free;
}
// fallback handling omitted
}
void brelse(struct buffer_head *bh)
{
if (!bh) return;
bh->b_count--;
if (!bh->b_count) {
if (bh->b_dirt)
ll_rw_block(WRITE, bh);
remove_from_queues(bh);
bh->b_next_free = free_list;
bh->b_prev_free = free_list->b_prev_free;
free_list->b_prev_free->b_next_free = bh;
free_list->b_prev_free = bh;
}
}3.3 Synchronization mechanisms
Functions such as sync , fsync , and fdatasync ensure that dirty buffers are written to the underlying block device. sync queues all dirty buffers, fsync forces the write for a specific file and waits for completion, while fdatasync flushes only file data, not metadata.
4. Buffer Cache Operation Functions
4.1 Read functions
The kernel provides bread , breada , and bread_page for reading. bread obtains a buffer with getblk , reads from the device if the buffer is not up‑to‑date, and returns the buffer. breada also pre‑reads adjacent blocks, and bread_page reads a whole page (typically four blocks) at once.
struct buffer_head *bread(int dev, int block)
{
struct buffer_head *bh = getblk(dev, block);
if (!bh->b_uptodate)
ll_rw_block(READ, bh);
wait_on_buffer(bh);
if (bh->b_uptodate)
return bh;
brelse(bh);
return NULL;
}4.2 Write functions
Writing uses bwrite , which marks a buffer dirty and, if it is the sole user, immediately issues a write to the device; otherwise the buffer is released and will be written later by the sync machinery.
void bwrite(struct buffer_head *bh)
{
if (!bh) return;
if (bh->b_dirt) return;
bh->b_dirt = 1;
if (bh->b_count > 1) {
brelse(bh);
return;
}
ll_rw_block(WRITE, bh);
}5. Performance Optimization Strategies
5.1 Tuning tips
Increasing the buffer cache size (e.g., adjusting vm.dirty_ratio and vm.dirty_background_ratio ) reduces I/O frequency but consumes more RAM. Optimizing the hash function and free‑list management improves lookup and allocation speed. Adjusting sync policies (delaying writes for low‑priority workloads or forcing immediate sync for databases) balances performance and data safety.
5.2 Common issues and solutions
Buffer contention is mitigated with proper locking (the b_lock field). Data loss after crashes is avoided by using journaling file systems (e.g., ext4) and by regularly invoking sync or fsync to flush dirty buffers.
Deepin Linux
Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.