Databases 11 min read

Understanding B+ Tree Indexes: Disk vs Memory, Structure, and Operations

The article explains how B+ trees serve as efficient disk-based indexes, contrasting disk and memory I/O, describing node and leaf structures, retrieval methods, and dynamic insertion and deletion operations, highlighting their advantages for large-scale data storage and query performance.

Full-Stack Internet Architecture

Nov 21, 2020

Understanding B+ Tree Indexes: Disk vs Memory, Structure, and Operations

1 Differences Between Disk and Memory I/O

When a system must handle massive data, it cannot keep everything in memory and must rely on disks for storage and retrieval. Databases support many index types such as hash, full-text, and B+ tree indexes; this article focuses on the pros and cons of B+ trees.

Interviewers often ask about B+ trees, and many candidates give repetitive answers; this article aims to provide a fresh perspective.

We typically encounter mechanical hard drives (HDD) and solid‑state drives (SSD). Memory, being semiconductor, allows random access via addresses, offering fast speed but limited capacity. Disks are mechanical; each access requires the platter to rotate to the read/write head, making random reads/writes much slower than memory, though sequential access narrows the performance gap.

The smallest disk I/O unit is a sector, usually 4 KB. Operating systems read multiple sectors at once, forming a block. For large sequential reads, disk efficiency improves because whole blocks are transferred.

Consider an ordered array stored on disk in blocks. A binary search first loads the middle block into memory, then continues the search, repeatedly loading blocks as needed, which is inefficient due to many disk‑to‑memory transfers.

2 Data and Index Separation

Using a public security system as an example, each user has a unique ID and extensive profile data that cannot all reside in memory, so the data is stored on disk while the ID remains in memory for quick lookup.

Store user IDs and the disk locations of their profiles in an ordered array, keeping only two elements in memory.

When data changes frequently, ordered arrays become inefficient, and binary search trees or hash tables are preferred. However, hash tables do not support range queries, so binary search trees are often used.

If the index itself cannot fit entirely in memory, it can also be stored on disk, leading to the introduction of B+ trees.

2 B+ Tree

Node size equals block size.

Operating systems read disks in block units. If only a few bytes are needed, the whole block is still read, which can be inefficient. B+ trees set each node size to match a block and store an ordered array of keys in each node, maximizing read efficiency.

Internal nodes vs leaf nodes.

Internal nodes store keys and pointers to maintain tree structure but do not store the actual data. Leaf nodes store keys and the associated data without structural pointers, maximizing node space utilization.

B+ trees use a doubly linked list, providing good range query capability and flexible adjustments.

In summary, a B+ tree is a fully balanced m‑ary tree.

3 B+ Tree Retrieval Scheme

The search process starts by locating the interval where the query value falls, reading the pointer to the next block, loading that block, and repeating until a leaf node is reached. Within the leaf node, a binary search determines if the element exists, and if the leaf stores a pointer to detailed data, an additional disk read retrieves it.

A 2 KB node can hold about 200 elements; a four‑level B+ tree can index 200⁴ ≈ 1.6 billion elements. With such a tree, only four disk accesses are needed for a lookup. In practice, the first three levels can reside in memory, requiring just one disk access for the leaf level.

4 B+ Tree Dynamic Adjustments

Now we examine how B+ trees handle insertions and deletions.

Inserting an element with ID 6 into a three‑element B+ tree: if the target leaf node is not full, the element is inserted directly.

Inserting an element with ID 10 would normally go after 9, but the leaf node is full, so the node splits into two, distributing the data evenly.

The split propagates to the parent; if the parent is also full, it must split as well.

Breaking a large problem into smaller ones is a key skill. A B+ tree is essentially composed of basic data structures (arrays, linked lists, trees). Its retrieval process is binary search; when fully loaded in memory, its performance matches that of ordered arrays or binary search trees, but it excels by keeping the index on disk and separating index from data, resulting in a compact, efficient index.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Data Structures storage B+Tree disk I/O database indexing

Written by

Full-Stack Internet Architecture

Introducing full-stack Internet architecture technologies centered on Java

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.