Understanding B+ Tree Indexes: Disk vs Memory, Structure, and Operations
This article explains why large data sets must be stored on disk, compares disk and memory access characteristics, introduces the B+ tree as an efficient disk‑based index, and details its structure, search process, and dynamic adjustments for insertions and deletions.
When a system needs to handle massive amounts of data, it cannot keep everything in memory and must rely on disk storage for persistence and retrieval. Common index types in databases include hash indexes, full‑text indexes, and B+ tree indexes; this article focuses on the advantages and disadvantages of using B+ trees.
Interviewers often ask about B+ trees, and many answers become repetitive; this piece aims to provide a fresh perspective.
1. Differences Between Disk I/O and Memory I/O
Memory is a semiconductor device offering random access at high speed but limited capacity, while disks (both HDD and SSD) are mechanical devices that require the platter to rotate and the head to seek, making random access much slower. Sequential access on disks can approach memory speeds because the disk reads whole sectors (typically 4 KB) and groups them into blocks.
When searching an ordered array stored on disk using binary search, each step may require loading a new block into memory, leading to many costly disk reads; therefore, minimizing disk accesses is crucial.
2. Separating Data and Index
Using a public security system as an example, each user has a unique ID and extensive personal information that cannot all reside in memory, so the data is stored on disk while the index (ID → disk location) remains in memory.
Ordered arrays can store IDs and pointers to disk locations, keeping only two elements in memory.
However, in scenarios with frequent data changes, ordered arrays become inefficient, and structures like binary search trees or hash tables are preferred. Since hash tables do not support range queries, binary search trees are often chosen.
If the index itself becomes too large for memory, it can also be placed on disk, which leads to the introduction of B+ trees.
3. B+ Tree
Node size equals block size.
Operating systems read disk data in block units; a B+ tree aligns each node with a block, storing an ordered array of keys in each node to maximize block utilization and read efficiency.
Internal nodes vs. leaf nodes.
Internal nodes store keys and pointers to child nodes but not the actual data, while leaf nodes store keys together with the corresponding data and no structural pointers, optimizing space usage.
B+ trees use doubly linked lists for leaf nodes, providing good range‑query performance and flexible adjustments.
4. B+ Tree Search Process
The search starts by locating the target key between two adjacent elements, reading the pointer to the next block, and repeating this process until a leaf node is reached. Within the leaf node, a binary search determines whether the key exists, and if the leaf stores a pointer to the actual record, an additional disk read retrieves the full data.
A 2 KB node can hold about 200 keys; a four‑level B+ tree can index roughly 200⁴ (≈1.6 billion) entries, requiring at most four disk accesses. In practice, the upper three levels can be cached in memory, leaving only the lowest level on disk.
5. Dynamic Adjustments of B+ Trees
Insertion: If the target leaf node has space, the key is inserted directly; otherwise, the leaf splits, creating a new node and redistributing keys. If the parent becomes full, it also splits, propagating upward as needed.
Deletion follows similar rebalancing rules (not detailed here).
Breaking a large problem into smaller, manageable pieces is essential; a B+ tree is essentially a combination of basic data structures (arrays, linked lists, trees). Its retrieval process is a series of binary searches, and when fully loaded into memory its performance resembles that of an ordered array or binary search tree, but its main advantage lies in storing the index on disk and separating index from data to keep the index size small.
Overall, B+ trees provide a balanced m‑ary tree structure that efficiently bridges the gap between massive disk‑resident data and fast memory‑resident indexes.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Full-Stack Internet Architecture
Introducing full-stack Internet architecture technologies centered on Java
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
