Mastering MySQL Indexes: B‑Tree, B+Tree, and Optimization Strategies
This article provides a comprehensive overview of MySQL indexing, covering B‑Tree and B+Tree structures, common search algorithms, storage engine differences, and practical optimization techniques such as composite and prefix indexes to improve query performance.
1. Introduction
This article uses MySQL as the research object to discuss various topics related to database indexes. MySQL supports many storage engines, each with different index support, and therefore multiple index types such as B‑Tree, hash, and full‑text. To avoid confusion, the focus is on B‑Tree indexes, which are the most commonly used in MySQL.
2. Common Search Algorithms and Data Structures
Indexes are built to create a data structure on which efficient search algorithms can be applied, ultimately speeding up data retrieval.
2.1 Essence of Indexes
MySQL defines an index as a data structure that helps MySQL retrieve data efficiently. In short, an index is a data structure .
2.2 Common Search Algorithms
Database queries aim for the fastest possible speed, so designers optimize from the algorithmic perspective. The following algorithms are commonly used:
2.2.1 Linear Search
The most basic algorithm compares each element sequentially, which is inefficient for large data sets. Data structure: ordered or unordered list Complexity:
O(n)// Linear Search
int SequenceSearch(int a[], int value, int n) {
int i;
for(i=0; i<n; i++)
if(a[i]==value)
return i;
return -1;
}2.2.2 Binary Search
Binary search improves on linear search by repeatedly halving the search interval. Data structure: sorted array Complexity:
O(log n)// Recursive binary search
int BinarySearch2(int a[], int value, int low, int high) {
int mid = low + (high - low) / 2;
if(a[mid] == value)
return mid;
if(a[mid] > value)
return BinarySearch2(a, value, low, mid-1);
return BinarySearch2(a, value, mid+1, high);
}2.2.3 Binary Search Tree Lookup
Characteristics of a binary search tree:
If the left subtree is not empty, all node values are less than the root.
If the right subtree is not empty, all node values are greater than the root.
Both subtrees are themselves binary search trees.
Search principle:
If the tree is empty, the search fails.
If the target equals the root’s key, the search succeeds.
If the target is less than the root’s key, search the left subtree.
Otherwise, search the right subtree.
Data structure: binary search tree Time complexity:
O(log2N)2.2.4 Hash Table
The hash method creates a hash table using a key and hash function to locate data elements.
Data structure: hash table Time complexity: O(1) (depends on collisions).
2.2.5 Block Search
Block search (indexed sequential search) improves linear search by dividing n elements into m ordered blocks. Each block’s elements need not be ordered, but blocks themselves are ordered.
Algorithm flow:
Select the maximum key of each block to form an index table.
First search the index table (binary or linear) to locate the block, then perform sequential search within that block.
This method halves the search range each comparison, greatly improving speed. Different algorithms require specific data structures; for example, binary search needs ordered data, while binary tree search needs a binary search tree.
2.3 Balanced Multi‑way Search Tree (B‑Tree)
Binary trees have a search complexity of O(log2N), which depends on tree depth. To reduce depth, multi‑branch trees combined with balanced binary tree ideas are used, forming a balanced multi‑way tree that enables efficient search on large data sets.
2.3.1 B‑Tree
A B‑Tree (also called B‑Tree) is a balanced multi‑way search tree. Below is a typical B‑Tree diagram.
We define a record as a tuple [key, data]. The following is a detailed definition:
1. There is a root node that contains either one record and two children, or the root is empty.
2. In each node, keys and pointers alternate; pointers point to child nodes.
3. Let d be the tree’s width. Except for leaf nodes, each node contains [d/2, d‑1] records and [d/2+1, d] children.
4. In a node, the n‑th subtree’s keys are greater than the (n‑1)‑th key and less than the n‑th key.
5. All leaf nodes reside at the same depth.Search in a B‑Tree proceeds by binary search at each node, recursively descending until the key is found or a null pointer is reached.
BTree_Search(node, key) {
if(node == null) return null;
foreach(node.key) {
if(node.key[i] == key) return node.data[i];
if(node.key[i] > key) return BTree_Search(point[i]->node);
}
return BTree_Search(point[i+1]->node);
}
data = BTree_Search(root, my_key);B‑Tree properties include a height upper bound log_d((N+1)/2) and search node count complexity O(log_d N), making it an efficient index structure.
2.3.2 B+Tree
MySQL commonly uses B+Tree for indexes. Compared with B‑Tree, B+Tree differs as follows:
Each node’s pointer limit is 2d instead of 2d+1.
Internal nodes store only keys, not data.
Leaf nodes do not store pointers.
Below is a simple B+Tree illustration.
Because leaf and internal nodes differ in size, B+Tree is more suitable for external storage indexes.
2.3.3 B+Tree with Sequential Access Pointers
Many database and file systems add sequential access pointers to leaf nodes of a classic B+Tree to improve range query performance.
With these pointers, a range query (e.g., keys 18 to 49) can locate the start key and then traverse leaf nodes sequentially, greatly enhancing interval query efficiency.
Consider a table with four rows where Id is the primary key and Name is a secondary key; the diagram shows the difference between clustered and non‑clustered indexes.
Clustered indexes store row data together with the primary key, allowing immediate data retrieval, while secondary indexes reference the primary key, avoiding address updates on row moves.
3. Computer Principles Related to Index Data Structures
Both B‑Tree and B+Tree are widely used for indexes; this section discusses the underlying computer architecture principles.
3.1 Two Types of Storage
Computer systems have main memory (RAM) and external storage (disk, SSD, etc.). RAM provides fast random access, while external storage is orders of magnitude slower due to mechanical movement.
Indexes are usually stored on disk, so index lookups involve disk I/O. The key metric for an index structure is the number of disk I/O operations required during a search.
3.2 Main Memory Access Principle
RAM consists of a matrix of storage cells, each with a unique address. Access time is linear with the number of accesses, independent of the physical distance between cells.
3.3 Disk Access Principle
Disk reads involve mechanical motion: seek time (moving the head), rotation delay, and data transfer. Typical seek time is ~5 ms, rotation delay for a 7200 RPM disk is about 4.17 ms, and transfer time is negligible, resulting in roughly 9 ms per I/O.
3.4 Locality Principle and Disk Prefetch
Because disk I/O is costly, disks perform prefetching, reading a whole page (often 4 KB) sequentially after the requested byte, based on the principle of locality: data near a accessed item is likely to be accessed soon.
4. Performance Analysis of B‑Tree and B+Tree in Database Indexes
Analyzing B‑Tree, a search may visit at most h‑1 nodes (the root stays in memory). By aligning a node’s size with a page, each node requires a single I/O.
Thus, B‑Tree search I/O count is at most h‑1, with complexity O(h)=O(log_d N). In practice, the degree d is large (often > 100), making the height h usually ≤ 3.
4.1 B+Tree Performance Analysis
B+Tree increases the degree d because leaf nodes store only keys, allowing more keys per node. The maximum degree is:
dmax = floor(pageSize / (keySize + dataSize + pointerSize)) // floor means round downThis larger degree reduces tree height and thus I/O operations.
4.2 B+Tree Lookup Process
To find key 29, the system loads disk block 1 (first I/O), performs binary search to locate the pointer to block 2, loads block 2 (second I/O), continues the search, then loads block 8 (third I/O) where the key is found. Three I/Os suffice to locate a record among millions, whereas a full table scan would require millions of I/Os.
5. MySQL Index Implementation
Indexes in MySQL are storage‑engine specific.
5.1 MyISAM Index Implementation
MyISAM uses B+Tree for indexes; leaf nodes store the address of the data record.
The primary index (unique) and secondary indexes share the same B+Tree structure, differing only in uniqueness. Secondary indexes also store record addresses.
5.2 InnoDB Index Implementation
InnoDB also uses B+Tree, but the table’s data file itself is a B+Tree. The leaf nodes contain full row data, making the primary index a clustered index.
InnoDB requires a primary key; if none is defined, MySQL creates a hidden 6‑byte integer key.
Secondary indexes store the primary key value in their leaf nodes instead of the row address.
This design explains why long primary keys inflate secondary indexes and why monotonic primary keys (e.g., auto‑increment) are preferred.
6. Index Usage Strategies and Optimization
Optimization consists of structural (schema) and query optimizations. Understanding index mechanisms enables logical reasoning about high‑performance strategies.
6.1 Composite Indexes and Leftmost Prefix Principle
Composite Indexes
A composite index orders rows first by the first column, then by the second, and so on. The leftmost column is always ordered, and when its values are equal, the next column is ordered.
The first column is always sorted.
If the first column’s values are equal, the second column is sorted, etc.
This behavior mirrors dictionary lookup and follows the leftmost‑prefix rule: you can use the index for queries that filter on the first column alone, the first two columns, or all three, but not for queries that skip the leftmost column.
Leftmost Prefix Principle
Examples:
SELECT * FROM table WHERE a=1; SELECT * FROM table WHERE a=1 AND b=2; SELECT * FROM table WHERE a=1 AND b=2 AND c=3;All can use the composite index (a,b,c). Queries like WHERE a=1 AND c=3 use only column a. Queries that start with b or c cannot use the index unless the optimizer reorders them, which it may do, but writing queries in index order is best practice.
Prefix Indexes
Prefix indexes store only the leading part of a column as the key, reducing index size while retaining selectivity when the prefix is sufficiently discriminative. Suitable scenarios include long VARCHAR/TEXT columns where the first few characters differ significantly (e.g., foreign names, email addresses). Prefix indexes cannot be used for ORDER BY, GROUP BY, or covering indexes.
6.2 Index Optimization Strategies
Follow the leftmost‑prefix rule.
Always index foreign‑key columns.
Index columns appearing in WHERE, ON, GROUP BY, ORDER BY.
Prefer high‑cardinality columns; calculate selectivity as COUNT(DISTINCT col) / COUNT(*).
Use small data types for indexed columns to keep index size low.
Avoid functions on indexed columns; rewrite expressions to compare raw column values.
Apply prefix indexes to long string columns.
Extend existing indexes instead of creating new ones when possible.
Limit the total number of indexes to balance read performance against DML overhead.
For LIKE queries, avoid leading wildcards; use col LIKE 'prefix%' to enable index usage.
Ensure data type consistency in predicates; mismatched types prevent index usage.
Reference articles:
http://blog.csdn.net/suifeng3051/article/details/49530299?locationNum=1 http://tech.meituan.com/mysql-index.html https://yq.aliyun.com/articles/39841 http://blog.csdn.net/lovelion/article/details/8462814
Source: https://www.cnblogs.com/chihirotan/p/7486035.html
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITFLY8 Architecture Home
ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
