
Learned Index Structures: Applying Machine Learning to Database Indexing

Learned Index Structures replace conventional B‑Tree and Bloom filter indexes with hierarchical machine‑learning models that predict key positions directly. This reduces search overhead, but it demands careful model selection, training, and update handling, making learned indexes a promising yet still experimental alternative to traditional database indexing.

OPPO Kernel Craftsman

Machine learning, especially deep learning, has become pervasive in daily life. Frameworks such as TensorFlow and PyTorch, together with AI‑specific chips like TPUs and NPUs, have dramatically improved performance and broadened application scenarios. An emerging question is whether machine learning methods can be used to optimize computer system design, even replacing traditional design patterns.

In 2018, Jeff Dean’s research group presented “The Case for Learned Index Structures” at SIGMOD, introducing the concept of replacing classic index structures (e.g., B‑Tree, Bloom Filter) with learned models that predict the position of a key.

Traditional indexes such as B‑Tree are range indexes: given a key, the tree navigates to a leaf node that contains the key’s range, then performs a search within that leaf. Point indexes (hash tables) and existence indexes (Bloom filters) serve other purposes. Learned Index Structures replace these data structures with machine‑learning models, turning the lookup operation into a prediction of the data position based on the key.

Figure 1 shows a traditional B‑Tree compared with a Learned Index. The Learned Index takes a key as input and outputs an estimated position (with bounded error equal to the size of a leaf page).

For range indexes, the key‑to‑position relationship over sorted data can be modeled as a monotonically increasing function. The estimated position p can be expressed as:

p = F(Key) * N

where N is the total number of keys and F(Key) is the cumulative distribution function (CDF) of the keys. This captures the data distribution and allows the model to predict positions directly, reducing the search space.
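As a rough illustration of this idea (not the paper's code), the following Python sketch approximates the CDF of a small sorted key array with a closed‑form linear least‑squares fit, then corrects the prediction with a binary search inside a bounded window. The key values and the error bound are invented for the example:

```python
import bisect

# Toy sorted keys; positions 0..N-1 play the role of rank = F(key) * N.
keys = sorted([3, 7, 12, 18, 25, 31, 44, 58, 69, 80])
N = len(keys)

# Least-squares fit of position ~ slope * key + intercept (closed form).
mean_k = sum(keys) / N
mean_p = (N - 1) / 2
slope = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys)) / \
        sum((k - mean_k) ** 2 for k in keys)
intercept = mean_p - slope * mean_k

def lookup(key, max_err=2):
    # Predict a position, then binary-search only a small window around it.
    guess = round(slope * key + intercept)
    lo = max(0, guess - max_err)
    hi = min(N, guess + max_err + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    if i < N and keys[i] == key:
        return i
    return bisect.bisect_left(keys, key)  # fall back to a full search
```

For this data the linear model lands within two slots of every true position, so the windowed search touches at most five entries instead of the whole array; a real learned index would record the model's maximum observed error instead of hard-coding it.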

Early attempts used a two‑layer fully connected neural network (32 neurons per layer) trained on web‑server logs, but performance lagged behind B‑Tree due to:

Framework overhead: TensorFlow’s invocation cost dominates in small‑scale inference scenarios.

Under‑fitting: the model captures the overall CDF trend but cannot precisely predict individual keys, requiring additional binary search within the error bound.

Higher CPU and cache usage: each prediction requires loading all model parameters.

Consequently, the final design adopts a Staged Model (Recursive Model Index). Each stage can be any model—from simple linear regression to deep neural networks—but simpler models are preferred to keep lookup latency low. The top‑level model selects a second‑level model based on its prediction; the second‑level model selects a third‑level model, and so on, until the bottom model outputs the final position. The selection rule is:

model_id = (predicted_position / total_records) * number_of_models_in_this_layer
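In code, this dispatch rule might look like the following hypothetical helper (the clamping is added so that a slightly out‑of‑range prediction still maps to a valid child model):

```python
def select_model(predicted_position, total_records, num_models):
    # Scale the position estimate into the child-model index space,
    # then clamp so an off-range prediction still picks a valid model.
    idx = int(predicted_position / total_records * num_models)
    return max(0, min(num_models - 1, idx))
```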

Training proceeds top‑down: the top model is trained first, then data points are dispatched to lower‑level models based on the top model’s predictions. Lower‑level models receive fewer data points, making it easier for simple models (e.g., linear regression) to fit the local distribution, though limited data can restrict model complexity.

The paper also notes that learned indexes are not a universal replacement for traditional structures. For example, spline B‑Trees store a single spline per leaf and can match or exceed learned index performance, while bucketized cuckoo hashing improves point indexes with minimal changes.

Overall, Learned Index Structures demonstrate the potential of machine learning in system design, though challenges such as over‑fitting, dynamic updates, and resource overhead remain. Solutions like online learning, delta‑indexes, and hybrid approaches are active research areas.

References:

Kraska et al., “The Case for Learned Index Structures”, SIGMOD 2018.

RMI implementation: https://github.com/learnedsystems/RMI


Machine Learning · Database · Index Structures · Learned Index · Staged Model
Written by OPPO Kernel Craftsman, sharing Linux kernel-related cutting-edge technology, technical articles, technical news, and curated tutorials.