Can Machine Learning Replace Traditional Hashing? A Deep Dive into Learned Indexes

This article explores the evolution of indexing from classic hash tables and B‑Trees to learned index structures, explaining hash functions, collision handling, machine‑learning fundamentals, and the promise and limits of using ML models to improve memory efficiency and query performance.

ITPUB

Jun 19, 2018

What Is an Index?

Indexes speed up data retrieval by mapping keys to physical storage locations, a concept that dates back to ancient library catalogs and carries through to modern database systems. In a computer, an index translates a key into the address where the corresponding data resides, enabling fast look-ups.

What Is a Hash Table?

A hash table uses a hash function to convert a key into an integer (hash code) that determines the array slot where the value is stored. Insertion stores a key‑value pair; lookup recomputes the hash code and accesses the array position.
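As a minimal sketch of that insert/lookup cycle, the class below stores integer keys in a fixed-size array. The name `SimpleHashTable`, its methods, and the multiplier 13 are illustrative assumptions, and collisions are deliberately rejected rather than resolved, since collision handling is a separate topic.

```javascript
// Minimal sketch: a fixed-size hash table for non-negative integer keys.
// SimpleHashTable and its method names are illustrative, not from the article.
class SimpleHashTable {
  constructor(size = 16) {
    this.size = size;
    this.slots = new Array(size).fill(null); // each slot holds {key, value} or null
  }

  // Hash function: maps a key to a slot index in [0, size).
  hash(key) {
    return (key * 13) % this.size;
  }

  // Insertion: compute the slot and store the key-value pair there.
  set(key, value) {
    const i = this.hash(key);
    const entry = this.slots[i];
    if (entry !== null && entry.key !== key) {
      // This sketch does not resolve collisions; it only reports them.
      throw new Error(`collision: keys ${entry.key} and ${key} both map to slot ${i}`);
    }
    this.slots[i] = { key, value };
  }

  // Lookup: recompute the slot and confirm the stored key matches.
  get(key) {
    const entry = this.slots[this.hash(key)];
    return entry !== null && entry.key === key ? entry.value : undefined;
  }
}

const simpleTable = new SimpleHashTable(16);
simpleTable.set(3, "three"); // 3 * 13 % 16 = 7
simpleTable.set(5, "five");  // 5 * 13 % 16 = 1
console.log(simpleTable.get(3)); // prints: three
console.log(simpleTable.get(4)); // prints: undefined (never inserted)
```

Storing the key next to the value matters: without it, a lookup could not tell whether the slot holds the requested key or a different key with the same hash code.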

Performance Considerations of Hashing

Hash collisions occur when different keys produce the same hash code. Consider a simple hash function that maps an integer key to a slot in an array of a given size:

function hashFunction(key, sizeOfArray) { return (key * 13) % sizeOfArray; }

Choosing a good hash function reduces collisions. For a table of size 16, compare two functions:

function hash_a(key) { return (13 * key) % 16; }

function hash_b(key) { return (4 * key) % 16; }

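Running a script over keys 0-31 shows that hash_a spreads the keys evenly across all 16 slots (two keys per slot, 16 collisions in total), while hash_b uses only 4 slots (0, 4, 8, and 12) and creates 28 collisions, because the multiplier 4 shares a factor with the table size.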
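The script behind these numbers is not shown in the article. A minimal reconstruction, assuming a collision is counted each time a key lands in a slot that an earlier key already used (`countCollisions` is a hypothetical helper), might look like this:

```javascript
function hash_a(key) { return (13 * key) % 16; }
function hash_b(key) { return (4 * key) % 16; }

// Counts how many keys in 0..numKeys-1 land in a slot already taken
// by an earlier key, and how many distinct slots end up in use.
function countCollisions(hashFn, numKeys) {
  const seen = new Set();
  let collisions = 0;
  for (let key = 0; key < numKeys; key++) {
    const slot = hashFn(key);
    if (seen.has(slot)) collisions++;
    else seen.add(slot);
  }
  return { collisions, slotsUsed: seen.size };
}

const resultA = countCollisions(hash_a, 32);
const resultB = countCollisions(hash_b, 32);
console.log(resultA); // { collisions: 16, slotsUsed: 16 }
console.log(resultB); // { collisions: 28, slotsUsed: 4 }
```

With 32 keys and 16 slots, 16 collisions is the minimum any function can achieve, so hash_a is as good as possible on this input.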

Typical collision‑resolution strategies are:

Chaining: each array slot holds a linked list of the entries that hash to the same bucket.

Linear probing: on a collision, the algorithm steps through the following slots one by one until it finds a free one.

The accompanying images illustrate how chaining produces longer lists as collisions accumulate, while linear probing fills consecutive slots; the two layouts differ in how well they use the CPU cache.
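The two strategies can be sketched side by side. This is a simplified illustration under stated assumptions: the class names are hypothetical, a JavaScript array stands in for the linked list, keys are non-negative integers, and deletion is left out (linear probing would need tombstones to support it).

```javascript
// Shared example hash, based on the article's multiplier of 13.
const slotFor = (key, size) => (13 * key) % size;

// Chaining: each slot holds a list of [key, value] pairs.
class ChainedTable {
  constructor(size = 8) {
    this.buckets = Array.from({ length: size }, () => []);
  }
  set(key, value) {
    const bucket = this.buckets[slotFor(key, this.buckets.length)];
    const pair = bucket.find(([k]) => k === key);
    if (pair) pair[1] = value;      // key already present: overwrite
    else bucket.push([key, value]); // otherwise append to the chain
  }
  get(key) {
    const bucket = this.buckets[slotFor(key, this.buckets.length)];
    const pair = bucket.find(([k]) => k === key);
    return pair ? pair[1] : undefined;
  }
}

// Linear probing: on a collision, step to the next slot (wrapping around)
// until a free slot or the key itself is found.
class ProbingTable {
  constructor(size = 8) {
    this.keys = new Array(size).fill(null);
    this.values = new Array(size).fill(null);
  }
  set(key, value) {
    const n = this.keys.length;
    let i = slotFor(key, n);
    for (let step = 0; step < n; step++, i = (i + 1) % n) {
      if (this.keys[i] === null || this.keys[i] === key) {
        this.keys[i] = key;
        this.values[i] = value;
        return i; // the slot that was actually used
      }
    }
    throw new Error("table is full");
  }
  get(key) {
    const n = this.keys.length;
    let i = slotFor(key, n);
    for (let step = 0; step < n; step++, i = (i + 1) % n) {
      if (this.keys[i] === null) return undefined; // an empty slot ends the probe
      if (this.keys[i] === key) return this.values[i];
    }
    return undefined;
  }
}

// Keys 1, 9 and 17 all hash to slot 5 in a table of size 8.
const chained = new ChainedTable(8);
const probing = new ProbingTable(8);
for (const key of [1, 9, 17]) chained.set(key, `v${key}`);
const usedSlots = [1, 9, 17].map((key) => probing.set(key, `v${key}`));
console.log(chained.buckets[5].length); // 3
console.log(usedSlots);                 // [ 5, 6, 7 ]
```

The output shows the trade-off: chaining keeps all three entries in one bucket and lengthens its list, while linear probing spills them into neighboring slots.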

Machine‑Learning Basics

Machine learning builds statistical models that map input vectors to output labels or values. Examples include predicting university admission (inputs: GPA, SAT, etc.) or mortgage default risk (inputs: credit score, income, etc.). Models are trained on data and then used for inference.

Unlike handcrafted algorithms, ML models automatically learn patterns from raw data, though they can overfit to training data and may not generalize well.
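To make the train-then-infer cycle concrete, here is a minimal sketch of about the simplest model there is: a straight line fitted by ordinary least squares. The data points are made up for illustration, and `fitLine` and `predict` are hypothetical helper names.

```javascript
// Training: fit y ≈ slope * x + intercept by ordinary least squares.
function fitLine(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let covariance = 0;
  let varianceX = 0;
  for (let i = 0; i < n; i++) {
    covariance += (xs[i] - meanX) * (ys[i] - meanY);
    varianceX += (xs[i] - meanX) ** 2;
  }
  const slope = covariance / varianceX;
  return { slope, intercept: meanY - slope * meanX };
}

// Inference: apply the trained model to a new input.
function predict(model, x) {
  return model.slope * x + model.intercept;
}

// Made-up training data (one input feature, one output value).
const trainX = [1, 2, 3, 4, 5];
const trainY = [3.1, 4.9, 7.2, 8.8, 11.0];

const lineModel = fitLine(trainX, trainY);
console.log(lineModel);             // slope ≈ 1.97, intercept ≈ 1.09
console.log(predict(lineModel, 6)); // ≈ 12.91, for an input not in the training set
```

Real models use many input features and far more parameters, but the two phases are the same: parameters are estimated from data once, then reused for every prediction.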

Viewing an Index as a Model

Researchers at Google and MIT proposed treating an index as a learned model: the model takes a query key as input and outputs the memory address where the data resides. This reframes indexing as a prediction problem.

Two challenges set this apart from typical ML tasks, which tolerate some error: an index must ultimately return the exact address of the data, and it must behave correctly for keys the model has never seen.
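One way to reconcile an approximate model with exact look-ups is to record the model's worst-case error at build time and then search only within that window around each prediction. The sketch below assumes a sorted array of numeric keys (so the "address" is an array position) and a deliberately crude model, a straight line through the smallest and largest key. The function names are illustrative, not an API from the original research.

```javascript
// Build: "train" a linear model key -> position and record its worst error.
function buildLearnedIndex(sortedKeys) {
  const n = sortedKeys.length;
  const minKey = sortedKeys[0];
  const maxKey = sortedKeys[n - 1];
  // Model: a straight line from (minKey, 0) to (maxKey, n - 1).
  const slope = maxKey === minKey ? 0 : (n - 1) / (maxKey - minKey);
  const predictPosition = (key) => {
    const guess = Math.round((key - minKey) * slope);
    return Math.min(n - 1, Math.max(0, guess)); // clamp to a valid position
  };
  // Largest gap between predicted and true position over all indexed keys.
  let maxError = 0;
  sortedKeys.forEach((key, pos) => {
    maxError = Math.max(maxError, Math.abs(predictPosition(key) - pos));
  });
  return { sortedKeys, predictPosition, maxError };
}

// Lookup: predict, then binary-search only inside the error window.
function learnedLookup(index, key) {
  const { sortedKeys, predictPosition, maxError } = index;
  const guess = predictPosition(key);
  let lo = Math.max(0, guess - maxError);
  let hi = Math.min(sortedKeys.length - 1, guess + maxError);
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (sortedKeys[mid] === key) return mid;
    if (sortedKeys[mid] < key) lo = mid + 1;
    else hi = mid - 1;
  }
  return -1; // key is not in the index
}

const fibKeys = [2, 3, 5, 8, 13, 21, 34, 55, 89, 144];
const learnedIndex = buildLearnedIndex(fibKeys);
console.log(learnedLookup(learnedIndex, 34)); // 6
console.log(learnedLookup(learnedIndex, 4));  // -1 (unseen key)
```

Every indexed key is guaranteed to lie inside its window because the bound was measured over exactly those keys, and an unseen key simply fails the final comparison. The better the model fits the key distribution, the smaller the window and the cheaper the search.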

Learning Hash Functions

Replacing traditional hash functions with ML‑trained models can exploit data distribution, reducing space waste. Experiments show that learned hash functions achieve higher memory utilization (≈50% improvement) at the cost of slower computation and a training phase.
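One construction described in the learned-index literature is to model the cumulative distribution of the keys and scale it by the table size, so that slot = CDF(key) × tableSize. The sketch below uses the crudest possible "model", a straight-line CDF fitted from the smallest and largest key, and a key set contrived to make the effect obvious. The helper names are hypothetical, and the result says nothing about real workloads.

```javascript
// "Training": fit a straight-line CDF from the observed key range and
// return a hash function slot = floor(CDF(key) * tableSize).
function fitLinearCdfHash(keys, tableSize) {
  const minKey = Math.min(...keys);
  const maxKey = Math.max(...keys);
  const span = maxKey - minKey + 1;
  return (key) => {
    const slot = Math.floor(((key - minKey) * tableSize) / span);
    return Math.min(tableSize - 1, Math.max(0, slot)); // clamp unseen keys
  };
}

// Baseline: a data-independent modulo hash.
const moduloHash = (tableSize) => (key) => key % tableSize;

// Number of slots that stay unused after hashing every key.
function emptySlots(keys, hashFn, tableSize) {
  const used = new Set(keys.map(hashFn));
  return tableSize - used.size;
}

// Contrived data: 100 keys that are all multiples of 100.
const skewedKeys = Array.from({ length: 100 }, (_, i) => 100 * i);
const tableSize = 100;

const learnedEmpty = emptySlots(skewedKeys, fitLinearCdfHash(skewedKeys, tableSize), tableSize);
const moduloEmpty = emptySlots(skewedKeys, moduloHash(tableSize), tableSize);
console.log(learnedEmpty); // 0: every slot is used exactly once
console.log(moduloEmpty);  // 99: every key lands in slot 0
```

The learned function wins here only because it was fitted to this particular distribution; on keys that are already uniformly spread, a conventional hash does just as well without any training.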

Cuckoo Hashing

Cuckoo hashing, introduced in 2001, resolves collisions by maintaining two tables, each with its own hash function. When a new key collides with an existing entry, the existing entry is displaced ("evicted") to its slot in the alternate table, possibly triggering further evictions. If the chain of evictions exceeds a threshold, which can happen when the evictions form a cycle, the table is rebuilt with new hash functions.

The accompanying images depict the eviction process and illustrate the claim that cuckoo hashing remains efficient even at high load factors (up to 99%).

What’s Next for Indexing?

Learned indexes show promise, especially as ML hardware (e.g., TPUs) accelerates training and inference. However, classic algorithms like cuckoo hashing still offer strong performance without the overhead of model training. Future database systems may blend learned components with proven data‑structure techniques to push efficiency boundaries.