Databases 17 min read

How Elasticsearch Achieves Lightning‑Fast Search with Inverted Indexes

This article explains how Elasticsearch leverages inverted indexes, term dictionaries, and advanced compression techniques like Frame of Reference and Roaring Bitmaps to enable rapid full‑text search, covering the underlying concepts, data structures, and query optimizations essential for high‑performance search applications.

Java Interview Crash Guide

Jul 2, 2021

How Elasticsearch Achieves Lightning‑Fast Search with Inverted Indexes

Preface

Recently I have been working on several projects that use Elasticsearch (ES) for data storage and search analysis, so I studied ES. This article is based on a technical sharing session.

The focus is not on ES's distributed architecture or API usage, but on how ES achieves fast retrieval, which was the part I was most interested in.

The article covers:

About search

Differences between traditional relational databases and ES

Search engine principles

Deep dive into inverted indexes

Structure of posting list → term dictionary → term index

Techniques for postings list (FOR, Roaring Bitmaps)

How to perform fast union/intersection queries

About Search

Consider a scenario: searching for ancient poems containing the character “前”.

How does the implementation differ between a relational database and ES?

In MySQL you would use a sequential scan:

select name from poems where content like "%前%";

This requires scanning all records, which is inefficient and does not support prefix matching.

Search engines like ES solve this with inverted indexes.

Search Engine Principles

The basic steps are:

Content crawling and stop‑word filtering

Tokenization and keyword extraction

Building an inverted index

User query processing

The core concept is the inverted index, implemented by Lucene.

Elasticsearch is essentially a wrapper around Lucene; the inverted index implementation relies on Lucene's API.

Inverted Index

After building the inverted index, a query for “前” can directly locate matching poems.

The detailed structure consists of term dictionary, term index, and postings list.

Key Concepts

term

A keyword in ES is called a term .

postings list

A postings list is the set of document IDs that contain a specific term, e.g. {静夜思, 望庐山瀑布}. IDs are stored as integers for efficient compression.

Each shard stores data in segments, each segment can hold up to 2^31 documents, with IDs ranging from 0 to (2^31‑1).

Index Internal Structure

To handle massive numbers of terms, ES uses a term dictionary (sorted list) and term index (a trie/FST) for fast lookup.

The term index is stored in memory using a Finite State Transducer (FST), while the term dictionary is block‑compressed with shared prefixes.

Space‑efficient due to prefix compression

Fast O(len(str)) lookup

Compression Techniques

Frame of Reference (FOR)

Postings lists are stored as ordered integer arrays and delta‑encoded. Blocks of 256 documents are encoded with the minimal number of bits needed, drastically reducing storage size.

Roaring Bitmaps

For filter caches, ES uses Roaring Bitmaps, which compress sparse bitmaps by grouping high‑order bits and using different containers (ArrayContainer or BitmapContainer) based on cardinality.

This reduces disk usage while allowing fast bitwise AND/OR operations.

Union/Intersection Queries

If a filter cache exists, the bitmap is used directly for AND/OR. Otherwise, a skip‑list traverses the compressed postings lists, selecting the shortest list and intersecting with others, skipping non‑matching IDs.

Skipping reduces both traversal and decompression costs, saving CPU.

Summary

ES uses inverted indexes to speed up document retrieval, trading higher space for much faster search.

Lucene stores the structure as term index → term dictionary → postings list, with FST compression for in‑memory lookup.

FOR compression dramatically reduces postings list size on disk.

Roaring Bitmaps cache filter results, balancing speed and storage.

Union queries use bitmaps when cached, otherwise skip lists to intersect postings lists efficiently.

Elasticsearch Indexing Tips

Explicitly disable indexing for fields that do not need it.

Define non‑analyzed string fields explicitly.

Avoid random IDs (e.g., UUID) that hinder query performance.

References

https://www.elastic.co/cn/blog/frame-of-reference-and-roaring-bitmaps

https://www.elastic.co/cn/blog/found-elasticsearch-from-the-bottom-up

http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html

https://www.infoq.cn/article/database-timestamp-02

https://zhuanlan.zhihu.com/p/137574234

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Search Engine Elasticsearch Lucene inverted index compression Postings List

Written by

Java Interview Crash Guide

Dedicated to sharing Java interview Q&A; follow and reply "java" to receive a free premium Java interview guide.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.