Understanding Elasticsearch Inverted Index: Fast Retrieval, Compression, and Query Techniques
This article explains how Elasticsearch uses inverted index structures (term dictionaries, term indexes, and postings lists) together with compression methods such as Frame‑of‑Reference and Roaring Bitmaps to search faster and store data more compactly than a scan in a traditional relational database, and how it evaluates multi‑condition (joint) queries efficiently.
Recent projects have used Elasticsearch (ES) for data storage and search, which prompted a closer look at how ES achieves fast retrieval. ES's distributed architecture and API usage are out of scope here.
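The article first contrasts a relational database scan with ES's inverted index approach. Its example is a SQL query that looks for the character 前 ("front") anywhere in a poem. Because the pattern starts with a wildcard, an ordinary B‑tree index typically cannot be used and the database has to scan every row:
SELECT name FROM poems WHERE content LIKE '%前%';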
It then outlines the basic steps of a search engine: crawling, stop‑word filtering, tokenization, building an inverted index, and query processing.
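The core of ES's search speed lies in its inverted index, which consists of a term index (implemented as a Finite State Transducer, FST), a term dictionary, and postings lists. The term index is a compact in‑memory structure over term prefixes that tells ES where to look in the sorted, on‑disk term dictionary. The dictionary entry for a term (keyword) then points to its postings list, the ordered list of IDs of the documents that contain the term. Postings lists are stored in compressed form.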
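The structure described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Lucene's implementation: the stop-word list, the sample documents, and the function names are all hypothetical, and a sorted list searched with `bisect` stands in for the FST-backed term index and term dictionary.

```python
import bisect
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "in"}  # hypothetical stop-word list


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Return (sorted term dictionary, mapping of term -> sorted postings list)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    term_dictionary = sorted(postings)
    return term_dictionary, {t: sorted(ids) for t, ids in postings.items()}


def lookup(term_dictionary, postings, term):
    """Binary-search the sorted term dictionary, then fetch the postings list."""
    i = bisect.bisect_left(term_dictionary, term)
    if i < len(term_dictionary) and term_dictionary[i] == term:
        return postings[term]
    return []


docs = {
    1: "The moonlight in front of the bed",
    2: "Frost on the ground",
    3: "Moonlight and frost",
}
term_dictionary, postings = build_inverted_index(docs)
print(lookup(term_dictionary, postings, "moonlight"))  # [1, 3]
print(lookup(term_dictionary, postings, "frost"))      # [2, 3]
```

A query now costs one dictionary lookup plus a read of a short postings list, instead of a pass over every document.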
Two main compression techniques are discussed:
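Frame‑of‑Reference (FOR) is used for postings lists on disk. The sorted doc IDs are first converted to deltas (the difference between each ID and the previous one), the deltas are split into fixed‑size blocks, and each block is bit‑packed using only as many bits per value as its largest delta needs. Small deltas therefore cost far less than a full integer each.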
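Roaring Bitmaps are used for filter caches held in memory. Doc IDs are partitioned into chunks by their upper 16 bits; a sparse chunk stores its lower 16 bits as a sorted array, while a dense chunk (more than roughly 4096 values) switches to a plain bitmap. This keeps memory usage low for sparse sets while still allowing fast bitwise operations on dense ones.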
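Both techniques can be illustrated with a small Python sketch. It is a simplified model, not Lucene's on-disk or in-memory format: the function names are hypothetical, the block size of 3 is chosen only to keep the example readable, and the byte count assumes one header byte per block to hold the bit width.

```python
from collections import defaultdict


def for_encode(doc_ids, block_size=3):
    """Delta-encode sorted doc IDs and record the bit width each block needs."""
    if not doc_ids:
        return []
    deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    blocks = []
    for i in range(0, len(deltas), block_size):
        block = deltas[i:i + block_size]
        bits = max(max(block).bit_length(), 1)  # bits per value in this block
        blocks.append((bits, block))
    return blocks


def for_decode(blocks):
    """Rebuild the original doc IDs by summing the deltas."""
    doc_ids, current = [], 0
    for _bits, block in blocks:
        for delta in block:
            current += delta
            doc_ids.append(current)
    return doc_ids


def for_size_bytes(blocks):
    """One header byte per block plus the bit-packed payload, rounded up."""
    return sum(1 + (bits * len(block) + 7) // 8 for bits, block in blocks)


ARRAY_LIMIT = 4096  # 4096 values * 2 bytes = 8192 bytes = one 65536-bit bitmap


def roaring_containers(doc_ids):
    """Group IDs by their upper 16 bits; pick array or bitmap per chunk."""
    chunks = defaultdict(list)
    for doc_id in doc_ids:
        chunks[doc_id >> 16].append(doc_id & 0xFFFF)
    containers = {}
    for key, lows in chunks.items():
        if len(lows) > ARRAY_LIMIT:
            bitmap = bytearray(8192)  # dense chunk: one bit per possible value
            for low in lows:
                bitmap[low >> 3] |= 1 << (low & 7)
            containers[key] = ("bitmap", bitmap)
        else:
            containers[key] = ("array", sorted(lows))  # sparse chunk
    return containers


ids = [73, 300, 302, 332, 343, 372]
blocks = for_encode(ids)
print(blocks)                  # [(8, [73, 227, 2]), (5, [30, 11, 29])]
print(for_size_bytes(blocks))  # 7, versus 24 bytes as six 4-byte integers
print(roaring_containers([1000, 62101, 131385, 132052, 191173, 196658]))
```

In the FOR example the first block needs 8 bits per delta and the second only 5. In the Roaring example all chunks are sparse, so each is kept as a short sorted array keyed by the upper 16 bits.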
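For joint queries that combine several conditions with AND, ES first checks whether the filters are cached as bitmaps; if so, the result is a bitwise AND of those bitmaps. If not, it intersects the postings lists on disk using skip lists: it walks the lists together and jumps over blocks that cannot contain a match, so those blocks never have to be decompressed.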
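The intersection step can be sketched as follows. This is an illustration of the idea only: the function names are hypothetical, and a binary search via `bisect` stands in for the skip pointers, which in a real index would jump over whole compressed blocks.

```python
import bisect


def advance(postings, start, target):
    """Index of the first entry >= target, searching only from `start` onward."""
    return bisect.bisect_left(postings, target, start)


def intersect(lists):
    """Intersect sorted postings lists, driving from the shortest list."""
    if not lists:
        return []
    lists = sorted(lists, key=len)
    shortest, others = lists[0], lists[1:]
    positions = [0] * len(others)  # current cursor in each longer list
    result = []
    for doc_id in shortest:
        matched = True
        for i, other in enumerate(others):
            # Skip ahead instead of visiting every entry in between.
            positions[i] = advance(other, positions[i], doc_id)
            if positions[i] == len(other):
                return result  # one list is exhausted: no more matches possible
            if other[positions[i]] != doc_id:
                matched = False
                break
        if matched:
            result.append(doc_id)
    return result


a = [2, 13, 17, 20, 98]
b = [1, 13, 22, 35, 98, 99]
c = [1, 3, 13, 20, 35, 80, 98]
print(intersect([a, b, c]))  # [13, 98]
```

Driving from the shortest list bounds the number of probes by its length, and each cursor only moves forward, so entries that are skipped are never examined.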
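Practical ES indexing tips follow from this: explicitly disable indexing for fields that are never searched, declare string fields that do not need full‑text analysis as non‑analyzed, and prefer IDs with a predictable pattern over random UUIDs, since random values compress poorly and are slower to look up.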
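As a rough illustration of the first two tips, an index mapping might look like the Python dictionary below. It assumes the mapping syntax of Elasticsearch 7 and later, where `keyword` is the non-analyzed string type and `"index": false` disables indexing; older releases expressed the same ideas with `"index": "not_analyzed"` and `"index": "no"` on `string` fields. The field names are hypothetical.

```python
import json

# Hypothetical index body; only the mapping options illustrate the tips.
index_body = {
    "mappings": {
        "properties": {
            # Full-text field: analyzed and tokenized into terms.
            "title": {"type": "text"},
            # Exact-match string: stored as a single term, no analysis.
            "status": {"type": "keyword"},
            # Returned in results but never searched: skip indexing entirely.
            "raw_payload": {"type": "keyword", "index": False},
        }
    }
}

print(json.dumps(index_body, indent=2))
```

The serialized JSON is what would be sent as the request body when creating the index.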
In summary, ES builds on Lucene's inverted index, where a lookup goes from the term index to the term dictionary to the postings list. The term index is kept small with an FST, postings lists are compressed in blocks with FOR, and filter results are cached as Roaring Bitmaps. Together these deliver high‑performance search while keeping memory and disk usage under control.
IT Architects Alliance