Databases 14 min read

Understanding Elasticsearch Fast Retrieval: Inverted Index, Postings List, and Compression Techniques

This article explains how Elasticsearch achieves rapid search by using inverted indexes, detailing the structure of posting lists, term dictionaries, compression methods like Frame‑of‑Reference and Roaring Bitmaps, and how these techniques enable efficient union queries and filter caching.

IT Architects Alliance

Sep 5, 2021

Understanding Elasticsearch Fast Retrieval: Inverted Index, Postings List, and Compression Techniques

Recently I have been working on several projects that use Elasticsearch (ES) for data storage and search analysis, and this article shares the knowledge I gathered during a technical talk, focusing on how ES can retrieve data quickly.

About Search

Consider a scenario where we need to find ancient poems containing the character "前". Using a traditional relational database such as MySQL, the query would be a sequential scan: select name from poems where content like "%前%"; This approach scans every record, which is inefficient for search workloads.

Search Engine Principles

Search engines typically follow four steps: crawling and stop‑word filtering, tokenization to extract keywords, building an inverted index, and finally matching user queries against the index.

Inverted Index

ES is essentially a wrapper around Lucene, and its inverted index consists of a term dictionary, a term index, and posting lists. The term dictionary maps terms to their IDs, the term index (implemented as an FST) quickly locates dictionary entries, and posting lists store the document IDs that contain each term.

Postings List Tricks

Posting lists can become large, so ES compresses them using techniques such as Frame‑of‑Reference (FOR) and Roaring Bitmaps. FOR encodes sorted integer IDs as deltas within blocks, reducing the number of bits needed per block. Roaring Bitmaps store sparse bitmaps efficiently by splitting IDs into high‑16‑bit keys and low‑16‑bit containers, choosing between ArrayContainer and BitmapContainer based on cardinality.

Union Queries

If a filter cache is available, ES uses bitmap operations (AND/OR) to compute intersections or unions of posting lists. Without a cache, a skip‑list algorithm traverses the compressed posting lists on disk, skipping blocks that cannot contain matching IDs, thus saving CPU cycles.

Summary

1. ES uses an inverted index to locate target documents quickly, trading higher space consumption for much faster search.

2. Term dictionaries and FST‑based term indexes enable fast term lookup while keeping memory usage low.

3. FOR compression dramatically reduces the storage size of posting lists.

4. Filter results are cached with Roaring Bitmaps, providing fast, low‑memory filter queries.

5. Union queries leverage bitmap operations when possible, otherwise they fall back to skip‑list traversal to avoid unnecessary decompression.

Elasticsearch Indexing Tips

Move as much data as possible into memory, minimize random disk reads, and apply aggressive compression. Explicitly disable indexing for fields that do not need it, define non‑analyzed string fields when appropriate, and prefer predictable IDs over highly random ones (e.g., avoid UUIDs) for better query performance.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Search Engine Elasticsearch Lucene inverted index compression Roaring Bitmap Postings List

Written by

IT Architects Alliance

Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.