Understanding Elasticsearch Fast Retrieval: Inverted Index, Postings List, and Compression Techniques
This article explains how Elasticsearch achieves rapid search by using inverted indexes, detailing the structure of posting lists, term dictionaries, compression methods like Frame‑of‑Reference and Roaring Bitmaps, and how these techniques enable efficient union queries and filter caching.
Recently I have been working on several projects that use Elasticsearch (ES) for data storage and search analysis, and this article shares the knowledge I gathered during a technical talk, focusing on how ES can retrieve data quickly.
About Search
Consider a scenario where we need to find ancient poems containing the character "前". Using a traditional relational database such as MySQL, the query would be a sequential scan:
select name from poems where content like "%前%";This approach scans every record, which is inefficient for search workloads.
Search Engine Principles
Search engines typically follow four steps: crawling and stop‑word filtering, tokenization to extract keywords, building an inverted index, and finally matching user queries against the index.
Inverted Index
ES is essentially a wrapper around Lucene, and its inverted index consists of a term dictionary, a term index, and posting lists. The term dictionary maps terms to their IDs, the term index (implemented as an FST) quickly locates dictionary entries, and posting lists store the document IDs that contain each term.
Postings List Tricks
Posting lists can become large, so ES compresses them using techniques such as Frame‑of‑Reference (FOR) and Roaring Bitmaps. FOR encodes sorted integer IDs as deltas within blocks, reducing the number of bits needed per block. Roaring Bitmaps store sparse bitmaps efficiently by splitting IDs into high‑16‑bit keys and low‑16‑bit containers, choosing between ArrayContainer and BitmapContainer based on cardinality.
Union Queries
If a filter cache is available, ES uses bitmap operations (AND/OR) to compute intersections or unions of posting lists. Without a cache, a skip‑list algorithm traverses the compressed posting lists on disk, skipping blocks that cannot contain matching IDs, thus saving CPU cycles.
Summary
1. ES uses an inverted index to locate target documents quickly, trading higher space consumption for much faster search.
2. Term dictionaries and FST‑based term indexes enable fast term lookup while keeping memory usage low.
3. FOR compression dramatically reduces the storage size of posting lists.
4. Filter results are cached with Roaring Bitmaps, providing fast, low‑memory filter queries.
5. Union queries leverage bitmap operations when possible, otherwise they fall back to skip‑list traversal to avoid unnecessary decompression.
Elasticsearch Indexing Tips
Move as much data as possible into memory, minimize random disk reads, and apply aggressive compression. Explicitly disable indexing for fields that do not need it, define non‑analyzed string fields when appropriate, and prefer predictable IDs over highly random ones (e.g., avoid UUIDs) for better query performance.
IT Architects Alliance
Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.