Understanding Elasticsearch Inverted Index: Posting Lists, Term Dictionary, and Compression Techniques
This article explains how Elasticsearch achieves fast search by using inverted indexes, detailing the structure of posting lists, term dictionaries, term indexes, and compression methods such as Frame of Reference and Roaring Bitmaps, as well as techniques for efficient union and intersection queries.
Recent projects often use Elasticsearch (ES) for storing and searching data. This article focuses on how ES can retrieve results quickly, rather than covering its distributed architecture or API usage.
1. Introduction
We start by comparing a simple search scenario—finding poems containing the character "前"—using a traditional relational database versus ES.
select name from poems where content like "%前%";In a relational database this requires a sequential scan of all rows, which is inefficient and does not support prefix matching (e.g., "A", "AB", "ABC"). Search engines like ES address these limitations.
Search Engine Principles
The search process can be summarized in four steps: crawling and stop‑word filtering, tokenization to extract keywords, building an inverted index, and processing user queries.
2. Inverted Index
The core of ES’s fast retrieval is the inverted index, which consists of three main components: postings list, term dictionary, and term index.
2.1 Key Concepts
Term
In ES a keyword is called a term .
Postings List
A postings list contains the IDs of all documents that contain a given term. For example, the term "前" might have a postings list represented as {静夜思, 望庐山瀑布} . Document IDs range from 0 to (2^31)-1 , with each segment holding up to 2^31 documents.
2.2 Internal Structure
The term dictionary stores all terms in sorted order, enabling binary search similar to a B+‑tree index in MySQL.
The term index is implemented as a trie (dictionary tree) and stored in memory as a Finite State Transducer (FST), allowing fast location of a term’s dictionary offset.
2.3 Compression Techniques
Frame of Reference (FOR)
Postings are divided into blocks of 256 documents. Within each block, IDs are delta‑encoded (each ID stored as the difference from the previous one) and the minimal bit‑width needed for the block is stored as a header. Example:
Original ID list: [73, 300, 302, 332, 343, 372]
Delta‑encoded list: [73, 227, 2, 30, 11, 29] . Since all values are < 255, each can be stored in a single byte.
Roaring Bitmaps (for filter cache)
For filter queries ES uses Roaring Bitmaps. Document IDs are split into high‑16‑bit and low‑16‑bit parts. High bits act as keys; low bits are stored in containers chosen based on cardinality:
ArrayContainer for len < 4096
BitmapContainer for larger lists
Threshold example: 2^16 = 65536 bits = 8 KB; storing 4 K values as 2‑byte integers also consumes 8 KB, so the container choice balances space.
2.4 Union / Intersection Queries
If a filter cache exists, ES directly uses bitmap AND/OR operations for fast set intersections/unions.
When no cache is available, ES traverses compressed postings lists using a skip‑list, which can jump over entire blocks, avoiding unnecessary decompression and reducing CPU cost.
3. Summary
Elasticsearch uses inverted indexes to locate target documents quickly, sacrificing memory for significant search speed gains.
Lucene stores the structure as term → term dictionary → postings list , compressing the term dictionary with FST.
FOR provides block‑level delta compression for postings, dramatically reducing disk usage.
Roaring Bitmaps cache filter results efficiently, balancing memory consumption and query speed.
Union queries leverage bitmap operations when possible; otherwise, skip‑list traversal minimizes decompression overhead.
Selected Java Interview Questions
A professional Java tech channel sharing common knowledge to help developers fill gaps. Follow us!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.