How Elasticsearch Achieves Lightning‑Fast Search with Inverted Indexes and Compression
This article explains how Elasticsearch uses inverted indexes, term dictionaries, FST structures, and compression techniques like FOR and Roaring Bitmaps to dramatically speed up search queries over massive datasets while minimizing memory and disk usage.
This article focuses on how Elasticsearch (ES) quickly retrieves data, without covering distributed technology or API usage.
About Search
Traditional relational databases require a sequential scan, e.g.:
select name from poems where content like "%前%";Such scans are inefficient for search scenarios, prompting the use of dedicated search engines like ES.
Search Engine Principles
Content crawling and stop‑word filtering
Tokenization to extract keywords
Building an inverted index
User query processing
ES is essentially a wrapper around Lucene, so the following discussion about inverted indexes reflects Lucene’s implementation.
Inverted Index
An inverted index maps terms (keywords) to a postings list —the set of document IDs containing that term. Document IDs are stored as integers, which can be efficiently compressed. Data is stored in segments within each shard; each segment can hold up to 2^31 documents, each assigned a unique ID.
Index Internal Structure
The term dictionary holds all terms in sorted order, enabling binary search. The term index is a trie (dictionary tree) that stores term prefixes, allowing fast location of a term’s dictionary offset. Lucene stores the term dictionary on disk in blocks with common‑prefix compression and keeps the term index in memory using a Finite State Transducer (FST), which offers low space usage and O(len(str)) lookup time.
Postings List Tricks
Compression (FOR – Frame of Reference) : Postings lists are sorted integer arrays, allowing delta‑encoding. For example, the ID list [73, 300, 302, 332, 343, 372] becomes [73, 227, 2, 30, 11, 29]; each delta fits in one byte. ES further groups documents into blocks of 256, encodes each block with the minimal bit width, and stores the bit‑width as a header.
For filter queries, ES uses Roaring Bitmaps to cache results. Bitmaps enable fast AND/OR operations but consume constant memory, so Roaring Bitmaps provide a compressed representation for sparse data.
Union Queries : When a filter cache is present, bitmaps are used directly for set operations. Without a cache, ES employs a skip‑list to traverse postings lists, skipping over irrelevant blocks and avoiding decompression of every block, thus reducing CPU cost.
Summary
ES uses inverted indexes to locate target documents quickly, trading higher space consumption for superior search performance.
Term dictionaries and FST‑based term indexes enable fast term lookup while minimizing memory and disk I/O.
FOR compression dramatically reduces postings‑list size.
Roaring Bitmaps accelerate filter queries with low memory overhead.
For union queries, bitmap operations are used when cached; otherwise, skip‑lists skip unnecessary traversals and decompression.
Elasticsearch Indexing Tips
Explicitly disable indexing for fields that do not need it.
Define non‑analyzed string fields to avoid unnecessary analysis.
Prefer predictable, low‑entropy IDs over random UUIDs for better query performance.
Author: Richard_Yi Source: juejin.cn/post/6889020742366920712
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
