How Elasticsearch Achieves Lightning‑Fast Search with Inverted Indexes
This article explains how Elasticsearch leverages inverted indexes, term dictionaries, and compression techniques such as Frame‑of‑Reference and Roaring Bitmaps to enable rapid full‑text search, detailing the underlying data structures, query processing, and practical indexing tips for efficient backend search implementations.
Recently I have been working on several projects that use Elasticsearch (ES) to store data and perform search analysis, and I put together this technical sharing. The article does not discuss ES's distributed architecture or API usage; it focuses on how ES achieves fast retrieval.
About Search
Consider a scenario where we need to find ancient poems that contain the character "前". In a traditional relational database like MySQL we would write a query such as:
select name from poems where content like "%前%";This sequential scan must examine every record, which is inefficient and does not meet typical search expectations.
Search engines were created to solve this problem.
Search Engine Principles
A search engine generally follows these steps:
Content crawling and stop‑word filtering (e.g., removing common particles like "的", "了")
Tokenization to extract keywords
Building an inverted index from the keywords
User enters keywords to perform a search
The inverted index is the core technique that makes ES fast.
Inverted Index
After constructing the inverted index, a query for "前" can directly locate the matching poems.
Key concepts:
term : the keyword; ES calls it a term.
postings list : the list of document IDs that contain a given term.
Document IDs are stored as integers because integers compress well.
Each shard is divided into many smaller segments ; each segment can hold up to 2³¹ documents, each assigned a unique ID from 0 to 2³¹‑1.
In practice, ES does not store a raw inverted index; it uses Lucene’s implementation, which adds several layers of optimization.
Term Dictionary and Term Index
To locate a term among millions, ES builds a term dictionary that is sorted and searched via binary search, similar to a B‑tree index in MySQL. The dictionary is stored on disk, while a term index (a Finite State Transducer, FST) resides in memory for fast look‑ups.
Postings List Tricks
Postings lists need to address two main challenges:
Space consumption if not compressed.
Efficient set operations (intersection, union) for combined queries.
Compression is essential when a postings list contains millions of doc IDs. ES uses Frame of Reference (FOR) to delta‑encode ordered integer arrays. For example, the ID list [73, 300, 302, 332, 343, 372] becomes the delta list [73, 227, 2, 30, 11, 29]; each delta fits in a single byte.
ES groups documents into blocks of 256 IDs, encodes each block separately, and stores a small header indicating the bit width needed for that block.
For filter queries, ES caches results using Roaring Bitmaps . A roaring bitmap splits the 32‑bit doc ID space into high‑16‑bit buckets; each bucket stores its low‑16‑bit values using either an ArrayContainer (for < 4096 entries) or a BitmapContainer (for larger sets).
Union Queries
If a filter cache is available, ES simply performs bitmap AND/OR operations. Without a cache, ES uses a skip‑list algorithm to intersect multiple postings lists, skipping over blocks that cannot contain matching IDs and thereby avoiding unnecessary decompression.
Summary
ES uses an inverted index (term → term dictionary → postings list) to locate documents quickly, trading higher memory usage for much faster search.
Lucene stores the index as "term index → term dictionary → postings list" and compresses the term dictionary with an FST.
FOR compression dramatically reduces the space needed for postings lists.
Filter results are cached with Roaring Bitmaps, providing fast set operations with modest memory overhead.
For combined queries, ES prefers bitmap operations when a filter cache exists; otherwise it uses a skip‑list based intersection that also skips block decompression.
Elasticsearch Indexing Tips
Explicitly disable indexing for fields that do not need it; ES indexes fields by default.
For string fields that do not require analysis, set them to "keyword" type and disable analysis.
Avoid highly random IDs (e.g., UUIDs); use monotonic or patterned IDs to improve shard routing.
Choosing the right index structure for the specific workload is essential; different data structures excel at different query patterns.
Author: Richard_Yi Source: juejin.cn/post/6889020742366920712
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
