How Elasticsearch Uses Lucene’s Inverted Index for Lightning‑Fast Search
This article explains how Elasticsearch leverages Lucene’s inverted index, detailing the structure of term dictionaries, postings lists, compression techniques like Frame‑of‑Reference and Roaring Bitmaps, and query optimizations such as filter caches and skip‑list intersections to achieve fast, memory‑efficient search.
"All problems in computer science can be solved by another level of indirection." – David J. Wheeler "计算机世界就是 trade-off 的艺术"
Preface
Recent projects I worked on use Elasticsearch (ES) for data storage and search analytics, so I studied ES and prepared this technical sharing.
This article does not focus on ES’s distributed architecture or API usage; it concentrates on how ES can retrieve data quickly, which was my main interest before learning ES.
1. About Search
Differences between traditional relational databases and ES
Search engine principles
2. About Search
Imagine a scenario where we need to search ancient poems containing the character "前".
Using a traditional RDBMS like MySQL, we would write a SQL query such as:
select name from poems where content like "%前%";This approach requires a sequential scan of all records, which is inefficient and does not meet typical search expectations (e.g., returning results for "A", "AB", "ABC", etc.). Hence dedicated search engines like ES exist.
Search Engine Principles
The basic steps of a search engine are:
Content crawling and stop‑word filtering
Tokenization and keyword extraction
Building an inverted index
User query processing
The core concept introduced here is the inverted index , which is the heart of ES.
ES is essentially a wrapper around Lucene; the implementation of the inverted index is provided by Lucene’s API.
3. Inverted Index
After building the inverted index, the earlier search requirement becomes a simple lookup: entering "前" directly locates matching poems.
1. Key Concepts
term
In ES, a keyword is called a term .
postings list
For the term "前", the postings list might be {静夜思, 望庐山瀑布}, representing the set of document IDs containing that term. ES stores these IDs as integers for efficient compression.
Each shard contains multiple segments; each segment can hold up to 2^31 documents, with IDs ranging from 0 to (2^31)-1.
All terminology is taken from the official ES documentation.
2. Internal Structure
In production, ES must handle massive numbers of terms. To locate a term quickly, ES builds a term dictionary sorted for binary search, similar to a B+ tree index in MySQL.
The term index is implemented as a Trie (or prefix tree), allowing fast navigation to a dictionary offset.
Lucene stores the term dictionary on disk in blocks with common prefix compression , and keeps the term index in memory using a Finite State Transducer (FST), which offers small space usage and O(len(str)) lookup time.
Overall inverted index structure:
4. Tricks for Postings List
Postings lists face two main challenges:
Uncompressed postings lists consume a lot of disk space.
Intersection queries need fast computation of AND/OR across multiple lists.
Compression is essential when dealing with millions of document IDs. ES uses two main techniques:
1. Compression
Frame of Reference
Lucene stores postings lists as ordered integer arrays, enabling delta‑encoding (incremental encoding). For example, the ID list [73, 300, 302, 332, 343, 372] becomes [73, 227, 2, 30, 11, 29], where each delta fits in a single byte.
ES further groups documents into blocks of 256, computes the minimal bit width needed for each block, and stores that width as a header. This technique is called Frame of Reference (FOR) .
After final bit‑level compression, integer types expand from the fixed 8/16/32/64‑bit to a flexible 1‑64‑bit range, dramatically reducing space.
Roaring Bitmaps (for filter cache)
Filters in ES return only a boolean match/no‑match result and can be cached. The filter cache stores the set of matching document IDs using a compressed bitmap.
For filter caches, Lucene uses Roaring Bitmaps , which split IDs into high‑16‑bit and low‑16‑bit parts, aggregating by high bits and storing low bits in either an ArrayContainer (if length < 4096) or a BitmapContainer (if length ≥ 4096). This approach balances memory usage and query speed.
2. Intersection Queries
If a filter cache is available, ES directly uses the cached bitmap to compute AND/OR operations.
Without a cache, ES traverses postings lists using a skip‑list algorithm: it selects the shortest list, then seeks matching IDs in the other lists, skipping over non‑matching ranges thanks to the FOR‑encoded blocks.
FOR encoding stores each block’s header, allowing the skip‑list to avoid decompressing irrelevant blocks and thus saving CPU cycles.
5. Summary
Technical summary:
ES uses an inverted index to locate target documents quickly, sacrificing some space for significant search performance gains.
Lucene’s term dictionary, term index, and postings list are compressed in memory with an FST, enabling fast term lookup.
FOR (Frame of Reference) compresses postings lists on disk, achieving notable space reduction.
Filter queries are cached using Roaring Bitmaps, providing high‑frequency filter speed while reducing storage.
When filter cache is unavailable, ES intersects postings lists via a skip‑list, skipping traversal and decompression costs.
Elasticsearch Indexing Approach
Move as much data from disk to memory as possible, reduce random disk reads, and combine various compression algorithms with a strict memory‑first mindset.
When using Elasticsearch, keep in mind:
Explicitly disable indexing on fields that do not need it; ES indexes fields by default.
For string fields that do not require analysis, also define them explicitly; analysis is enabled by default.
Choose IDs with regular patterns; highly random IDs (e.g., UUIDs) hinder query performance.
Technology choices always depend on business scenarios; each database solves specific problems with its own data structures and indexing strategies.
Hope this article brings you some insights.
https://www.elastic.co/cn/blog/frame-of-reference-and-roaring-bitmaps
https://www.elastic.co/cn/blog/found-elasticsearch-from-the-bottom-up
http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html
https://www.infoq.cn/article/database-timestamp-02
https://zhuanlan.zhihu.com/p/137574234
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
