How Elasticsearch Achieves Lightning‑Fast Search with Inverted Indexes and Compression
This article explains how Elasticsearch uses inverted indexes, term dictionaries, postings lists, and compression techniques like Frame‑of‑Reference and Roaring Bitmaps to deliver rapid, memory‑efficient search, and provides practical tips for indexing and query optimization.
Recently I have been working on projects that use Elasticsearch (ES) for data storage and search analytics, so I spent some time studying how it works. This article is a technical share focusing on how ES achieves fast retrieval.
About Search
Imagine a scenario where we need to find ancient poems containing the character "前". Using a traditional relational database we would write a SQL query like:
```sql
select name from poems where content like "%前%";
```
This query performs a sequential scan over every record, which is far too slow for search at scale.
Search Engine Principles
1. Content crawling and stop-word filtering.
2. Tokenization to extract keywords.
3. Building an inverted index from the keywords.
4. Search: the user enters keywords, which are matched against the index.
The core of this process is the inverted index, which ES implements via Lucene.
Inverted Index
After building the inverted index, a query for "前" can directly locate matching poems. The index consists of three main components:
Term: the keyword (called a term in ES).
Postings list: the sorted list of IDs of the documents that contain the term.
Term dictionary and term index: a sorted dictionary of all terms (often organized as a B-tree or trie), plus a smaller index over it that maps terms to their postings.
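The three components above can be illustrated with a minimal Python sketch. The corpus and doc IDs are made up, and each character stands in for a real tokenizer's output:

```python
from collections import defaultdict

# Hypothetical mini-corpus: doc IDs mapped to lines of ancient poems.
docs = {
    1: "床前明月光",
    2: "明月几时有",
    3: "前不见古人",
}

# Build the inverted index: each character acts as a term (a stand-in
# for real tokenization), mapped to a sorted postings list of the
# doc IDs that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text:
        index[term].add(doc_id)
postings = {term: sorted(ids) for term, ids in index.items()}

# A query for "前" now reads one postings list instead of
# scanning every document.
print(postings["前"])  # → [1, 3]
```

This is the essential trade the article describes: extra space for the index in exchange for a lookup that touches only the matching documents.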
Lucene stores the term dictionary on disk in blocks with prefix compression and keeps the term index in memory as a Finite State Transducer (FST) for fast lookup.
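The prefix compression applied inside on-disk term-dictionary blocks can be sketched as follows. This is a simplified illustration of the idea, not Lucene's actual encoding: each term is stored as the length of the prefix it shares with the previous term, plus its remaining suffix:

```python
def prefix_compress(sorted_terms):
    """Encode each term as (shared-prefix length, suffix) relative to
    the previous term -- the idea behind block prefix compression."""
    out = []
    prev = ""
    for term in sorted_terms:
        # Length of the common prefix with the previous term.
        n = 0
        while n < min(len(prev), len(term)) and prev[n] == term[n]:
            n += 1
        out.append((n, term[n:]))
        prev = term
    return out

def prefix_decompress(encoded):
    """Rebuild the original sorted term list."""
    terms, prev = [], ""
    for n, suffix in encoded:
        term = prev[:n] + suffix
        terms.append(term)
        prev = term
    return terms

terms = ["search", "searched", "searching", "seed", "select"]
assert prefix_decompress(prefix_compress(terms)) == terms
```

Because the dictionary is sorted, neighboring terms share long prefixes, so most entries shrink to a couple of characters; the in-memory FST exploits shared prefixes (and suffixes) even more aggressively.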
Techniques for Postings List
Postings lists need to be compressed and support fast union/intersection queries.
Compression (FOR)
Lucene compresses postings lists with Frame-of-Reference (FOR): because doc IDs are stored in sorted order, each ID is replaced by its delta from the previous one. IDs are grouped into blocks (e.g., 256 docs per block), and each block's header records the bit width needed for its largest delta, so every ID in the block is stored in just that many bits rather than a fixed 32.
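A minimal sketch of the encoding step, with made-up doc IDs and a tiny block size for readability. Real Lucene packs the deltas bit-by-bit into a byte stream; here we only compute the deltas and the per-block bit width:

```python
def for_encode(doc_ids, block_size=256):
    """Frame-of-Reference sketch: delta-encode a sorted postings list
    in blocks; each block records the bit width needed for its
    largest delta (the per-block header)."""
    blocks = []
    prev = 0
    for i in range(0, len(doc_ids), block_size):
        block = doc_ids[i:i + block_size]
        deltas = []
        for doc_id in block:
            deltas.append(doc_id - prev)  # gaps, not absolute IDs
            prev = doc_id
        width = max(d.bit_length() for d in deltas)
        blocks.append((width, deltas))
    return blocks

ids = [73, 300, 302, 332, 343, 372]
print(for_encode(ids, block_size=3))
# → [(8, [73, 227, 2]), (5, [30, 11, 29])]
```

Here six 32-bit IDs collapse into one block of 8-bit values and one of 5-bit values; the denser the postings list, the smaller the deltas and the better the compression.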
Roaring Bitmaps (Filter Cache)
For filter queries ES caches results as Roaring Bitmaps, which partition the doc-ID space into 65,536-value chunks and store each chunk as either a sorted array (when sparse) or a bitmap (when dense), giving compact storage and fast bitwise AND/OR operations.
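The chunk-splitting idea can be sketched in a few lines. This is a toy model of the data layout only (real roaring implementations also support run containers and optimized set operations):

```python
from collections import defaultdict

ARRAY_LIMIT = 4096  # roaring's threshold: ≤4096 values → sorted array

def roaring_build(doc_ids):
    """Roaring-bitmap sketch: split each 32-bit ID into a 16-bit chunk
    key (high bits) and a 16-bit offset (low bits); sparse chunks stay
    as sorted arrays, dense chunks become 8 KB bitmaps."""
    chunks = defaultdict(list)
    for doc_id in sorted(doc_ids):
        chunks[doc_id >> 16].append(doc_id & 0xFFFF)
    containers = {}
    for key, values in chunks.items():
        if len(values) <= ARRAY_LIMIT:
            containers[key] = ("array", values)
        else:
            bitmap = bytearray(8192)  # 65536 bits
            for v in values:
                bitmap[v >> 3] |= 1 << (v & 7)
            containers[key] = ("bitmap", bitmap)
    return containers

c = roaring_build([1, 2, 65536 + 7])
print({k: kind for k, (kind, _) in c.items()})  # → {0: 'array', 1: 'array'}
```

The 4096 cutoff is the break-even point: below it, 16-bit array entries are smaller than a fixed 8 KB bitmap; above it, the bitmap wins and supports constant-time membership tests.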
Union/Intersection
If a filter cache is present, the cached bitmaps are combined directly with bitwise AND/OR. Otherwise, ES intersects the postings lists using skip pointers, jumping over runs of non-matching IDs and avoiding the cost of decompressing entire blocks.
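The skip-based intersection can be sketched as follows. Here `bisect` stands in for skip pointers: instead of stepping through a postings list one ID at a time, each list jumps straight to the first ID greater than or equal to the current candidate (this is a simplification; Lucene's skip lists jump at block granularity):

```python
from bisect import bisect_left

def intersect(postings_lists):
    """Intersect sorted postings lists by advancing each list to the
    current candidate ID, driven by the shortest (rarest) list."""
    postings_lists = sorted(postings_lists, key=len)  # rarest term leads
    positions = [0] * len(postings_lists)
    result = []
    for candidate in postings_lists[0]:
        ok = True
        for i in range(1, len(postings_lists)):
            plist = postings_lists[i]
            # Jump forward to the first ID >= candidate.
            positions[i] = bisect_left(plist, candidate, positions[i])
            if positions[i] == len(plist) or plist[positions[i]] != candidate:
                ok = False
                break
        if ok:
            result.append(candidate)
    return result

a = [1, 3, 13, 101, 105, 108, 255, 256, 257]
b = [1, 101, 500, 1000]
print(intersect([a, b]))  # → [1, 101]
```

Driving the loop with the shortest list bounds the number of candidates, so the cost is dominated by the rarest term rather than the most common one.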
Summary
ES uses inverted indexes (term → dictionary → postings) to locate documents quickly, trading higher space usage for much faster search.
Lucene’s term index, term dictionary, and FST compression enable fast in‑memory lookups.
FOR compression dramatically reduces postings list size.
Roaring Bitmaps cache filter results, saving memory and speeding up frequent filters.
When a filter cache exists, bitwise operations handle unions/intersections; otherwise, skip‑list traversal reduces CPU cost.
Elasticsearch Indexing Tips
Explicitly disable indexing for fields that do not need it; ES indexes fields by default.
For string fields that do not require full-text analysis, map them as not_analyzed (the keyword type in modern ES versions); analysis is enabled by default.
Prefer predictable, low‑entropy IDs; random UUIDs hinder query performance.
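The first two tips can be illustrated with a hypothetical index mapping, expressed here as a Python dict ready to be sent as the mapping body (the index and field names are made up):

```python
# Hypothetical mapping for a "poems" index. "index": False disables
# indexing for a field only ever returned, not searched; "keyword"
# is the modern equivalent of not_analyzed for exact-match strings.
poems_mapping = {
    "mappings": {
        "properties": {
            "content": {"type": "text"},                  # analyzed, full-text search
            "author": {"type": "keyword"},                # exact match, no analysis
            "raw_source": {"type": "text", "index": False},  # stored, never searched
        }
    }
}
```

Every field you leave analyzed and indexed costs tokenization work at write time and index space on disk, so pruning the mapping is one of the cheapest optimizations available.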
Author: Richard_Yi (source: juejin.cn)
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation, accompanying you through your operations career as we grow together.