Big Data 14 min read

How Elasticsearch Achieves Lightning‑Fast Search with Inverted Indexes and Compression

This article explains how Elasticsearch uses inverted indexes, term dictionaries, FST structures, and compression techniques like FOR and Roaring Bitmaps to dramatically speed up search queries over massive datasets while minimizing memory and disk usage.

21CTO

Sep 12, 2021

How Elasticsearch Achieves Lightning‑Fast Search with Inverted Indexes and Compression

This article focuses on how Elasticsearch (ES) quickly retrieves data, without covering distributed technology or API usage.

About Search

Traditional relational databases require a sequential scan, e.g.:

select name from poems where content like "%前%";

Such scans are inefficient for search scenarios, prompting the use of dedicated search engines like ES.

Search Engine Principles

Content crawling and stop‑word filtering

Tokenization to extract keywords

Building an inverted index

User query processing

ES is essentially a wrapper around Lucene, so the following discussion about inverted indexes reflects Lucene’s implementation.

Inverted Index

An inverted index maps terms (keywords) to a postings list —the set of document IDs containing that term. Document IDs are stored as integers, which can be efficiently compressed. Data is stored in segments within each shard; each segment can hold up to 2^31 documents, each assigned a unique ID.

Index Internal Structure

The term dictionary holds all terms in sorted order, enabling binary search. The term index is a trie (dictionary tree) that stores term prefixes, allowing fast location of a term’s dictionary offset. Lucene stores the term dictionary on disk in blocks with common‑prefix compression and keeps the term index in memory using a Finite State Transducer (FST), which offers low space usage and O(len(str)) lookup time.

Postings List Tricks

Compression (FOR – Frame of Reference) : Postings lists are sorted integer arrays, allowing delta‑encoding. For example, the ID list [73, 300, 302, 332, 343, 372] becomes [73, 227, 2, 30, 11, 29]; each delta fits in one byte. ES further groups documents into blocks of 256, encodes each block with the minimal bit width, and stores the bit‑width as a header.

For filter queries, ES uses Roaring Bitmaps to cache results. Bitmaps enable fast AND/OR operations but consume constant memory, so Roaring Bitmaps provide a compressed representation for sparse data.

Union Queries : When a filter cache is present, bitmaps are used directly for set operations. Without a cache, ES employs a skip‑list to traverse postings lists, skipping over irrelevant blocks and avoiding decompression of every block, thus reducing CPU cost.

Summary

ES uses inverted indexes to locate target documents quickly, trading higher space consumption for superior search performance.

Term dictionaries and FST‑based term indexes enable fast term lookup while minimizing memory and disk I/O.

FOR compression dramatically reduces postings‑list size.

Roaring Bitmaps accelerate filter queries with low memory overhead.

For union queries, bitmap operations are used when cached; otherwise, skip‑lists skip unnecessary traversals and decompression.

Elasticsearch Indexing Tips

Explicitly disable indexing for fields that do not need it.

Define non‑analyzed string fields to avoid unnecessary analysis.

Prefer predictable, low‑entropy IDs over random UUIDs for better query performance.

Author: Richard_Yi Source: juejin.cn/post/6889020742366920712

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Search Engine Lucene inverted index compression

Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.