Big Data 16 min read

How Elasticsearch Achieves Lightning‑Fast Search with Inverted Indexes and Compression

This article explains how Elasticsearch uses inverted indexes, term dictionaries, and advanced compression techniques like Frame of Reference and Roaring Bitmaps to enable rapid, scalable search over massive datasets, contrasting its approach with traditional relational database queries and detailing practical optimization tips.

Java Interview Crash Guide
Java Interview Crash Guide
Java Interview Crash Guide
How Elasticsearch Achieves Lightning‑Fast Search with Inverted Indexes and Compression

Introduction

Recently I have been working on projects that use Elasticsearch (ES) for data storage and search analytics, so I studied ES. This article is based on a technical sharing session.

What Will Be Covered

Differences between traditional relational databases and ES

Search engine principles

Inverted index details

Postings list tricks

Union queries

Search Basics

Example: searching ancient poems containing the character “前”. Using a relational database would require a sequential scan like:

select name from poems where content like "%前%";

This approach scans all rows and cannot efficiently return partial matches such as “A”, “AB”, “ABC”. Hence dedicated search engines like ES are needed.

Search Engine Principles

The typical steps are:

Content crawling and stop‑word filtering

Tokenization and keyword extraction

Building an inverted index

User query processing

The inverted index is the core of ES, implemented via Lucene.

Inverted Index

Using the earlier poem example, the inverted index allows direct lookup of documents containing the term “前”. The index consists of a term dictionary, term index, and postings list.

Key Concepts

Term

A keyword in ES is called a term .

Postings List

A postings list is the set of document IDs that contain a given term, e.g. {静夜思, 望庐山瀑布}. IDs are stored as integers for efficient compression.

Term Dictionary and Term Index

All terms are sorted and stored in a term dictionary . The term index (a trie/FST) maps term prefixes to dictionary offsets, enabling fast binary search similar to B‑tree indexes.

Internal Structure of the Index

Lucene compresses the term dictionary using block storage with common‑prefix compression and stores the term index in memory as a Finite State Transducer (FST), which is space‑efficient and provides O(len) lookup.

Postings List Optimizations

Two main techniques are used:

Frame of Reference (FOR) – documents are divided into blocks (e.g., 256 docs per block) and each block stores delta‑encoded IDs with a header indicating the bit width.

Roaring Bitmaps – used for filter caches; they store dense bitmaps efficiently and support fast AND/OR operations.

FOR Compression Example

Given IDs [73, 300, 302, 332, 343, 372], delta encoding yields [73, 227, 2, 30, 11, 29], allowing each value to fit in a single byte.

Roaring Bitmap Compression

Doc IDs are split into high 16‑bit and low 16‑bit parts. High bits are aggregated, and low bits are stored using either an ArrayContainer (for < 4096 entries) or a BitmapContainer (otherwise), reducing space.

Union Queries

If a filter cache exists, ES uses bitmap operations to compute intersections/unions. Without a cache, a skip‑list algorithm traverses postings lists block by block, skipping non‑matching IDs and avoiding decompression of irrelevant blocks.

Summary

ES uses inverted indexes to achieve fast document retrieval, trading higher space usage for query speed.

Lucene’s term dictionary, term index (FST), and postings list form the backbone of the index.

FOR compression dramatically reduces postings list size on disk.

Roaring Bitmaps provide efficient in‑memory filter caching.

Union queries leverage bitmap operations when cached, otherwise use skip‑list traversal to minimize CPU and I/O.

Practical Tips for Using Elasticsearch

Explicitly disable indexing for fields that do not need it; ES indexes fields by default.

For string fields that do not require analysis, set them as not_analyzed.

Avoid highly random IDs (e.g., UUIDs); use monotonic or patterned IDs for better performance.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

search engineElasticsearchluceneinverted indexcompressionPostings List
Java Interview Crash Guide
Written by

Java Interview Crash Guide

Dedicated to sharing Java interview Q&A; follow and reply "java" to receive a free premium Java interview guide.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.