
Understanding the Underlying Working Principles of ElasticSearch

This article explains ElasticSearch’s architecture and core mechanisms—including its reliance on Lucene segments, inverted indexes, stored fields, document values, caching, shard routing, and scaling strategies—while answering common questions about wildcard matching, index compression, and memory usage.


Abstract

We walk through the underlying working principles of ElasticSearch from both top‑down and bottom‑up perspectives, aiming to answer why certain wildcard queries (e.g., foo-bar*) fail to match, why an index can actually shrink as more documents are added, and why ElasticSearch consumes a lot of memory.

Content Overview

ElasticSearch is built on top of Lucene. A cluster is made up of nodes (the white boxes in the diagram). An ElasticSearch index is divided into shards (the green boxes), which are distributed across the nodes; each shard is essentially a complete Lucene index.

Diagram of ElasticSearch

The diagrams illustrate the cluster, its nodes, the shards of an index, and their relationships.

Diagram of Lucene

Mini‑index – segment

Lucene stores data in many small segments, which can be viewed as mini‑indexes.

Segment internals

Inverted Index

Stored Fields

Document Values

Cache

Inverted Index

The inverted index consists of two parts: a sorted dictionary of terms (including term frequency) and postings that list the documents containing each term. When a query is issued, the term is looked up in the dictionary to retrieve matching documents.
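The two parts described above can be illustrated with a minimal Python sketch; the names are illustrative and this is not Lucene's actual on‑disk format:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Toy inverted index: a sorted term dictionary carrying document
    frequency, plus a postings list of matching doc IDs per term."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sorted dictionary: term -> (doc frequency, sorted postings list)
    return {term: (len(ids), sorted(ids))
            for term, ids in sorted(postings.items())}

docs = {1: "the sound and the fury", 2: "the fury", 3: "sound of silence"}
index = build_inverted_index(docs)
# Querying "fury" is a dictionary lookup followed by a postings scan.
print(index["fury"])  # (2, [1, 2])
```

A real term dictionary also records positions and offsets, but the lookup shape is the same: find the term, then walk its postings.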

Query example: "the fury"

Shows how the term is resolved in the inverted index.

Auto‑completion (prefix)

Binary search can quickly find terms that start with a given prefix, such as "choice" or "coming".
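The prefix lookup can be sketched with Python's `bisect` over a sorted term dictionary; this is a simplification of Lucene's term index, but it shows why prefix queries stay cheap:

```python
import bisect

terms = sorted(["choice", "coming", "choose", "church", "cinema", "the"])

def prefix_range(terms, prefix):
    """Binary-search the sorted dictionary for the contiguous run of
    terms that start with `prefix` -- two O(log n) probes, no scan."""
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")
    return terms[lo:hi]

print(prefix_range(terms, "cho"))  # ['choice', 'choose']
```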

Expensive lookup

Scanning the entire inverted index for a substring like "our" is costly.

Problem transformation

Possible solutions include suffix reversal, GEO hashing, and multi‑form numeric terms.
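The suffix-reversal idea turns a leading-wildcard query such as `*bar` into a cheap prefix query over a second, reversed dictionary. A minimal sketch (a linear scan here for brevity; a real index would binary-search the reversed dictionary as shown earlier):

```python
def index_terms(terms):
    # Store each term twice: as-is for prefix queries, and reversed
    # so that suffix (leading-wildcard) queries become prefix queries.
    forward = sorted(terms)
    reverse = sorted(t[::-1] for t in terms)
    return forward, reverse

def suffix_query(reverse_terms, suffix):
    # *bar over the forward dictionary == bar* over the reversed one.
    rev = suffix[::-1]
    return sorted(t[::-1] for t in reverse_terms if t.startswith(rev))

_, rev = index_terms(["foo-bar", "baz-bar", "foo-qux"])
print(suffix_query(rev, "bar"))  # ['baz-bar', 'foo-bar']
```

The same "transform the problem into a prefix lookup" trick underlies GEO hashing (nearby points share a prefix) and multi-precision numeric terms (range queries become a few prefix matches).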

Spelling correction

Spelling correction can be handled with a Levenshtein automaton: a state machine that accepts every term within a given edit distance of the query term, so the term dictionary can be intersected with it efficiently. A well‑known Python implementation builds this automaton as a tree‑based state machine.

Stored Fields

When you need the original content of a field (e.g., a title) rather than just which documents match, the inverted index is insufficient. Stored Fields provide a simple per‑document key‑value store; by default, ElasticSearch stores the entire original JSON document in a stored field called _source.
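A minimal sketch of the key‑value shape of stored fields; storing the full JSON in `_source` mirrors ElasticSearch's default, while the `store`/`fetch_title` helpers are hypothetical:

```python
import json

# Stored fields: a per-document key-value store, looked up by doc ID.
stored_fields = {}

def store(doc_id, doc):
    # ElasticSearch keeps the whole original JSON in _source by default;
    # individual fields can additionally be stored on their own.
    stored_fields[doc_id] = {"_source": json.dumps(doc),
                             "title": doc.get("title")}

def fetch_title(doc_id):
    # Retrieving a field is a direct lookup -- no inverted index involved.
    return stored_fields[doc_id]["title"]

store(1, {"title": "The Sound and the Fury", "author": "Faulkner"})
print(fetch_title(1))  # The Sound and the Fury
```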

Document Values

To support sorting, aggregation, and faceting without loading unnecessary data, Document Values store column‑oriented data optimized for same‑type fields. ElasticSearch can load all Document Values of a shard into memory, improving speed at the cost of memory usage.
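The column orientation is the whole point: all values of one field sit together, so sorting or faceting touches only that column. A small sketch with an invented `year` field:

```python
from collections import Counter

# Document values: one column per field, one slot per document.
doc_ids = [1, 2, 3, 4]
year_column = [1929, 1936, 1951, 1929]  # same-type values pack tightly

# Facet (aggregate) by year without loading any other field:
facets = Counter(year_column)
print(facets[1929])  # 2

# Sort documents by year using only this column:
order = sorted(doc_ids, key=lambda d: year_column[d - 1])
print(order)  # [1, 4, 2, 3]
```

Because every slot has the same type, the column compresses well and can be memory‑mapped or loaded wholesale, which is exactly the speed‑for‑memory trade‑off described above.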

Search Execution

During a search, Lucene scans all segments, merges results, and returns them to the client. Key characteristics:

Segments are immutable; deletions are marked, and updates are performed by re‑indexing.

Lucene aggressively compresses data and caches information to boost query performance.

Caching Story

When indexing a document, ElasticSearch refreshes the index every second by default, so new documents become searchable in near real time. Over time many small segments accumulate; ElasticSearch merges them in the background, which can actually reduce the index size, because merged segments compress better and deleted documents are finally purged.

Shard Search

Searching a shard mirrors Lucene segment search, but shards may reside on different nodes, requiring network transmission. A single query may involve multiple shard searches.

Scaling

Shards cannot be split further, but they can be moved across nodes. Because the number of primary shards is fixed at index creation, growing beyond it requires re‑indexing, so careful up‑front planning of node‑to‑shard allocation and replica configuration is essential.

Routing

Each node maintains the cluster's routing table, so any node can accept a request and forward it to the node holding the appropriate shard.
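Which shard a document belongs to is a pure function of its routing key, which is why any node can route a request. A sketch of the idea; real ElasticSearch uses a murmur3 hash of `_routing` (the document ID by default), with md5 standing in here:

```python
import hashlib

NUM_PRIMARY_SHARDS = 5  # fixed at index creation time

def shard_for(routing_key: str) -> int:
    # A stable hash of the routing key picks the shard. This is also
    # why the primary shard count cannot change without re-indexing:
    # a different modulus would send documents to different shards.
    h = int(hashlib.md5(routing_key.encode()).hexdigest(), 16)
    return h % NUM_PRIMARY_SHARDS

# The same document always lands on, and is fetched from, the same shard.
print(shard_for("doc-42"))
```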

Real Request Example

A sample request includes a filtered query with a multi_match clause and an aggregation that groups results by author to retrieve the top‑10 authors.
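A hedged sketch of such a request body as a Python dict; the field names (`title`, `body`, `status`, `author`) are illustrative, and the deprecated pre‑2.x `filtered` query is written in its modern bool/filter form:

```python
import json

request_body = {
    "query": {
        "bool": {
            # Scored full-text match across several fields:
            "must": {"multi_match": {"query": "fury",
                                     "fields": ["title", "body"]}},
            # Non-scoring filter, cacheable as a bitset:
            "filter": {"term": {"status": "published"}},
        }
    },
    # Bucket results by author and return the 10 largest buckets:
    "aggs": {"top_authors": {"terms": {"field": "author", "size": 10}}},
}

print(json.dumps(request_body, indent=2))
```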

The request can be dispatched to any node, which becomes the coordinator, determines target shards, selects available replicas, and routes the query.

Pre‑search processing

ElasticSearch converts the request into a Lucene query and executes it across all segments. Filter results are cached as reusable bitsets, but full query results are not cached; caching whole responses is left to the application.

Result Return

After execution, results travel back up the coordination path and are merged before being returned to the client.

Overall, the article provides a comprehensive visual and textual walkthrough of ElasticSearch’s architecture, its reliance on Lucene, data structures, search flow, caching, scaling, and practical query examples.

Tags: distributed systems · Big Data · Search Engine · Elasticsearch · Lucene
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
