How Elasticsearch Achieves Near Real-Time Search: Core Mechanisms Explained

This article explains how Elasticsearch implements near real-time search by using immutable inverted indexes, segment merging, shard distribution, and a translog for durability, detailing the challenges of persistence, disk I/O, and data recovery in a distributed environment.

Programmer DD

Nov 4, 2021

01 Near Real-Time Search

1.1 Real-Time vs Near Real-Time

Real-time search means that a newly written document is searchable the moment the write completes. Near real-time search relaxes this: the document becomes searchable after a short delay, which in Elasticsearch is about one second by default.

1.2 Challenges of Near Real-Time

Even on a single node, near real-time search involves a trade-off: data must be persisted to disk for safety, yet served from memory or cache to be fast. A distributed system such as Elasticsearch must, in addition, build the internal data structures for full-text search while keeping the data persistent across nodes. Reconciling persistence with fast visibility of new data is the core problem Elasticsearch solves.

02 How Elasticsearch Works

2.1 Immutable Data Structures

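Shared mutable data is a major source of difficulty in concurrent programming; functional programming sidesteps it by making data immutable. Elasticsearch takes the same approach by building on Lucene's inverted index, which is immutable once written. Because it never changes, it can be read concurrently without locks and cached safely. The index maps each term to the documents that contain it and stores the statistics used for scoring, such as term frequency, document length, and term positions.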
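
To make the idea concrete, here is a minimal sketch of an immutable inverted index in Python. It is an illustration only, not Lucene's on-disk format: the function name build_inverted_index, the whitespace tokenizer, and the sample documents are all assumptions made for this example.

```python
from collections import defaultdict
from types import MappingProxyType


def build_inverted_index(docs):
    """Return a read-only mapping: term -> {doc_id: (positions, ...)}."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for position, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(position)
    # Freeze the result: tuples for positions, read-only views for mappings.
    frozen = {
        term: MappingProxyType({doc_id: tuple(pos) for doc_id, pos in by_doc.items()})
        for term, by_doc in postings.items()
    }
    return MappingProxyType(frozen)


# Hypothetical sample documents.
docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick quick dog"}
index = build_inverted_index(docs)

# Term frequency per document is the number of recorded positions.
quick_tf = {doc_id: len(positions) for doc_id, positions in index["quick"].items()}

# Document length is the token count of each document.
doc_lengths = {doc_id: len(text.split()) for doc_id, text in docs.items()}
```

Once built, the structure rejects writes (assigning to index["cat"] raises a TypeError), so any number of readers can share it without locking.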

2.2 From Immutable to Mutable

An immutable index cannot be modified in place, so Lucene handles change by adding. Newly added documents are written to a new immutable inverted index called a Segment, and later additions create further segments (incremental saves). Deletions and updates are handled with logical markers: a delete list (del) marks documents as removed so that they are filtered out of search results, and an update marks the old version as deleted and stores the new version as a new document in a new segment. The collection of segments forms a Lucene "Index".
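
The append-and-mark scheme can be sketched in a few lines of Python. This is a toy model under stated assumptions: the classes Segment and SegmentedIndex are hypothetical names, documents are plain strings, and search is a simple token match rather than a real inverted-index lookup.

```python
class Segment:
    """A write-once batch of documents (toy stand-in for a Lucene segment)."""

    def __init__(self, docs):
        self._docs = dict(docs)   # never modified after construction
        self.deleted = set()      # delete markers: the only mutable part

    def contains(self, doc_id):
        return doc_id in self._docs

    def search(self, term):
        return [
            doc_id
            for doc_id, text in self._docs.items()
            if doc_id not in self.deleted and term in text.split()
        ]


class SegmentedIndex:
    """A collection of segments that is searched as one index."""

    def __init__(self):
        self.segments = []

    def add_segment(self, docs):
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        # Logical delete: mark the document, leave the segment data untouched.
        for segment in self.segments:
            if segment.contains(doc_id):
                segment.deleted.add(doc_id)

    def update(self, doc_id, text):
        # An update is a delete marker plus a new document in a new segment.
        self.delete(doc_id)
        self.add_segment({doc_id: text})

    def search(self, term):
        hits = []
        for segment in self.segments:
            hits.extend(segment.search(term))
        return hits


idx = SegmentedIndex()
idx.add_segment({1: "red apple", 2: "green apple"})
idx.update(1, "red cherry")
idx.delete(2)
```

After these calls the first segment still physically holds both original documents, yet a search for "apple" returns nothing and a search for "red" returns only the updated document 1.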

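Elasticsearch uses the word differently. An Elasticsearch Index is roughly the counterpart of a database, and it can be divided into multiple Shards. Each shard is a complete Lucene Index. Shards are distributed across the nodes of the cluster and are rebalanced as nodes join or leave.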

2.3 Distributed Data Storage

Elasticsearch splits an index into shards and routes each document to exactly one of them; each shard is a Lucene Index. A write is applied to the primary shard first and then replicated to its replica shards.
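
Routing is conceptually a hash of a routing value (the document ID by default) taken modulo the number of primary shards. The sketch below shows that idea only. Elasticsearch uses a Murmur3 hash internally, and recent versions add a routing factor to support index splitting; here zlib.crc32 is used purely as a deterministic stand-in, and route_to_shard and the document IDs are hypothetical.

```python
import zlib
from collections import Counter


def route_to_shard(routing_value: str, num_primary_shards: int) -> int:
    """Pick a primary shard: hash(routing_value) % num_primary_shards."""
    if num_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    # zlib.crc32 is a stand-in hash (assumption); unlike Python's built-in
    # hash() on strings, it gives the same result on every run.
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


# Hypothetical document IDs, used as the routing value.
placement = Counter(route_to_shard(f"doc-{i}", 5) for i in range(1000))
```

Because the shard number depends on the shard count, changing the number of primary shards would send existing documents to different shards. This is why that number is fixed when the index is created.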

2.4 Disk I/O Challenges

By default, Elasticsearch refreshes each shard once per second: the documents buffered in memory are written to a new segment, which makes them searchable. This interval is what makes search near real-time, but it also produces many small segments, and each one costs file handles and adds work to every search. A background merge thread periodically combines small segments into larger ones, drops documents that were marked as deleted or superseded by updates, and updates the commit point. Merging does not block indexing or search, although it competes with them for disk I/O and CPU.
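
The refresh and merge cycle can be modeled as follows. This is a simplified sketch, not Elasticsearch's implementation: ShardSketch is a hypothetical class, refresh is called by hand instead of on a timer, and merge combines every segment at once, whereas a real merge policy selects a few segments at a time.

```python
class ShardSketch:
    """Toy model of one shard: an in-memory buffer plus searchable segments."""

    def __init__(self):
        self.buffer = {}     # indexed, but not yet visible to search
        self.segments = []   # each segment: {"docs": {...}, "deleted": set()}

    def index(self, doc_id, text):
        self.buffer[doc_id] = text

    def refresh(self):
        """Turn the buffer into a new searchable segment."""
        if self.buffer:
            self.segments.append({"docs": dict(self.buffer), "deleted": set()})
            self.buffer = {}

    def delete(self, doc_id):
        for segment in self.segments:
            if doc_id in segment["docs"]:
                segment["deleted"].add(doc_id)

    def search(self, term):
        return sorted(
            doc_id
            for segment in self.segments
            for doc_id, text in segment["docs"].items()
            if doc_id not in segment["deleted"] and term in text.split()
        )

    def merge(self):
        """Combine all segments into one, dropping documents marked as deleted."""
        live = {}
        for segment in self.segments:
            for doc_id, text in segment["docs"].items():
                if doc_id not in segment["deleted"]:
                    live[doc_id] = text
        self.segments = [{"docs": live, "deleted": set()}] if live else []


shard = ShardSketch()
shard.index(1, "hello world")
before_refresh = shard.search("hello")   # the buffer is not searchable yet
shard.refresh()
after_refresh = shard.search("hello")

for doc_id in (2, 3, 4):                 # three more refresh cycles
    shard.index(doc_id, "hello again")
    shard.refresh()
shard.delete(3)
segments_before_merge = len(shard.segments)
shard.merge()
```

In this run the document is invisible before the first refresh and visible after it, four refreshes leave four small segments, and the merge collapses them into one segment that no longer contains the deleted document.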

2.5 Ensuring Data Durability

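Documents in the in-memory buffer, and segments that have not yet been committed to disk, would be lost in a crash. To close this gap, Elasticsearch uses a write-ahead log called the translog: every indexing operation is written to the translog as well as to the buffer. On restart, Elasticsearch loads the segments from the last commit point and then replays the translog to recover the operations that had not been committed. How much data can be lost depends on how often the translog itself is fsynced to disk. When index.translog.durability is set to async, the translog is fsynced every 5 seconds by default, which limits potential loss to a few seconds of data. Since Elasticsearch 2.0 the default is request, which fsyncs the translog before a write is acknowledged.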
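
The recovery path can be illustrated with a small write-ahead log in Python. This is a sketch under stated assumptions: the Translog class, the JSON-lines file format, and the recover function are hypothetical and far simpler than the real translog, which is a binary format with generations and checkpoints.

```python
import json
import os
import tempfile


class Translog:
    """Toy append-only operation log, one JSON object per line."""

    def __init__(self, path):
        self.path = path
        self._file = open(path, "a", encoding="utf-8")

    def append(self, op, sync=True):
        self._file.write(json.dumps(op) + "\n")
        if sync:
            # Push the operation to disk before acknowledging the write.
            self._file.flush()
            os.fsync(self._file.fileno())

    def replay(self):
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def close(self):
        self._file.close()


def recover(committed_docs, translog):
    """Rebuild state: start from committed data, then replay logged operations."""
    docs = dict(committed_docs)
    for op in translog.replay():
        if op["type"] == "index":
            docs[op["id"]] = op["text"]
        elif op["type"] == "delete":
            docs.pop(op["id"], None)
    return docs


path = os.path.join(tempfile.mkdtemp(), "translog.jsonl")
committed = {"1": "already in a committed segment"}

log = Translog(path)
log.append({"type": "index", "id": "2", "text": "buffered only"})
log.append({"type": "delete", "id": "1"})
log.close()   # simulated crash: the in-memory buffer is gone, the log survives

reopened = Translog(path)
recovered = recover(committed, reopened)
reopened.close()
```

Passing sync=False to append would model the asynchronous mode: writes return sooner, but operations logged since the last fsync can be lost if the machine fails.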

03 How to Deeply Study Elasticsearch

Readers who already have some experience with distributed systems and development can concentrate on sections 2.1 and 2.2, which contain the core design. Understanding the underlying concepts matters most; the implementation details are best explored directly in the source code.

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
