How Elasticsearch Achieves Near Real-Time Search: Core Techniques Explained

This article explains how Elasticsearch implements near real-time search by using immutable inverted indexes, segment merging, sharding, and a translog for durability, detailing the challenges and solutions behind its distributed full‑text search architecture.

MaGe Linux Operations

Oct 13, 2021

1. Near Real-Time Search

1.1 Real-Time vs Near Real-Time

Real-time search means that data can be searched immediately after it is inserted. Near real-time search accepts a short delay between insertion and visibility; in Elasticsearch this delay is about one second by default.

1.2 Challenges of Near Real-Time

Even on a single node, near real-time search is hard: the system must persist data durably and still use in-memory caching to keep access fast. In a distributed system such as Elasticsearch the difficulty grows, because the system must keep data durable across nodes while also building the internal data structures that full‑text search needs. The rest of this article describes how Elasticsearch solves these problems.

2. Implementation of Elasticsearch

2.1 Immutable Data Structures

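Mutable shared data is a major source of difficulty in concurrent programming, and functional programming avoids it by using immutable data. Elasticsearch takes the same approach. Its core data structure is the Inverted Index, which maps each term to the documents that contain it, along with term statistics. Because an Inverted Index never changes once it is written, it can be read concurrently without locks and cached safely. Near real-time search relies on this immutability, which Elasticsearch inherits from Lucene.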
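
To make the idea concrete, here is a minimal sketch of an immutable inverted index. It is not how Lucene stores data on disk; the function name `build_inverted_index` and the whitespace tokenizer are assumptions chosen for illustration only.

```python
from collections import defaultdict
from types import MappingProxyType


def build_inverted_index(docs):
    """Build a read-only mapping of term -> sorted tuple of doc ids.

    `docs` maps doc id -> text. Tokenizing on whitespace is a simplification.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Freeze the result: tuples for posting lists, a read-only view for the map.
    frozen = {term: tuple(sorted(ids)) for term, ids in postings.items()}
    return MappingProxyType(frozen)


index = build_inverted_index({1: "quick brown fox", 2: "lazy brown dog"})
print(index["brown"])  # (1, 2)
```

Any attempt to assign into `index` raises `TypeError`, so concurrent readers never observe a half-updated structure.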

2.2 From Immutable to Mutable

When the first documents are inserted, Lucene builds an immutable Inverted Index. Later inserts are handled with incremental storage and logical markers rather than in-place changes: new data produces a new immutable Inverted Index, and a search queries every index and merges the results and statistics. Deletions are recorded as markers in a separate delete (del) structure and filtered out at search time; an update is a deletion of the old version followed by an insert of the new one. Each of these Inverted Indexes is called a Segment, and a Lucene Index manages the set of Segments.
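
The following sketch illustrates this scheme under simplifying assumptions. The class names `Segment` and `SegmentedIndex` are hypothetical, and real Lucene segments are on-disk files rather than Python objects.

```python
class Segment:
    """One immutable inverted index plus a set of logical delete markers."""

    def __init__(self, docs):
        postings = {}
        for doc_id, text in docs.items():
            for term in set(text.lower().split()):
                postings.setdefault(term, []).append(doc_id)
        self._postings = {t: tuple(sorted(ids)) for t, ids in postings.items()}
        self._doc_ids = frozenset(docs)
        self.deleted = set()  # the only part that changes after creation

    def contains(self, doc_id):
        return doc_id in self._doc_ids and doc_id not in self.deleted

    def search(self, term):
        return [d for d in self._postings.get(term, ()) if d not in self.deleted]


class SegmentedIndex:
    """Manages a list of segments; writes only ever append a new segment."""

    def __init__(self):
        self.segments = []

    def delete(self, doc_id):
        for seg in self.segments:
            if seg.contains(doc_id):
                seg.deleted.add(doc_id)

    def add(self, docs):
        # An update is a delete marker on the old copy plus a fresh insert.
        for doc_id in docs:
            self.delete(doc_id)
        self.segments.append(Segment(docs))

    def search(self, term):
        hits = []
        for seg in self.segments:
            hits.extend(seg.search(term))
        return sorted(hits)


idx = SegmentedIndex()
idx.add({1: "quick brown fox", 2: "lazy brown dog"})
idx.add({2: "lazy white dog", 3: "brown bear"})  # document 2 is updated
idx.delete(1)
print(idx.search("brown"))  # [3]
print(idx.search("white"))  # [2]
```

The posting lists themselves are never modified; only the small set of delete markers changes.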

In Elasticsearch, the counterpart of a database is called an Index. Each Index can be divided into Shards; every Shard is a Lucene Index, and Shards are distributed across nodes and rebalanced automatically as nodes join or leave the cluster.

The same idea appears in other data structures such as the Log‑Structured Merge Tree (LSM).

2.3 Distributed Data Storage

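Elasticsearch splits an Index into Shards and routes each document to one of them by hashing its routing value, which is the document ID by default. Each Shard is a Lucene Index. A write goes to the primary shard first and is then replicated to the replica shards.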
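
A rough sketch of routing and primary-then-replica writes follows. The hash function here is `zlib.crc32`, used only as a deterministic stand-in; it is not the hash Elasticsearch uses. The names `route_to_shard`, `ShardGroup`, and `index_document` are hypothetical.

```python
import zlib


def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    # crc32 is an assumption made to keep the sketch dependency-free;
    # the principle is simply hash(routing value) modulo the shard count.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


class ShardGroup:
    """A primary shard plus its replicas, each modelled as a plain dict."""

    def __init__(self, num_replicas):
        self.primary = {}
        self.replicas = [{} for _ in range(num_replicas)]

    def index(self, doc_id, doc):
        self.primary[doc_id] = doc      # write the primary first
        for replica in self.replicas:   # then copy to every replica
            replica[doc_id] = doc


NUM_SHARDS = 3
groups = [ShardGroup(num_replicas=1) for _ in range(NUM_SHARDS)]


def index_document(doc_id, doc):
    shard = route_to_shard(doc_id, NUM_SHARDS)
    groups[shard].index(doc_id, doc)
    return shard


shard = index_document("user-42", {"name": "Ada"})
print(groups[shard].replicas[0]["user-42"])  # {'name': 'Ada'}
```

Because the shard is derived from the shard count, changing the number of primary shards would send existing document IDs to different shards, which is why that number is fixed when an Index is created.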

2.4 Disk I/O Challenges

Elasticsearch creates a new Segment from buffered documents roughly every second (refresh), providing near real-time rather than real-time search. Frequent segment creation can lead to many small files, increasing file handles and search overhead.
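
The gap between indexing and visibility can be sketched as follows. This is a toy model, assuming a hypothetical `NearRealTimeIndex` class whose `refresh` method is called by hand rather than by a timer.

```python
class NearRealTimeIndex:
    def __init__(self):
        self.buffer = {}    # indexed but not yet searchable
        self.segments = []  # each entry maps term -> tuple of doc ids

    def index(self, doc_id, text):
        self.buffer[doc_id] = text

    def refresh(self):
        """Turn the buffered documents into one new searchable segment."""
        if not self.buffer:
            return
        postings = {}
        for doc_id, text in self.buffer.items():
            for term in set(text.lower().split()):
                postings.setdefault(term, []).append(doc_id)
        self.segments.append(
            {term: tuple(sorted(ids)) for term, ids in postings.items()}
        )
        self.buffer = {}

    def search(self, term):
        hits = []
        for seg in self.segments:
            hits.extend(seg.get(term, ()))
        return sorted(hits)


nrt = NearRealTimeIndex()
nrt.index(1, "hello world")
print(nrt.search("hello"))  # [] because the document is still in the buffer
nrt.refresh()
print(nrt.search("hello"))  # [1]
```

In Elasticsearch the equivalent of calling `refresh` on a timer is controlled by the `index.refresh_interval` setting.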

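To mitigate this, a background merge thread combines small Segments into larger ones. Documents marked as deleted and superseded versions are dropped during the merge, and the Commit Point is then updated to reference the new Segment. Merging runs in the background and does not block inserts or searches, though it does consume I/O and CPU.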
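
A merge can be sketched as a pure function over segments. Here each segment is assumed to be a `(postings, deleted_ids)` pair; that representation and the name `merge_segments` are illustrative assumptions, not Lucene's actual merge policy.

```python
def merge_segments(segments):
    """Merge (postings, deleted_ids) pairs into one segment with no deletes.

    Each `postings` maps term -> tuple of doc ids; `deleted_ids` is a set of
    doc ids that carry a delete marker in that segment.
    """
    merged = {}
    for postings, deleted in segments:
        for term, doc_ids in postings.items():
            live = [d for d in doc_ids if d not in deleted]
            if live:
                merged.setdefault(term, set()).update(live)
    # The merged segment starts with an empty delete set.
    return {t: tuple(sorted(ids)) for t, ids in merged.items()}, set()


small_a = ({"brown": (1, 2), "fox": (1,)}, {1})  # document 1 is deleted
small_b = ({"brown": (3,), "bear": (3,)}, set())
merged, deleted = merge_segments([small_a, small_b])
print(merged)  # {'brown': (2, 3), 'bear': (3,)}
```

The inputs are left untouched, so searches can keep using the old segments until the merged one is ready to replace them.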

2.5 Ensuring Data Durability

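Elasticsearch uses a write‑ahead log called the translog. Every insert is written to both the in-memory buffer and the translog. On restart, Elasticsearch loads the committed Segments from disk and then replays the operations recorded in the translog. Two separate operations bound the data at risk. A flush commits the in-memory Segments to disk and starts a fresh translog; it runs every 30 minutes or when the translog grows large. The translog itself is fsynced to disk every 5 seconds, so a crash loses at most a few seconds of writes. These intervals are the defaults of older releases; since version 2.0 the translog is fsynced after every request by default, and the 5-second interval applies only when index.translog.durability is set to async.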
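
The write-then-replay pattern can be sketched with a JSON-lines file standing in for the translog. The class name `DurableBuffer` and the file format are assumptions for illustration; the real translog is a binary format managed per shard.

```python
import json
import os
import tempfile


class DurableBuffer:
    """In-memory buffer backed by an append-only log that is replayed on start."""

    def __init__(self, translog_path):
        self.translog_path = translog_path
        self.buffer = {}
        self._replay()
        self._log = open(translog_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.translog_path):
            return
        with open(self.translog_path, encoding="utf-8") as log:
            for line in log:
                op = json.loads(line)
                self.buffer[op["id"]] = op["doc"]

    def index(self, doc_id, doc, sync=True):
        # Log first, then apply to the in-memory buffer.
        self._log.write(json.dumps({"id": doc_id, "doc": doc}) + "\n")
        if sync:
            self._log.flush()
            os.fsync(self._log.fileno())  # force the bytes to disk
        self.buffer[doc_id] = doc

    def close(self):
        self._log.close()


path = os.path.join(tempfile.mkdtemp(), "translog.jsonl")
first = DurableBuffer(path)
first.index(1, "hello world")
first.close()  # the in-memory buffer of `first` is now abandoned

recovered = DurableBuffer(path)  # simulates a restart
print(recovered.buffer)  # {1: 'hello world'}
recovered.close()
```

Passing `sync=False` corresponds to the periodic-fsync mode: writes are faster, but operations logged since the last fsync can be lost in a crash.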

3. Further Learning

Readers who already have experience with distributed systems can skim sections 2.3 and 2.5, which cover concepts common to many systems. Mastering Elasticsearch requires both basic usage and an understanding of the design principles explained in sections 2.1 and 2.2; reading the source code gives deeper insight.
