
How ElasticSearch Delivers Near Real-Time Search with Immutable Indexes

ElasticSearch achieves near real-time search by building immutable inverted indexes (segments), using incremental indexing, logical deletions, background segment merging, and a write-ahead translog to ensure durability, while distributing shards across nodes to balance load and maintain data consistency.

21CTO

1. Real-Time vs Near Real-Time

Real-time search means a document can be found the instant it is written. Near real-time search relaxes this: a newly written document becomes searchable after a short, bounded delay, which in ElasticSearch is about one second by default.

2. Challenges of Near Real-Time

Even on a single node, near real-time search is hard: every write must be persisted so that it survives a crash, yet disk writes are slow, so the system has to rely on caching to keep access fast. A distributed system like ElasticSearch faces the same tension across many nodes: it must persist incoming data while also building the internal full-text structures that make that data searchable.

2.1 Immutable Data Structures

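Mutable shared data is a classic source of trouble in concurrent programs; functional programming sidesteps it with immutable structures, which many threads can read without locks. ElasticSearch builds on the same idea. It relies on Lucene’s inverted index, which maps each term to the documents that contain it and stores term statistics such as frequencies. Once written, this index is never modified.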
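
As an illustration of the idea, here is a minimal Python sketch of an immutable inverted index. It is not Lucene's on-disk format: the function name `build_inverted_index`, the whitespace tokenizer, and the `(doc_id, term_frequency)` posting layout are all assumptions chosen for brevity.

```python
from collections import defaultdict
from types import MappingProxyType


def build_inverted_index(docs):
    """Map each term to a sorted tuple of (doc_id, term_frequency) postings."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    # Freeze the result: tuples inside a read-only mapping view, so readers
    # can share the index across threads without locks.
    frozen = {term: tuple(sorted(p.items())) for term, p in postings.items()}
    return MappingProxyType(frozen)


docs = {1: "quick brown fox", 2: "quick quick dog"}
index = build_inverted_index(docs)
print(index["quick"])  # ((1, 1), (2, 2))
```

Because the mapping is read-only, an attempt such as `index["cat"] = ()` raises `TypeError`; adding documents means building another index rather than editing this one.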

2.2 From Immutable to Mutable

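Because an existing index cannot be changed, Lucene indexes incrementally: new documents go into a new immutable inverted index instead of modifying an old one. Each of these small inverted indexes is called a “segment”, and a collection of segments forms an “Index”. A deletion is logical: the document is only marked as deleted in a separate “del” structure and filtered out of search results. An update is a deletion plus an insertion: the old version is marked as deleted and the new version is written to a fresh segment. Since segments keep accumulating, Lucene merges small segments into larger ones in the background, and documents marked as deleted are dropped during the merge.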
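
The interplay of segments, logical deletes, updates, and merging can be sketched in a few lines of Python. This is a toy model, not Lucene's API: the `Segment` and `Index` classes, the `deleted` set standing in for the “del” structure, and the choice to keep the original text inside each segment are assumptions made for illustration.

```python
class Segment:
    """An immutable mini-index: term -> frozenset of doc ids."""

    def __init__(self, docs):
        self.docs = dict(docs)  # stored source text, needed again when merging
        terms = {}
        for doc_id, text in self.docs.items():
            for term in text.lower().split():
                terms.setdefault(term, set()).add(doc_id)
        self.terms = {t: frozenset(ids) for t, ids in terms.items()}


class Index:
    """A list of segments plus a delete set (the "del" structure)."""

    def __init__(self):
        self.segments = []
        self.deleted = set()  # (segment_number, doc_id) pairs marked deleted

    def add_segment(self, docs):
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        # Logical delete: mark the document, never touch the segment itself.
        for seg_no, seg in enumerate(self.segments):
            if doc_id in seg.docs:
                self.deleted.add((seg_no, doc_id))

    def update(self, doc_id, text):
        # Update = mark the old version deleted + add the new one in a fresh segment.
        self.delete(doc_id)
        self.add_segment({doc_id: text})

    def search(self, term):
        hits = set()
        for seg_no, seg in enumerate(self.segments):
            for doc_id in seg.terms.get(term, ()):
                if (seg_no, doc_id) not in self.deleted:
                    hits.add(doc_id)
        return sorted(hits)

    def merge(self):
        # Rewrite all live documents into one segment; deleted ones disappear.
        live = {}
        for seg_no, seg in enumerate(self.segments):
            for doc_id, text in seg.docs.items():
                if (seg_no, doc_id) not in self.deleted:
                    live[doc_id] = text
        self.segments = [Segment(live)]
        self.deleted = set()


idx = Index()
idx.add_segment({1: "red apple", 2: "green apple"})
idx.update(1, "red cherry")
print(idx.search("apple"))   # [2]
print(idx.search("cherry"))  # [1]
idx.merge()
print(len(idx.segments), idx.search("red"))  # 1 [1]
```

Note that no segment is ever edited in place: the only mutable pieces are the list of segments and the delete set, which is what keeps concurrent readers safe.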

In ElasticSearch an index is divided into shards; each shard corresponds to a Lucene index. Shards are allocated to different nodes and can be rebalanced.

2.3 Distributed Data Storage

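ElasticSearch spreads documents across shards and routes each request to the shard that owns the document. A write is applied on the primary shard first and then forwarded to that shard's replicas, which keeps the copies in sync.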
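
A simplified sketch of routing and primary-then-replica writes follows. The rule used here, a stable hash of the routing key modulo the number of primary shards, is a simplification of what ElasticSearch does; `zlib.crc32` is only a stand-in for the real hash function, and `route_to_shard`, `index_document`, and the dict-based shards are hypothetical names for illustration.

```python
import zlib


def route_to_shard(routing_key: str, num_primary_shards: int) -> int:
    # CRC32 is a stand-in hash; any stable hash demonstrates the idea.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def index_document(doc_id, doc, primaries, replicas):
    """Write to the primary copy first, then copy the change to its replicas."""
    shard = route_to_shard(doc_id, len(primaries))
    primaries[shard][doc_id] = doc
    for replica in replicas[shard]:
        replica[doc_id] = doc
    return shard


primaries = [{} for _ in range(3)]
replicas = [[{}] for _ in range(3)]  # one replica copy per primary shard
shard = index_document("user-42", {"name": "Ada"}, primaries, replicas)
print(primaries[shard]["user-42"] == replicas[shard][0]["user-42"])  # True
```

One consequence of modulo-based routing is that changing the number of primary shards would send existing document IDs to different shards, which is why the primary shard count is normally fixed when an index is created.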

2.4 Disk I/O Challenges

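Making every new document durable with fsync before it becomes searchable would be far too slow. Instead, newly indexed documents are collected in an in-memory buffer; a refresh writes the buffer out as a new segment in the filesystem cache and opens it for search, while the expensive fsync to disk happens later. By default ElasticSearch refreshes every second, creating a new segment each time. A document written just after a refresh therefore stays invisible for up to a second, which is why ElasticSearch offers near real-time rather than true real-time search.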
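
The visibility gap can be reproduced with a small in-memory model. This sketch does not touch the filesystem cache at all; the class `NearRealTimeShard` and its methods are hypothetical, and `refresh()` is called by hand here where ElasticSearch would run it on a timer.

```python
class NearRealTimeShard:
    def __init__(self):
        self.buffer = []    # indexed, but not yet searchable
        self.segments = []  # immutable segments (tuples of documents)

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        # Turn the buffer into a new searchable segment and start a new buffer.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, word):
        return [doc for seg in self.segments for doc in seg if word in doc.split()]


shard = NearRealTimeShard()
shard.index("hello world")
before = shard.search("hello")  # the document is still in the buffer
shard.refresh()
after = shard.search("hello")   # the refresh made it visible
print(before, after)  # [] ['hello world']
```

In ElasticSearch itself the interval is controlled by the `index.refresh_interval` index setting; lengthening it trades search freshness for indexing throughput.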

2.5 Ensuring Data Durability

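Segments that exist only in the filesystem cache would be lost if a node failed, so ElasticSearch uses a write-ahead log called the translog. Every document added to the in-memory buffer is also appended to the translog. A flush fsyncs the segments to disk, records a new commit point, and clears the translog; it is triggered when the translog grows large. On restart, committed segments are loaded from the last commit point and the operations in the translog are replayed on top of them. In the setup described here, the translog itself is fsynced to disk every 5 seconds, so at most a few seconds of writes can be lost. Recent ElasticSearch versions fsync the translog after every request by default, and the 5-second interval applies only when asynchronous durability is configured.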
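
The recovery path described above (load the last commit, then replay the translog) can be sketched with two files. Everything here is an assumption for illustration: the class `DurableShard`, the file names `commit.json` and `translog.jsonl`, and the JSON formats. For simplicity the sketch fsyncs the log on every write rather than every 5 seconds.

```python
import json
import os
import tempfile


class DurableShard:
    def __init__(self, directory):
        self.commit_path = os.path.join(directory, "commit.json")
        self.translog_path = os.path.join(directory, "translog.jsonl")
        self.docs = {}
        self._recover()

    def _recover(self):
        # 1. Load whatever the last flush committed.
        if os.path.exists(self.commit_path):
            with open(self.commit_path, encoding="utf-8") as f:
                self.docs = json.load(f)
        # 2. Replay the operations logged after that commit.
        if os.path.exists(self.translog_path):
            with open(self.translog_path, encoding="utf-8") as log:
                for line in log:
                    op = json.loads(line)
                    self.docs[op["id"]] = op["doc"]

    def index(self, doc_id, doc):
        # Append to the log and fsync it, then apply the write in memory.
        with open(self.translog_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"id": doc_id, "doc": doc}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.docs[doc_id] = doc

    def flush(self):
        # Persist a new commit point atomically, then start an empty translog.
        tmp = self.commit_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.docs, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.commit_path)
        open(self.translog_path, "w").close()


with tempfile.TemporaryDirectory() as d:
    shard = DurableShard(d)
    shard.index("1", "committed doc")
    shard.flush()
    shard.index("2", "logged but not committed")
    del shard  # simulate a crash: all in-memory state is gone
    recovered = DurableShard(d)
    print(sorted(recovered.docs))  # ['1', '2']
```

Document "1" comes back from the commit point and document "2" from the replayed translog, so nothing acknowledged is lost even though "2" was never part of a commit.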

3. Learning ElasticSearch

Understanding the design behind immutable indexes, segment merging, and the translog gives deeper insight than memorizing commands. With these core concepts in place, reading the source code becomes a practical route to mastery.

