How ElasticSearch Delivers Near Real-Time Search with Immutable Indexes
ElasticSearch achieves near real-time search by building immutable inverted indexes (segments), using incremental indexing, logical deletions, background segment merging, and a write-ahead translog to ensure durability, while distributing shards across nodes to balance load and maintain data consistency.
1. Real-Time vs Near Real-Time
Real-time search means a document can be found the instant it is written. Near real-time search accepts a short, bounded delay between the write and the moment the document becomes searchable; in ElasticSearch that delay is about one second by default.
2. Challenges of Near Real-Time
Even on a single node, near real-time search is hard: every write must be persisted so that it survives a crash, yet disk writes are slow, so the system has to lean on caching to stay fast. A distributed system like ElasticSearch faces the same tension across many nodes: it must persist data while also building the internal full-text structures that make that data searchable.
2.1 Immutable Data Structures
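Shared mutable data is a major source of difficulty in concurrent programming, which is why functional programming favors immutable structures: data that never changes can be read by many threads without locks and can be cached safely. ElasticSearch relies on Lucene’s immutable inverted index, which maps each term to the documents that contain it and stores term statistics alongside.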
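To make the idea concrete, here is a minimal sketch of an immutable inverted index in Python. It is not how Lucene stores its data: the whitespace tokenizer, the `build_inverted_index` name, and the tuple-based postings are all assumptions chosen for illustration.

```python
from collections import defaultdict
from types import MappingProxyType


def build_inverted_index(docs):
    """Map each term to a sorted tuple of (doc_id, term_frequency) postings."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer (assumption)
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    # Freeze the result: tuples for postings, a read-only view for the mapping.
    frozen = {term: tuple(sorted(hits.items())) for term, hits in postings.items()}
    return MappingProxyType(frozen)


docs = {1: "quick brown fox", 2: "quick quick dog"}
index = build_inverted_index(docs)
print(index["quick"])  # postings for "quick": doc 1 once, doc 2 twice
```

Because the returned mapping is read-only, any attempt to assign into it raises a `TypeError`, so concurrent readers never need a lock.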
2.2 From Immutable to Mutable
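An immutable index still has to accept new writes. Lucene handles this with incremental indexing: newly indexed documents go into a new immutable segment instead of modifying existing ones. Deletions are logical: a deleted document is recorded in a separate “del” structure and filtered out of search results, while its data stays in the old segment. An update is a delete plus an add: the old version is marked as deleted and the new version is written to a fresh segment. In the background, small segments are merged into larger ones, and documents marked as deleted are dropped along the way. Each of these units is a segment, and a collection of segments forms a Lucene index.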
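The interplay of segments, logical deletes, and merging can be sketched with a toy model. The `Segment` and `ToyIndex` classes below are hypothetical and keep raw text instead of real postings; they only illustrate the bookkeeping, not Lucene's file formats or merge policy.

```python
class Segment:
    """An immutable batch of documents plus a set of logically deleted ids."""

    def __init__(self, docs):
        self.docs = dict(docs)  # never modified after construction
        self.deleted = set()    # stands in for the "del" structure


class ToyIndex:
    def __init__(self):
        self.segments = []

    def add(self, docs):
        # Older copies of these ids are only marked deleted, never rewritten.
        for seg in self.segments:
            seg.deleted.update(doc_id for doc_id in docs if doc_id in seg.docs)
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def search(self, word):
        hits = []
        for seg in self.segments:
            for doc_id, text in seg.docs.items():
                if doc_id not in seg.deleted and word in text.split():
                    hits.append(doc_id)
        return sorted(hits)

    def merge(self):
        # Copy only live documents into one new segment; deleted ones vanish.
        live = {}
        for seg in self.segments:
            for doc_id, text in seg.docs.items():
                if doc_id not in seg.deleted:
                    live[doc_id] = text
        self.segments = [Segment(live)]


idx = ToyIndex()
idx.add({1: "red apple", 2: "green apple"})
idx.add({1: "red cherry"})  # update: old doc 1 is marked deleted
idx.delete(2)
```

Until `idx.merge()` runs, the first segment still physically holds both old documents; the merge is the point where their space is actually reclaimed.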
In ElasticSearch an index is divided into shards; each shard corresponds to a Lucene index. Shards are allocated to different nodes and can be rebalanced.
2.3 Distributed Data Storage
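ElasticSearch splits an index's data across shards and routes each request to the shard that owns the document. A write is applied on the primary shard first and is then replicated to that shard's replicas.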
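A rough sketch of routing and replication follows, assuming the commonly cited simplified rule that the shard is a hash of the routing value (the document ID by default) modulo the number of primary shards. `zlib.crc32` is only a stand-in for the hash ElasticSearch actually uses, and the `route` and `index_document` helpers are hypothetical names.

```python
import zlib


def route(doc_id, num_primary_shards):
    """Pick a shard deterministically from the document id."""
    # crc32 is a stand-in hash chosen for this sketch (assumption).
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def index_document(primaries, replicas, doc_id, body):
    """Apply a write on the primary shard, then copy it to its replicas."""
    shard = route(doc_id, len(primaries))
    primaries[shard][doc_id] = body    # the primary applies the operation first
    for replica in replicas[shard]:
        replica[doc_id] = body         # then each replica copy receives it
    return shard


primaries = [{} for _ in range(3)]     # three primary shards
replicas = [[{}] for _ in range(3)]    # one replica per primary
shard = index_document(primaries, replicas, "user-42", {"name": "Ada"})
```

Because the shard is derived from the document ID and the shard count, the same rule that placed a document can find it again at read time, which is also why the number of primary shards cannot simply be changed under this scheme.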
2.4 Disk I/O Challenges
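New documents first land in an in-memory buffer. Making them durable requires an fsync, which is too expensive to run on every write. Instead, ElasticSearch periodically writes the buffer out as a new segment in the filesystem cache and opens it for search without waiting for an fsync; this lightweight step is called a refresh. Segments are fsynced to disk later, at commit time. By default a refresh runs once per second, so a document becomes searchable up to about a second after it is indexed, which is why ElasticSearch offers near real-time rather than true real-time search.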
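The visibility gap that a refresh creates can be shown with a small toy model. `ToyShard` is a hypothetical class, and the refresh here is triggered by hand instead of by a one-second timer.

```python
class ToyShard:
    def __init__(self):
        self.buffer = []    # indexed but not yet searchable
        self.segments = []  # searchable, immutable tuples

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        # Turn the buffer into a new searchable segment; no fsync is involved.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, word):
        return [d for seg in self.segments for d in seg if word in d.split()]


toy_shard = ToyShard()
toy_shard.index("hello world")
before = toy_shard.search("hello")  # the document is still in the buffer
toy_shard.refresh()
after = toy_shard.search("hello")   # now it sits in a searchable segment
```

In this sketch `before` is empty and `after` contains the document, which mirrors the short window in which a freshly indexed document is not yet searchable.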
2.5 Ensuring Data Durability
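To avoid losing data when a node fails, ElasticSearch uses a write-ahead log called the translog. Every document added to the in-memory buffer is also appended to the translog. On restart, the segments recorded in the last commit point are loaded from disk, and the translog is replayed to recover the operations that happened after that commit. The translog itself is fsynced to disk every 5 seconds (in recent versions this interval applies when index.translog.durability is set to async; the default, request, fsyncs before each write is acknowledged), and a flush that commits segments and starts a fresh translog is triggered when the translog grows large. Together these bound how much data a crash can lose.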
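The recovery sequence described above (load the last commit, then replay the log) can be sketched as follows. `DurableShard`, the `commit.json` and `translog.jsonl` file names, and the JSON-lines format are all assumptions made for this example; the sketch also fsyncs on every write for simplicity and does not handle a crash in the middle of a flush.

```python
import json
import os
import tempfile


class DurableShard:
    def __init__(self, data_dir):
        self.commit_path = os.path.join(data_dir, "commit.json")
        self.translog_path = os.path.join(data_dir, "translog.jsonl")
        self.docs = {}
        self._recover()

    def _recover(self):
        # 1. Load whatever was durably committed.
        if os.path.exists(self.commit_path):
            with open(self.commit_path, encoding="utf-8") as f:
                self.docs = json.load(f)
        # 2. Replay operations logged after that commit.
        if os.path.exists(self.translog_path):
            with open(self.translog_path, encoding="utf-8") as log:
                for line in log:
                    op = json.loads(line)
                    self.docs[op["id"]] = op["doc"]

    def index(self, doc_id, doc):
        # Append to the log and fsync before updating the in-memory state.
        with open(self.translog_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"id": doc_id, "doc": doc}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.docs[doc_id] = doc

    def flush(self):
        # Write a new commit, then truncate the translog.
        with open(self.commit_path, "w", encoding="utf-8") as f:
            json.dump(self.docs, f)
            f.flush()
            os.fsync(f.fileno())
        open(self.translog_path, "w").close()


with tempfile.TemporaryDirectory() as data_dir:
    shard_a = DurableShard(data_dir)
    shard_a.index("1", "first")
    shard_a.flush()                       # "1" is now part of the commit
    shard_a.index("2", "second")          # "2" exists only in the translog
    recovered = DurableShard(data_dir)    # simulates a restart after a crash
    recovered_docs = dict(recovered.docs)
```

After the simulated restart, `recovered_docs` holds both documents: the first comes from the commit file and the second from replaying the translog.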
3. Learning ElasticSearch
Understanding the design of immutable indexes, segment merging, and the translog provides deeper insight than merely learning commands. Once these core concepts are clear, reading the source code is the most direct route to practical mastery.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
