Analysis of Elasticsearch Write Operations and Underlying Mechanisms
This article examines how Elasticsearch implements write operations on top of Lucene, detailing the challenges of Lucene's write path and describing Elasticsearch's distributed design, near‑real‑time refresh, translog reliability, shard replication, partial updates, and the complete write workflow from coordinating node to primary and replica shards.
1. Lucene Write Operations and Issues
Elasticsearch relies on Lucene for document read/write operations. Lucene provides three core methods: public long addDocument(...), public long deleteDocuments(...), and public long updateDocument(...). However, several problems exist:
No concurrency design: Lucene is a library without built‑in distributed features, so using it for massive data requires additional distributed design.
Not real‑time: After writing to Lucene, a document cannot be searched until a complete segment is generated.
Unreliable storage: Data written to Lucene resides in memory and may be lost if the server crashes before being persisted.
No partial updates: Lucene's updateDocument only supports full‑document replacement, not in‑place modification of individual fields.
2. Elasticsearch Write Solutions
To overcome Lucene’s limitations, Elasticsearch introduces several designs.
2.1 Distributed Design
Elasticsearch splits an index into multiple primary shards, each with replica shards. Shards are allocated on different nodes, and replicas of the same shard never share a node. When a write request arrives, Elasticsearch computes the target shard using the routing formula:
shard_num = hash(_routing) % num_primary_shards
The request is then routed to the node holding that primary shard.
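The routing computation can be sketched as follows. Elasticsearch actually uses a Murmur3 hash of the routing value; zlib.crc32 stands in for it here purely for illustration:

```python
# Sketch of Elasticsearch-style shard routing (illustrative only).
# Elasticsearch hashes _routing with Murmur3; crc32 is a stand-in.
import zlib

def shard_num(routing: str, num_primary_shards: int) -> int:
    """Return the primary shard a document with this routing lands on."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

# _routing defaults to the document _id, so the same id always
# maps to the same primary shard.
print(shard_num("doc-42", 5))
```

Because the formula depends on num_primary_shards, the primary shard count is fixed at index creation time; changing it would invalidate every previously computed route.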
2.2 Near Real‑time Refresh
After a document is written to Lucene, it is not immediately searchable. Elasticsearch periodically executes a refresh operation, which uses Lucene's reopen (openIfChanged in newer versions) to create a new segment from the in‑memory buffer, making the buffered documents searchable. The interval is controlled by the refresh_interval setting (default 1 s). Clients can force immediate visibility by setting the refresh parameter on the write request or by calling the refresh API explicitly.
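The visibility rule above can be pictured with a toy model (purely illustrative, not how Lucene segments work internally): indexed documents sit in a buffer and only become searchable once refresh() turns the buffer into a simulated segment.

```python
# Toy model of near-real-time refresh: newly indexed docs live in an
# in-memory buffer and only become searchable after refresh() creates
# a new (simulated) segment. Illustrative only.
class ToyIndex:
    def __init__(self):
        self.buffer = []      # indexing buffer (not yet searchable)
        self.segments = []    # refreshed segments (searchable)

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        if self.buffer:
            self.segments.append(self.buffer)  # new segment from the buffer
            self.buffer = []

    def search(self, term):
        return [d for seg in self.segments for d in seg
                if term in d.values()]

idx = ToyIndex()
idx.index({"title": "hello"})
print(idx.search("hello"))  # [] -- not visible before refresh
idx.refresh()
print(idx.search("hello"))  # [{'title': 'hello'}]
```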
2.3 Data Storage Reliability
Translog: Every write is first recorded in the transaction log (translog) on disk. Durability and sync interval are configurable via index.translog.durability and index.translog.sync_interval. With request durability, acknowledged writes survive a node failure; with async durability, operations since the last sync may be lost.
Flush: Every 30 minutes, or when the translog reaches a size threshold (index.translog.flush_threshold_size, default 512 MB), Elasticsearch performs a flush: it refreshes, commits all segments to disk, and clears the translog.
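The interaction between translog durability and flush can be sketched with a toy append-only log (illustrative only; names and layout are not Elasticsearch internals):

```python
# Minimal sketch of translog behavior. With "request" durability each
# operation is fsynced before the write is acknowledged; with "async",
# operations since the last interval sync could be lost in a crash.
# flush() mimics an Elasticsearch flush: once segments are committed
# (elided here), the translog entries are redundant and are truncated.
import json
import os
import tempfile

class ToyTranslog:
    def __init__(self, path, durability="request"):
        self.path = path
        self.f = open(path, "a+", encoding="utf-8")
        self.durability = durability

    def add(self, op):
        self.f.write(json.dumps(op) + "\n")
        if self.durability == "request":
            self.f.flush()
            os.fsync(self.f.fileno())   # durable before the ack

    def flush(self):
        # simulate committing all segments, then trim the translog
        self.f.seek(0)
        self.f.truncate()
        self.f.flush()

path = os.path.join(tempfile.mkdtemp(), "translog.log")
tlog = ToyTranslog(path)
tlog.add({"op": "index", "id": "1"})
print(os.path.getsize(path) > 0)   # True: operation persisted
tlog.flush()
print(os.path.getsize(path))       # 0: translog cleared after flush
```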
Merge: Small segments generated by frequent refreshes are merged into larger ones to improve query performance and reduce load. Users can trigger a manual _forcemerge to reduce segment count.
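The effect of merging can be pictured as combining several small sorted runs into one larger run, so queries consult fewer structures. This is only a sketch; heapq.merge stands in for Lucene's actual merge policy:

```python
# Toy segment merge: several small sorted "segments" are rewritten
# into one larger sorted segment (illustrative only).
import heapq

def merge_segments(segments):
    """Merge pre-sorted segments into a single sorted segment."""
    return list(heapq.merge(*segments))

small_segments = [[1, 4], [2, 3], [5]]
print(merge_segments(small_segments))  # [1, 2, 3, 4, 5]
```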
Replica Mechanism: Each primary shard has replica shards on different nodes, providing redundancy and high availability.
2.4 Partial Update
Elasticsearch stores the original JSON document in the _source field. For a partial update, the system retrieves the source, merges the changes, and then performs a full update on Lucene. Versioning ensures that concurrent updates do not overwrite each other.
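The get‑merge‑reindex cycle can be sketched as follows (the store layout and function names are illustrative, not Elasticsearch code):

```python
# Sketch of a partial update as: fetch _source, shallow-merge the
# partial document, then perform a full rewrite with a version bump.
# Illustrative only; real ES also handles seq_no/primary_term checks.
def partial_update(store, doc_id, partial):
    current = store[doc_id]                       # fetch _source + version
    merged = {**current["_source"], **partial}    # apply the partial change
    store[doc_id] = {"_source": merged,           # full rewrite, new version
                     "_version": current["_version"] + 1}
    return store[doc_id]

store = {"1": {"_source": {"user": "kim", "age": 30}, "_version": 1}}
updated = partial_update(store, "1", {"age": 31})
print(updated)  # {'_source': {'user': 'kim', 'age': 31}, '_version': 2}
```

In the real system the version (or sequence number) is checked at write time, so a concurrent update that landed in between causes a conflict error instead of a silent overwrite.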
3. Elasticsearch Write Process
Any node can act as a coordinating node. The coordinating node receives the request, determines routing, forwards it to the appropriate primary shard, and after the primary writes, the request is replicated to all replica shards. The overall flow is illustrated below:
3.1 Coordinating Node
The coordinating node performs the following steps:
Check for an ingest pipeline and execute it if applicable.
Auto‑create the index if it does not exist and auto‑creation is enabled.
Determine the routing value (_routing, defaulting to _id).
Group bulk requests by target shard into BulkShardRequest objects.
Send each BulkShardRequest to the corresponding primary shard.
Wait for the primary shard’s response.
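Steps 3 and 4 above, routing and per-shard grouping, can be sketched together (shard_for is a stand-in for the hash(_routing) % num_primary_shards computation; the dict of lists stands in for BulkShardRequest objects):

```python
# Sketch of a coordinating node grouping bulk operations by target
# shard before fan-out to primaries (illustrative only).
from collections import defaultdict
import zlib

def shard_for(routing: str, num_shards: int = 3) -> int:
    # stand-in for ES's Murmur3-based routing formula
    return zlib.crc32(routing.encode("utf-8")) % num_shards

def group_by_shard(bulk_ops):
    """Group operations into per-shard batches (one BulkShardRequest each)."""
    groups = defaultdict(list)
    for op in bulk_ops:
        routing = op.get("routing", op["id"])  # _routing defaults to _id
        groups[shard_for(routing)].append(op)
    return dict(groups)

ops = [{"id": "a"}, {"id": "b"}, {"id": "c", "routing": "a"}]
# "a" and "c" share a routing value, so they land in the same batch
print(group_by_shard(ops))
```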
3.2 Primary Shard
The primary shard processes the request through a series of steps:
Identify the operation type (index, delete, update).
Convert update operations into a combination of index and delete actions.
Parse the document and add internal fields (e.g., _uid).
Update mappings for new fields according to dynamic mapping rules.
Obtain a sequence number and version from the SequenceNumbersService.
Write the document to Lucene, handling version checks and ensuring atomicity.
Write the operation to the translog.
Rebuild the bulk request to contain only index or delete actions.
Flush the translog to disk (synchronously or asynchronously based on durability settings).
Send the bulk request to all replica shards and wait for their acknowledgments.
Update the primary shard’s local checkpoint after all replicas respond.
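The sequencing in steps 5, 6, and 10 can be pictured with a toy primary/replica pair: the primary assigns each operation a monotonically increasing sequence number, and replicas apply the operation with that same number so both histories stay aligned (illustrative only; class and field names are invented):

```python
# Toy sketch of primary-assigned sequence numbers being reused by
# replicas during replication (not Elasticsearch internals).
class ToyReplica:
    def __init__(self):
        self.translog = []

    def replicate(self, op):
        # the replica applies the op with the primary's seq_no
        self.translog.append(op)

class ToyPrimary:
    def __init__(self, replicas):
        self.seq_no = -1
        self.replicas = replicas
        self.translog = []

    def index(self, doc):
        self.seq_no += 1                       # assign next sequence number
        op = {"seq_no": self.seq_no, "doc": doc}
        self.translog.append(op)               # Lucene + translog (elided)
        for r in self.replicas:
            r.replicate(op)                    # fan out, wait for acks
        return self.seq_no

replica = ToyReplica()
primary = ToyPrimary([replica])
primary.index({"field": 1})
print(replica.translog[0]["seq_no"])  # 0 -- same seq_no as the primary
```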
3.3 Replica Shard
Each replica shard executes a simplified flow:
Determine the operation type (add or delete).
Parse the document.
Update mappings if needed.
Use the sequence ID and version supplied by the primary.
Write the document to Lucene.
Write to the translog.
Flush the translog.
4. Summary and Analysis
Elasticsearch extends Lucene with a distributed architecture, shard replication, near‑real‑time refresh, translog, and periodic flush/merge operations, achieving high reliability, scalability, and flexible partial updates while preserving Lucene’s powerful search capabilities.
References
Elasticsearch Core Analysis – Write Path (Zhihu)
Official Elasticsearch Documentation