Mastering Elasticsearch: Core Concepts, Indexing, and Real‑Time Search Explained

This comprehensive guide walks through Elasticsearch fundamentals, including its architecture, core concepts like indices, shards, and replicas, the write and update processes, search workflow, consistency mechanisms, master election, performance tuning, and strategies for deep pagination and scroll searches.

Java Interview Crash Guide

May 21, 2021

1. Elasticsearch Basics

Elasticsearch is a distributed, near‑real‑time, RESTful full‑text search engine built on Lucene; by default every field is indexed and searchable, enabling fast storage, search, and analysis of massive data sets.

Full‑text retrieval creates an index for each term, recording its frequency and position, allowing queries to be answered by consulting these indexes, similar to looking up words in a dictionary.

Key concepts:

Index: comparable to a MySQL database; stores a collection of similarly structured documents.

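Type: a logical classification of documents within an index, akin to a table. Mapping types were deprecated starting with Elasticsearch 6.x and removed in later major versions, so the concept applies mainly to older clusters.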

Document: the smallest unit of data, similar to a row, but each document may have different fields.

Field: the atomic unit inside a document; a document contains multiple fields.

Shard: an index is split into multiple shards that can be distributed across nodes for horizontal scaling and higher throughput.

Replica: a copy of a primary shard that provides fault tolerance and adds search capacity. A typical high‑availability setup uses 5 primary shards with one replica each (5 replica shards) across at least two nodes, because a replica is never allocated to the same node as its primary.

Inverted index maps terms to the document IDs that contain them, enabling fast query resolution.

DocValues supplement the inverted index with a column‑oriented structure that stores doc‑ID‑to‑field‑value mappings on disk, allowing efficient sorting and aggregation without exhausting heap memory.
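The two structures above point in opposite directions, which a toy example can make concrete. The following is a minimal in-memory sketch, not Lucene's on-disk format; the corpus and the `price` column are made up for illustration.

```python
from collections import defaultdict

# Toy corpus: doc ID -> text. The documents are invented for illustration.
docs = {
    1: "quick brown fox",
    2: "quick red fox",
    3: "lazy brown dog",
}

# Inverted index: term -> list of doc IDs that contain the term.
inverted = defaultdict(list)
for doc_id in sorted(docs):
    for term in sorted(set(docs[doc_id].split())):
        inverted[term].append(doc_id)

# Doc-values-style column: doc ID -> field value (the reverse direction),
# which is the lookup that sorting and aggregations need.
price_doc_values = {1: 30, 2: 10, 3: 20}

def search(term):
    """Return the doc IDs whose text contains the term."""
    return inverted.get(term, [])

def sort_by_price(doc_ids):
    """Order matches by the column value without re-reading the documents."""
    return sorted(doc_ids, key=lambda d: price_doc_values[d])

print(search("quick"))                 # [1, 2]
print(sort_by_price(search("brown")))  # [3, 1]
```

Answering "which documents contain this term" uses the first structure; answering "what is this document's value" uses the second.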

Text vs. Keyword: keyword fields are not analyzed and support exact‑match searches; text fields are analyzed (tokenized) before indexing.

Stop‑word filtering removes common terms that carry little meaning (e.g., "的" and "而" in Chinese, or "the" and "a" in English) from the index.

Query vs. Filter: queries compute relevance scores, while filters only test boolean criteria and can be cached for better performance.
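As an illustration of the text/keyword and query/filter distinctions, the request bodies might look like the following sketch. The field names (`title`, `status`, `price`) are hypothetical, and the mapping uses the typeless format of recent Elasticsearch versions; the bodies are only built and printed here, not sent to a cluster.

```python
import json

# Hypothetical mapping: "title" is analyzed, "status" is matched exactly.
index_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "status": {"type": "keyword"},
            "price": {"type": "integer"},
        }
    }
}

search_body = {
    "query": {
        "bool": {
            # Query context: contributes to the relevance score.
            "must": [{"match": {"title": "elasticsearch guide"}}],
            # Filter context: yes/no only, no scoring, cacheable.
            "filter": [
                {"term": {"status": "published"}},
                {"range": {"price": {"lte": 50}}},
            ],
        }
    }
}

print(json.dumps(index_body, indent=2))
print(json.dumps(search_body, indent=2))
```

Putting exact-match and range conditions in `filter` rather than `must` keeps them out of scoring, which is the performance point made above.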

2. Elasticsearch Write Process

The write flow involves a coordinating node that routes the document to the appropriate primary shard, which applies the operation and then forwards it to its replica shards. The coordinating node returns a response only after the primary and all of its in‑sync replicas have processed the request.
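The routing step can be sketched as a hash of the routing value taken modulo the number of primary shards. This is a simplified illustration: `zlib.crc32` is used here only as a stand-in and is an assumption, not the hash function Elasticsearch itself uses, and `pick_shard` is a hypothetical helper name.

```python
import zlib

def pick_shard(routing_key: str, number_of_primary_shards: int) -> int:
    """Map a routing key (the document ID by default) to a primary shard number."""
    # Assumption: CRC32 stands in for the hash Elasticsearch actually uses.
    hashed = zlib.crc32(routing_key.encode("utf-8"))
    return hashed % number_of_primary_shards

for doc_id in ["user-1", "user-2", "user-3"]:
    print(doc_id, "-> shard", pick_shard(doc_id, 5))
```

Because the shard number depends on the modulus, the same key would land on a different shard if the number of primary shards changed, which is why that number is fixed when the index is created.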

Internally, documents are first written to a memory buffer. By default, every second the buffer is written to a new segment in the filesystem cache (the refresh operation), which makes the data searchable before it has been fsynced to disk.

To guarantee durability, each write is also appended to a transaction log (translog). If a node crashes, Elasticsearch replays the translog on restart to recover operations that had not yet been committed to disk.

When the translog grows beyond a size or time threshold, a flush creates a commit point, forces all cached data to disk, and starts a new translog.
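The interplay of buffer, refresh, translog, and flush described above can be modeled with a small toy class. This is a sketch of the idea only; `ShardSketch` and its attributes are hypothetical and do not correspond to Elasticsearch's internal classes.

```python
class ShardSketch:
    """Toy model of one shard's write path (not Elasticsearch's real classes)."""

    def __init__(self):
        self.buffer = []     # in-memory indexing buffer, not yet searchable
        self.segments = []   # searchable segments (filesystem cache)
        self.translog = []   # operations recorded since the last commit
        self.committed = 0   # number of segments persisted by the last flush

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)  # durability record for crash recovery

    def refresh(self):
        """Turn the buffer into a new searchable segment."""
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        """Commit: refresh, persist all segments, start a fresh translog."""
        self.refresh()
        self.committed = len(self.segments)
        self.translog.clear()

    def search(self):
        return [doc for seg in self.segments for doc in seg]

shard = ShardSketch()
shard.index({"id": 1})
print(shard.search())       # [] : written but not refreshed yet
shard.refresh()
print(shard.search())       # [{'id': 1}]
print(len(shard.translog))  # 1 : still needed until the next flush
shard.flush()
print(len(shard.translog))  # 0
```

The gap between `index` and `refresh` is the reason a freshly written document may not appear in search results for up to one refresh interval.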

3. Update and Delete Process

Segments are immutable, so a deletion is recorded in a .del file that marks the document as deleted. The document may still be matched inside its segment, but it is filtered out of search results; it stays on disk until a merge removes it.

Updates are performed by marking the old document as deleted and indexing a new version. During segment merges, deleted documents are physically removed, and a new commit point is written.
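A toy model may help show why updates and deletes only cost disk space until the next merge. The classes and functions below (`SegmentSketch`, `update`, `merge`) are hypothetical names for illustration, with a Python set standing in for the .del file.

```python
class SegmentSketch:
    """Toy immutable segment plus a deletion marker set (like a .del file)."""

    def __init__(self, docs):
        self.docs = dict(docs)  # doc ID -> source; never modified after creation
        self.deleted = set()    # IDs marked as deleted

    def live_docs(self):
        return {i: d for i, d in self.docs.items() if i not in self.deleted}

def update(segments, doc_id, new_source):
    """Update = mark old copies deleted, then write the new version to a new segment."""
    for seg in segments:
        if doc_id in seg.docs:
            seg.deleted.add(doc_id)
    segments.append(SegmentSketch({doc_id: new_source}))

def merge(segments):
    """A merge copies only live docs, so deleted ones are physically dropped."""
    merged = {}
    for seg in segments:
        merged.update(seg.live_docs())
    return [SegmentSketch(merged)]

segments = [SegmentSketch({1: "v1", 2: "other"})]
update(segments, 1, "v2")
print(len(segments), segments[0].deleted)  # 2 {1}
segments = merge(segments)
print(segments[0].docs)                    # {2: 'other', 1: 'v2'}
```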

4. Search Workflow

Elasticsearch executes searches in two phases: Query and Fetch. In the query phase, the coordinating node broadcasts the request to one copy (primary or replica) of each relevant shard, and each shard builds a local priority queue of from + size matching document IDs with their sort values. The coordinating node merges these results, sorts them globally, and selects the requested page. In the fetch phase, it retrieves the actual documents for that page from the shards that hold them.
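The two phases can be sketched with plain Python data structures. The shard contents, scores, and function names below are invented for illustration; the point is that only lightweight (score, ID) pairs travel in the query phase, and full documents are loaded only for the final page.

```python
import heapq

# Hypothetical per-shard data: doc ID -> (score, source).
shards = [
    {"a": (0.9, "doc a"), "b": (0.4, "doc b")},
    {"c": (0.7, "doc c"), "d": (0.8, "doc d")},
]

def query_phase(shard, top_n):
    """Each shard returns only (score, doc ID) for its local top_n hits."""
    hits = [(score, doc_id) for doc_id, (score, _) in shard.items()]
    return heapq.nlargest(top_n, hits)

def search(shards, from_, size):
    top_n = from_ + size
    # Query phase: gather lightweight (score, id, shard index) tuples.
    candidates = []
    for idx, shard in enumerate(shards):
        candidates += [(s, d, idx) for s, d in query_phase(shard, top_n)]
    # The coordinator merges, sorts globally, and cuts out the requested page.
    page = sorted(candidates, reverse=True)[from_:from_ + size]
    # Fetch phase: load full documents only for the final page.
    return [shards[idx][doc_id][1] for _, doc_id, idx in page]

print(search(shards, from_=0, size=2))  # ['doc a', 'doc d']
print(search(shards, from_=2, size=2))  # ['doc c', 'doc b']
```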

Documents are routed to shards by hashing the routing value (the document ID by default), and search load is balanced by choosing among the available copies of each shard, primary or replica, for each request. For more accurate scoring when shards hold few documents, the DFS Query Then Fetch variant first runs a pre‑query to gather global term and document frequencies.

5. Consistency Under High Concurrency

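Updates use optimistic concurrency control via the _version field; a write succeeds only if the supplied version matches the current one. In newer releases (6.7 and later) the same check is expressed with the if_seq_no and if_primary_term parameters.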
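The version check amounts to a compare-and-set. The following is a minimal in-memory sketch of that idea; `DocStoreSketch` and `VersionConflict` are hypothetical names and not part of any Elasticsearch client.

```python
class VersionConflict(Exception):
    pass

class DocStoreSketch:
    """Toy store: doc ID -> (version, source). Not the Elasticsearch API."""

    def __init__(self):
        self.docs = {}

    def get(self, doc_id):
        return self.docs[doc_id]

    def put(self, doc_id, source, expected_version=None):
        current = self.docs.get(doc_id, (0, None))[0]
        # Reject the write if the caller's view of the document is stale.
        if expected_version is not None and expected_version != current:
            raise VersionConflict(f"expected {expected_version}, found {current}")
        self.docs[doc_id] = (current + 1, source)
        return current + 1

store = DocStoreSketch()
store.put("1", {"stock": 10})                           # creates version 1
version, source = store.get("1")
store.put("1", {"stock": 9}, expected_version=version)  # ok, now version 2
try:
    # A second writer still holding version 1 loses the race.
    store.put("1", {"stock": 8}, expected_version=version)
except VersionConflict as exc:
    print("conflict:", exc)  # conflict: expected 1, found 2
```

On a conflict, the usual client-side response is to re-read the document, reapply the change, and retry.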

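Write consistency can be set to quorum (default), one, or all, determining how many shard copies must be available before the operation proceeds. This consistency parameter belongs to older releases; from Elasticsearch 5.0 it was replaced by wait_for_active_shards, which defaults to 1 (the primary only).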

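Read consistency can be enforced by setting replication=sync (the default), so that a write returns only after the replicas have applied it, or by explicitly directing reads to the primary shard with preference=_primary. Both options apply to older releases and have since been removed.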

6. Master Node Election

In versions before 7.0, Elasticsearch uses the ZenDiscovery module for master election, which discovers peer nodes by pinging a unicast host list.

Nodes with node.master: true are candidates. The election selects the node with the lowest ID after sorting candidates; a node becomes master once it receives votes from at least discovery.zen.minimum_master_nodes nodes and votes for itself.

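To avoid split‑brain scenarios, the minimum master nodes setting should be configured to (N / 2) + 1, using integer division, where N is the number of master‑eligible nodes. This is a strict majority, so two disjoint partitions can never both elect a master.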
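The quorum arithmetic is easy to get wrong for even node counts, so a small worked sketch may help. The function names are hypothetical; the formula is the one stated above.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Strict majority of master-eligible nodes: floor(N / 2) + 1."""
    return master_eligible // 2 + 1

def can_elect(partition_size: int, master_eligible: int) -> bool:
    """A network partition can elect a master only if it holds a majority."""
    return partition_size >= minimum_master_nodes(master_eligible)

for n in (1, 2, 3, 5):
    print(n, "->", minimum_master_nodes(n))
# prints 1 -> 1, 2 -> 2, 3 -> 2, 5 -> 3 (one pair per line)

# With 3 master-eligible nodes split 2 / 1, only the larger side can elect.
print(can_elect(2, 3), can_elect(1, 3))  # True False
```

Note that with two master-eligible nodes the quorum is 2, so losing either node blocks elections; three master-eligible nodes is the smallest count that tolerates one failure.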

7. Indexing Performance Tips

Use SSD storage.

Batch requests (5–15 MB per bulk) to reduce overhead.

Temporarily set index.number_of_replicas: 0 during massive imports.

Increase index.refresh_interval (e.g., to 30 s) if near‑real‑time freshness is not required.

Adjust segment merge throttling based on hardware (e.g., 100–200 MB/s on SSD).

Raise index.translog.flush_threshold_size (e.g., to 1 GB) to reduce flush frequency.
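Several of these tips can be combined in a bulk-load script. The sketch below only builds the request bodies and does not talk to a cluster: the settings dictionary uses the setting names from the tips above with the article's example values, and `bulk_batches` is a hypothetical helper that groups documents into newline-delimited bulk bodies under a byte budget. The index name `demo-index` is made up.

```python
import json

# Setting names from the tips above; the values are the article's examples.
# Such a body would typically be applied through the index settings API.
bulk_load_settings = {
    "index": {
        "number_of_replicas": 0,
        "refresh_interval": "30s",
        "translog": {"flush_threshold_size": "1gb"},
    }
}

def bulk_batches(docs, index_name, max_bytes=10 * 1024 * 1024):
    """Yield newline-delimited bulk bodies, each at most about max_bytes."""
    batch, batch_size = [], 0
    for doc_id, source in docs:
        action = json.dumps({"index": {"_index": index_name, "_id": doc_id}})
        entry = action + "\n" + json.dumps(source) + "\n"
        entry_size = len(entry.encode("utf-8"))
        # Start a new batch when adding this entry would exceed the budget.
        if batch and batch_size + entry_size > max_bytes:
            yield "".join(batch)
            batch, batch_size = [], 0
        batch.append(entry)
        batch_size += entry_size
    if batch:
        yield "".join(batch)

docs = [(str(i), {"n": i}) for i in range(5)]
bodies = list(bulk_batches(docs, "demo-index", max_bytes=120))
print(len(bodies), "bulk bodies")
print(bodies[0])
```

Replicas and the refresh interval would normally be restored to their production values once the import finishes.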

8. Deep Pagination and Scroll Search

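Deep pagination degrades performance because every shard must return its top from + size hits to the coordinating node. By default, requests where from + size exceeds 10,000 (the index.max_result_window setting) are rejected; limit user navigation to a reasonable number of pages (e.g., 100).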
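The cost of deep pagination is simple arithmetic: each shard contributes from + size candidates, and the coordinating node has to merge all of them to return one page. A small sketch, with a hypothetical helper name:

```python
MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window

def coordinator_entries(from_, size, shards):
    """Entries the coordinating node must merge for one from/size page."""
    if from_ + size > MAX_RESULT_WINDOW:
        raise ValueError("from + size exceeds index.max_result_window")
    # Every shard returns its own top (from + size) candidates.
    return (from_ + size) * shards

print(coordinator_entries(0, 10, 5))     # 50
print(coordinator_entries(9990, 10, 5))  # 50000
```

The last page inside the default window costs a thousand times more merge work than the first, for the same ten results.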

For extracting large result sets, use the scroll API: the initial query returns a scroll ID, which is then used to retrieve subsequent batches, each based on a snapshot of the index at the time of the first request.
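The snapshot behavior of scroll can be illustrated with a toy class. This is a simulation of the idea only, not the scroll API itself; `ScrollSketch` and the scroll ID format are invented for the example.

```python
class ScrollSketch:
    """Toy scroll: freezes the matching docs at creation, then hands out batches."""

    def __init__(self, index_docs, batch_size):
        self.scroll_id = "scroll-1"         # made-up ID for illustration
        self._snapshot = list(index_docs)   # point-in-time copy of the index
        self._batch_size = batch_size
        self._pos = 0

    def next_batch(self):
        batch = self._snapshot[self._pos:self._pos + self._batch_size]
        self._pos += len(batch)
        return batch  # an empty batch signals the end

index_docs = ["d1", "d2", "d3", "d4", "d5"]
scroll = ScrollSketch(index_docs, batch_size=2)
index_docs.append("d6")  # indexed after the scroll started: not visible

results = []
while True:
    batch = scroll.next_batch()
    if not batch:
        break
    results.extend(batch)
print(results)  # ['d1', 'd2', 'd3', 'd4', 'd5']
```

The loop shape (request a batch, stop on an empty one) mirrors how a scroll is typically consumed, and the late document `d6` stays invisible because each batch is served from the initial snapshot.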
