
Understanding Elasticsearch: Core Concepts, Architecture, Indexing Mechanics and Performance Optimization

This article explains the fundamentals of structured and unstructured data, introduces Lucene's inverted index, describes Elasticsearch's distributed cluster architecture, node roles, sharding and replication mechanisms, indexing workflow with refresh and translog, storage segment model, and provides practical performance‑tuning recommendations.


Search engines retrieve data, which can be divided into two categories: structured data stored in relational databases and unstructured (full‑text) data such as documents, HTML, images, and videos. Accordingly, search can be classified as structured‑data search or unstructured‑data search.

For unstructured data, full‑text search is required. Lucene, an open‑source library, provides the core inverted‑index mechanism that enables fast keyword lookup. An inverted index lists each unique term and the documents in which it appears, as shown in the example table below.

Term          | Doc_1 | Doc_2 | Doc_3
--------------+-------+-------+------
Java          |   X   |       |
is            |   X   |   X   |   X
the           |   X   |   X   |   X
best          |   X   |   X   |   X
programming   |   X   |   X   |   X
language      |   X   |   X   |   X
PHP           |       |   X   |
Javascript    |       |       |   X

Lucene's key terms include Term (the smallest searchable unit), Term Dictionary (the sorted collection of all terms), Postings List (the list of document IDs in which a term appears), and Inverted File (the physical file storing the inverted index).
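These pieces can be illustrated with a toy indexer in Python. This is a sketch of the idea only; Lucene's actual term dictionary and postings lists use far more compact, compressed structures.

```python
# Toy inverted index: a term dictionary mapping each term to its postings
# list (the sorted document IDs containing that term).
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns term -> sorted list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "Java is the best programming language",
    2: "PHP is the best programming language",
    3: "Javascript is the best programming language",
}
index = build_inverted_index(docs)
print(index["java"])  # -> [1]
print(index["is"])    # -> [1, 2, 3]
```

A keyword lookup is now a single dictionary access instead of a scan over every document, which is the core reason an inverted index makes full-text search fast.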

Elasticsearch builds on Lucene to provide a distributed, near‑real‑time search and analytics engine. It hides Lucene's complexity behind a RESTful API and adds features such as clustering, automatic node discovery (Zen Discovery), and built‑in coordination.

A cluster consists of one or more nodes that share the same cluster.name. Nodes can take on different roles:

Master‑eligible node: participates in master election and manages cluster state.

Data node: stores shards and handles CRUD and aggregation operations.

Coordinating node: receives client requests, routes them to the appropriate shards, and merges results. Any node can act as a coordinating node.

Zen Discovery uses unicast (or file‑based) host lists to find other nodes, elects a master from the master‑eligible nodes, and enforces a discovery.zen.minimum_master_nodes quorum, conventionally set to (master‑eligible nodes / 2) + 1, to avoid split‑brain scenarios.
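A minimal elasticsearch.yml sketch illustrating these roles and settings (pre‑7.x syntax, since Zen Discovery was replaced by a new coordination layer in Elasticsearch 7.x; the host addresses are placeholders):

```yaml
cluster.name: my-cluster          # nodes sharing this name form one cluster
node.name: node-1
node.master: true                 # master-eligible
node.data: true                   # stores shards
# Setting both of the above to false yields a dedicated coordinating node.
discovery.zen.ping.unicast.hosts: ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
discovery.zen.minimum_master_nodes: 2   # quorum for 3 master-eligible nodes
```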

Elasticsearch splits an index into primary shards and replica shards. The number of primary shards is fixed at index creation because the routing formula shard = hash(routing) % number_of_primary_shards uses the document's _id (or a custom routing value) to deterministically select the target shard. Replicas provide high availability; a replica is never placed on the same node as its primary.
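The routing formula can be demonstrated in a few lines of Python. Elasticsearch actually uses a murmur3 hash of the routing value; a simple deterministic hash stands in for it here.

```python
# Sketch of shard routing: hash the routing value (the document _id by
# default) and take it modulo the number of primary shards.
def select_shard(routing: str, number_of_primary_shards: int) -> int:
    # Toy deterministic hash (Elasticsearch uses murmur3); Python's built-in
    # hash() is salted per process, so it is avoided for reproducibility.
    h = sum(ord(c) * 31**i for i, c in enumerate(routing))
    return h % number_of_primary_shards

# The same ID always maps to the same shard...
assert select_shard("doc-42", 5) == select_shard("doc-42", 5)
# ...which is why the primary shard count cannot change after index creation:
# a different shard count would map existing IDs to different shards.
print(select_shard("doc-42", 5), select_shard("doc-42", 6))
```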

Indexing workflow:

Client sends a write request to any node (coordinator).

The coordinator computes the target primary shard via the routing formula and forwards the request.

The primary writes the document to its in‑memory buffer and appends the operation to the transaction log (translog) for durability.

Periodically (default 1 s) a refresh creates a new immutable segment in the file‑system cache, making the document searchable.

When the translog reaches its size threshold (512 MB by default) or 30 minutes elapse, a flush persists the segments to disk, writes a commit point, and clears the translog.

Replica shards receive the operation, write it to their own translog, and acknowledge success before the coordinator returns to the client.
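Concretely, a single write through this pipeline is an ordinary REST call; the index name and fields below are invented for illustration:

```
PUT /products/_doc/1
{ "name": "laptop", "price": 999 }
```

The response (with fields such as _index, _id, and "result": "created") is returned only after the primary and its in‑sync replicas have applied the operation, and the document becomes searchable after the next refresh.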

Segments are immutable; deletions are recorded in a .del file and physically removed only during background segment merging, which also consolidates small segments into larger ones to reduce file‑handle and CPU overhead.
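Merging normally runs in the background, but for an index that is no longer written to it can be triggered explicitly with the force‑merge API (an expensive operation; the index name is a placeholder):

```
POST /my-index/_forcemerge?max_num_segments=1
```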

Performance‑tuning recommendations include:

Use SSDs and RAID‑0 for high I/O throughput; avoid remote mounts (NFS, SMB) and slow cloud block storage.

Choose sequential, compressible document IDs instead of random UUIDs to improve Lucene’s term dictionary compression.

Disable doc_values on fields that are never used for sorting or aggregations.

Prefer keyword over text for fields that do not require full‑text analysis.

Adjust index.refresh_interval (e.g., 30s) or disable refresh entirely (-1) during bulk ingestion, and temporarily set number_of_replicas to 0.

Use routing values to target specific shards for better query locality.

Configure the JVM heap (Xms = Xmx) to at most 50% of physical RAM, keep it below roughly 32 GB so compressed object pointers remain enabled, and consider G1GC for large heaps.

Allocate sufficient RAM for the OS file‑system cache, as Elasticsearch heavily relies on it for fast search.
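The bulk‑ingestion advice above translates into a pair of dynamic settings calls (the index name is a placeholder):

```
PUT /my-index/_settings
{ "index": { "refresh_interval": "-1", "number_of_replicas": 0 } }

# After the bulk load completes, restore searchability and redundancy:
PUT /my-index/_settings
{ "index": { "refresh_interval": "1s", "number_of_replicas": 1 } }
```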

By understanding Lucene’s inverted index, Elasticsearch’s cluster architecture, shard routing, and segment lifecycle, developers can design scalable search solutions and apply the above optimizations to achieve high throughput and low latency.

Written by

Selected Java Interview Questions

A professional Java tech channel sharing common knowledge to help developers fill gaps. Follow us!
