Elasticsearch Overview: Core Concepts, Architecture, and Performance Optimization

This article provides a comprehensive overview of Elasticsearch, covering its data types, Lucene-based inverted index, cluster architecture, sharding and replication mechanisms, mapping definitions, basic usage, health monitoring, storage internals, and practical performance tuning tips for large‑scale search deployments.

Selected Java Interview Questions

Mar 9, 2022

Data in Everyday Life

Data can be structured (tables stored in relational databases) or unstructured (documents, images, audio, video, etc.). The two kinds call for two kinds of search: structured search over fields with fixed types and values, and full‑text search over unstructured content.

Lucene and Inverted Index

Apache Lucene provides the core inverted‑index technology used by Elasticsearch. An inverted index maps each term to the documents that contain it, enabling fast full‑text retrieval.

Term          | Doc_1 | Doc_2 | Doc_3
--------------|-------|-------|------
Java          |   X   |       |
is            |   X   |   X   |   X
the           |   X   |   X   |   X
best          |   X   |   X   |   X
programming   |   X   |   X   |   X
language      |   X   |   X   |   X
PHP           |       |   X   |
Javascript    |       |       |   X

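Key terminology includes the term (a single indexed word), the term dictionary (the sorted set of all terms), the posting list (the list of document IDs that contain a given term), and the inverted file (the file on disk that stores the posting lists).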
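
The table above can be reproduced with a few lines of Python. This is only a minimal sketch: the three sentences are an assumption inferred from the table, the function names are hypothetical, and splitting on whitespace with lowercasing stands in for a real Lucene analyzer.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of doc ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # The sorted keys play the role of the term dictionary,
    # each value the role of a posting list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

def search(index, query):
    """Return the doc ids that contain every term of the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Assumed sample documents, chosen to match the table.
docs = {
    1: "Java is the best programming language",
    2: "PHP is the best programming language",
    3: "Javascript is the best programming language",
}
index = build_inverted_index(docs)
```

A lookup such as search(index, "php language") only intersects two short posting lists instead of scanning every document, which is where the speed of full‑text retrieval comes from.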

Elasticsearch Core Concepts

Elasticsearch is a distributed, near‑real‑time search and analytics engine built on Lucene. It offers a simple RESTful API, automatic clustering, sharding, replication, and high availability.

Cluster

A cluster consists of one or more nodes that share the same cluster.name. In versions before 7.0, node discovery and master election are handled by Zen Discovery, which supports unicast and file‑based host lists; from 7.0 onward the setting below is replaced by discovery.seed_hosts.

discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]

Node roles are configured in elasticsearch.yml (e.g., node.master: true, node.data: true; from 7.9 onward the node.roles list replaces these boolean settings).

Sharding and Replicas

Indices are split into primary shards; each primary can have replica shards for fault tolerance. Document routing uses the formula:

shard = hash(routing) % number_of_primary_shards

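Routing defaults to the document _id but can be customized. Because the result depends on number_of_primary_shards, the number of primary shards is fixed when the index is created, while the number of replicas can be changed at any time.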
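
The routing formula can be tried out in Python. In this rough sketch zlib.crc32 is only a stand‑in hash (Elasticsearch uses its own Murmur3‑based function internally), and pick_shard and the doc-N ids are hypothetical names for illustration.

```python
import zlib

def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    # zlib.crc32 is a stand-in hash for illustration only;
    # Elasticsearch uses its own Murmur3-based hash internally.
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

# Route the same 1000 ids with 5 and then with 6 primary shards.
doc_ids = [f"doc-{i}" for i in range(1000)]
with_5 = {d: pick_shard(d, 5) for d in doc_ids}
with_6 = {d: pick_shard(d, 6) for d in doc_ids}

# Count the documents that would land on a different shard.
moved = sum(1 for d in doc_ids if with_5[d] != with_6[d])
```

Most ids land on a different shard once the divisor changes, so a lookup by _id would go to the wrong shard; this is the reason the primary shard count cannot simply be edited on a live index.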

PUT /my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}

Mapping

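Mappings define field types (text, keyword, integer, date, etc.) and can be dynamic (inferred from the first document that contains a field) or explicit. The example below is an explicit mapping in the 6.x style, with the _doc mapping type; in 7.0 and later mapping types are removed and properties sits directly under mappings.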

PUT my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {"type": "text"},
        "name":  {"type": "text"},
        "age":   {"type": "integer"},
        "created": {"type": "date", "format": "strict_date_optional_time||epoch_millis"}
      }
    }
  }
}
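
To see why an explicit mapping is often worth the effort, here is a very rough Python imitation of what dynamic mapping would guess for the same fields. The function names are hypothetical and the date check via datetime.fromisoformat is only an approximation of Elasticsearch's date detection.

```python
from datetime import datetime

def guess_field_type(value):
    """Very rough imitation of dynamic mapping type detection."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)
            return "date"
        except ValueError:
            # Elasticsearch also adds a keyword sub-field by default.
            return "text"
    raise TypeError(f"unsupported value: {value!r}")

def guess_mapping(doc):
    """Build a properties block from one sample document."""
    return {"properties": {k: {"type": guess_field_type(v)} for k, v in doc.items()}}

sample = {"title": "Elasticsearch overview", "age": 30, "created": "2022-03-09"}
guessed = guess_mapping(sample)
```

In this sketch age is guessed as long rather than the integer declared above, which shows how an explicit mapping gives control over storage size and field behavior that dynamic mapping does not.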

Basic Usage

Download and unzip Elasticsearch, then start it with bin/elasticsearch. The default HTTP port is 9200; accessing http://localhost:9200 returns a JSON object with cluster name, node name, version, and tagline.

Cluster Health

Health status is reported as green (all primary and replica shards active), yellow (all primaries active but some replicas missing), or red (one or more primary shards unavailable).
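
The three colors follow directly from the state of the shards. A minimal sketch of that rule, assuming each shard is described by a hypothetical (is_primary, is_active) pair:

```python
def cluster_health(shards):
    """Derive the health color from (is_primary, is_active) pairs."""
    shards = list(shards)
    if any(is_primary and not is_active for is_primary, is_active in shards):
        return "red"      # at least one primary is unavailable
    if any(not is_primary and not is_active for is_primary, is_active in shards):
        return "yellow"   # all primaries active, some replica missing
    return "green"        # every primary and replica is active

# A single-node cluster with one replica per primary: the replica cannot
# be placed on the same node as its primary, so it stays unassigned.
single_node = [(True, True), (False, False)]
status = cluster_health(single_node)
```

This is why a freshly started single‑node cluster with number_of_replicas set to 1 typically reports yellow rather than green.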

Write Path and Storage

When a document is indexed, it is first written to an in‑memory indexing buffer and appended to the transaction log (translog). A refresh (every 1 s by default, or when the buffer fills up) turns the buffer into a new immutable segment in the file‑system cache, which makes the documents searchable; until the next flush, the translog is what protects them against a crash. When the translog reaches 512 MB or 30 min have passed, a flush fsyncs the segments to disk, writes a commit point, and clears the translog.
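
The difference between indexing, refresh, and flush is easier to follow in a toy model. The ToyShard class below is a hypothetical illustration of the sequence described above, not how Lucene or Elasticsearch is actually implemented.

```python
class ToyShard:
    """Toy model of the write path: buffer -> refresh -> flush."""

    def __init__(self):
        self.buffer = []     # in-memory indexing buffer, not yet searchable
        self.translog = []   # operations recorded since the last flush
        self.segments = []   # searchable segments (tuples, hence immutable)
        self.committed = 0   # number of segments covered by the last commit point

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Turn the buffer into a new searchable segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Persist all segments, record a commit point, clear the translog.
        self.refresh()
        self.committed = len(self.segments)
        self.translog.clear()

    def search(self, predicate):
        return [d for seg in self.segments for d in seg if predicate(d)]

shard = ToyShard()
shard.index({"id": 1, "title": "hello"})
before_refresh = shard.search(lambda d: d["id"] == 1)   # not visible yet
shard.refresh()
after_refresh = shard.search(lambda d: d["id"] == 1)    # visible now
```

The gap between before_refresh and after_refresh is the "near‑real‑time" delay: a document is safe in the translog right away but only becomes searchable after the next refresh.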

Segments are immutable; deletions are recorded in a .del file and physically removed only during background segment merging, which also consolidates small segments into larger ones to reduce file‑handle and CPU overhead.
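
Deletes and merging can be sketched the same way. In this simplified illustration a set of ids plays the role of the .del file, and merge_segments and search_ids are hypothetical helper names.

```python
def search_ids(segments, deleted_ids):
    """Before a merge, deleted docs are still stored and are filtered at search time."""
    return [doc["id"] for seg in segments for doc in seg
            if doc["id"] not in deleted_ids]

def merge_segments(segments, deleted_ids):
    """Merge all segments into one new segment, dropping docs marked as deleted."""
    merged = tuple(doc for seg in segments for doc in seg
                   if doc["id"] not in deleted_ids)
    return [merged] if merged else []

segments = [({"id": 1}, {"id": 2}), ({"id": 3},)]
deleted = {2}   # stands in for the .del file

visible_before = search_ids(segments, deleted)
stored_before = sum(len(seg) for seg in segments)

merged = merge_segments(segments, deleted)
stored_after = sum(len(seg) for seg in merged)
```

Search results are the same before and after the merge; what changes is that the deleted document no longer takes up space and there is one segment to open instead of two.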

Performance Optimizations

Use SSDs, RAID‑0, or multiple path.data directories to maximize I/O throughput.

Avoid remote mounts (NFS/SMB) and be cautious with cloud block storage such as AWS EBS.

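Lucene already compresses the term index with an FST (finite state transducer), so nothing needs tuning there. For bulk indexing, raise or disable index.refresh_interval and temporarily disable replicas (set index.number_of_replicas: 0), then restore both once the load has finished.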
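
As a sketch of the bulk‑indexing settings, the two helpers below build the request bodies that could be sent to the index _settings endpoint before and after a load. The function names and the restore defaults (one replica, 1 s refresh) are assumptions for illustration; sending the requests is left out on purpose.

```python
def bulk_load_settings():
    """Settings body to apply before a large bulk load."""
    return {
        "index": {
            "refresh_interval": "-1",   # disable periodic refresh
            "number_of_replicas": 0,    # skip replica writes during the load
        }
    }

def restore_settings(replicas=1, refresh_interval="1s"):
    """Settings body to apply once the load has finished (assumed defaults)."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }

before_load = bulk_load_settings()
after_load = restore_settings()
```

Restoring the replica count afterwards lets Elasticsearch copy the finished segments to the replicas in one pass instead of indexing every document twice.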

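Set the JVM heap with Xms = Xmx, to no more than 50 % of physical RAM so that the rest is left for the file‑system cache, and keep it below about 32 GB so that compressed object pointers stay enabled. Consider G1GC for better pause‑time behavior.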
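
The heap rule can be written down as a small calculation. This is a sketch under assumptions: the 31 GB cap is a commonly cited safe value for staying below the compressed‑pointer threshold, not a figure from this article, and both function names are hypothetical.

```python
def recommended_heap_gb(physical_ram_gb, compressed_oops_limit_gb=31):
    """Half of physical RAM, capped at an assumed compressed-oops-safe limit."""
    return min(physical_ram_gb // 2, compressed_oops_limit_gb)

def jvm_options(heap_gb):
    """JVM flags with equal minimum and maximum heap, plus G1GC."""
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g", "-XX:+UseG1GC"]

small_node = jvm_options(recommended_heap_gb(16))
large_node = jvm_options(recommended_heap_gb(128))
```

On a 128 GB machine the heap stays at the cap, and the remaining memory is left to the operating system for caching segment files.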

Prefer keyword fields over text when sorting/aggregating, and disable doc values on fields that do not require them.

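For deep pagination, use the scroll API (or search_after) rather than large from+size values: every shard has to collect from+size hits for each request, and by default index.max_result_window caps from+size at 10,000.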
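
The cost of from+size paging can be estimated with a toy model. In the sketch below each shard is just a list of integers sorted in ascending order (standing in for sort values), and cursor_page is a simplified picture of cursor‑style paging rather than the real scroll implementation; all names are hypothetical.

```python
import heapq

def from_size_page(shards, from_, size):
    """Each shard returns its top from_+size values; the coordinator merges them."""
    per_shard = [sorted(s)[: from_ + size] for s in shards]
    candidates = sum(len(p) for p in per_shard)
    merged = list(heapq.merge(*per_shard))
    return merged[from_: from_ + size], candidates

def cursor_page(shards, after, size):
    """Cursor-style paging: each shard returns only `size` values past the cursor."""
    per_shard = [sorted(v for v in s if after is None or v > after)[:size]
                 for s in shards]
    candidates = sum(len(p) for p in per_shard)
    merged = list(heapq.merge(*per_shard))
    return merged[:size], candidates

# Five shards holding the values 0..49999, spread round-robin.
shards = [list(range(i, 50000, 5)) for i in range(5)]

deep_page, deep_cost = from_size_page(shards, 5000, 10)
cursor_hits, cursor_cost = cursor_page(shards, 4999, 10)
```

Both calls return the same ten values, but the from+size variant makes the coordinating node merge thousands of candidates per request, while the cursor variant only handles a few dozen.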
