Elasticsearch Overview: Architecture, Core Concepts, Indexing Mechanics, and Performance Optimization

This comprehensive article explains what Elasticsearch is, how it builds on Lucene to provide distributed real‑time search and analytics, covering data types, cluster components, shard routing, indexing pipelines, storage formats, segment merging, and practical performance‑tuning tips for production deployments.

Top Architect

Apr 9, 2022

Elasticsearch is an open‑source, Java‑based search engine that uses Apache Lucene for indexing and full‑text search, providing a distributed, near‑real‑time search and analytics platform.

Data in everyday life can be classified as structured (row‑based tables stored in relational databases) or unstructured (documents, HTML, images, video). Structured data is searched via traditional database indexes, while unstructured data requires full‑text search, which is achieved by creating an inverted index.
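To make the idea of an inverted index concrete, here is a minimal sketch in Python. It is not how Lucene stores its index; the whitespace tokenizer, the sample documents, and the function names `build_inverted_index` and `search` are assumptions chosen purely for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis step: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return IDs of documents that contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))


docs = {
    1: "Elasticsearch builds on Lucene",
    2: "Lucene implements inverted indexes",
    3: "Relational databases store structured data",
}
index = build_inverted_index(docs)
print(search(index, "lucene"))           # [1, 2]
print(search(index, "lucene inverted"))  # [2]
```

The lookup goes from term to documents rather than scanning each document for the term, which is what makes full-text search over large collections practical.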

Lucene is a library that implements inverted indexes; Elasticsearch wraps Lucene to hide its complexity and adds a RESTful API, clustering, and distributed capabilities. Solr and Elasticsearch are the two mature search engines built on Lucene, but Elasticsearch includes built‑in clustering and easier deployment.

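Core Elasticsearch concepts include the cluster (a set of nodes that share a cluster name) and the node (each node can be master-eligible, a data node, or both). In versions before 7.0, the Zen Discovery module handles node discovery and master election: nodes find each other through a list of unicast hosts, and the discovery.zen.minimum_master_nodes setting requires a quorum of master-eligible nodes, (master_eligible_nodes / 2) + 1, before a master can be elected, which avoids split-brain scenarios. From 7.0 onward, Zen Discovery was replaced by a new cluster coordination subsystem configured through discovery.seed_hosts and cluster.initial_master_nodes.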
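The quorum rule behind the minimum master node setting can be sketched as a small calculation. The helper names `minimum_master_nodes` and `can_elect_master` are hypothetical; the sketch only illustrates the majority arithmetic, not the election protocol itself.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Smallest majority of the master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect_master(partition_size: int, master_eligible: int) -> bool:
    """A network partition may elect a master only if it holds a quorum."""
    return partition_size >= minimum_master_nodes(master_eligible)


# Three master-eligible nodes split into partitions of two and one:
# only the larger side can elect a master, so two masters cannot coexist.
print(can_elect_master(2, 3))  # True
print(can_elect_master(1, 3))  # False
```

Because two disjoint partitions cannot both contain a majority, at most one side of a network split can elect a master.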

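Node roles are configured in elasticsearch.yml (e.g., node.master: true, node.data: true). Master nodes manage cluster-wide metadata such as index creation and shard allocation, while data nodes store shards and execute indexing and search operations. Coordinating nodes receive client requests, route each write to the appropriate primary shard, fan search requests out to the relevant shards, and merge the results before replying to the client.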

discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]

Elasticsearch stores data in shards (horizontal slices of an index) and replicas (copies of primary shards). The number of primary shards is fixed at index creation, while the number of replicas can be changed at any time to add capacity for high availability. The primary shard count is fixed because it appears in the routing formula shard = hash(routing) % number_of_primary_shards, where routing defaults to the document’s _id: changing the count would send existing documents to different shards.

shard = hash(routing) % number_of_primary_shards
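The routing formula can be tried out with a short Python sketch. The hash used here (CRC32 from the standard library) is a stand-in and is not the hash function Elasticsearch uses; only the modulo step mirrors the formula above. The function name `route` and the sample IDs are hypothetical.

```python
import zlib


def route(routing: str, number_of_primary_shards: int) -> int:
    """Pick a shard as hash(routing) % number_of_primary_shards.

    CRC32 is a stand-in hash for illustration only.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# Shard assignment with 5 primaries versus 6 primaries.
before = {doc_id: route(doc_id, 5) for doc_id in doc_ids}
after = {doc_id: route(doc_id, 6) for doc_id in doc_ids}

moved = sum(1 for doc_id in doc_ids if before[doc_id] != after[doc_id])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Most documents land on a different shard once the divisor changes, which is why the primary shard count cannot simply be altered on an existing index.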

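When a document is indexed, the request is routed to the primary shard, which applies the operation and then forwards it to its replicas. On each shard copy the document is added to an in-memory buffer and appended to the transaction log (translog). The translog provides durability: operations that have not yet been committed to disk by a flush can be replayed from it after a crash.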

Elasticsearch uses a segment storage model: each segment is an immutable, self-contained inverted index stored as a set of files on disk. New documents are written to new segments; because existing segments cannot be modified, deletions are recorded in a .del file and filtered out at search time, and updates are treated as delete-plus-insert. Segments are periodically merged in the background to reduce the number of files, reclaim the space held by deleted documents, and improve search performance.
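The segment model described above can be sketched with a toy in-memory structure. This is an illustrative model under simplifying assumptions, not Lucene's file format: the `Segment` and `Index` classes are hypothetical, and a Python set stands in for the per-segment deletion file.

```python
class Segment:
    """A batch of documents that is never modified, plus a set of deleted IDs."""

    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> source, fixed after creation
        self.deleted = set()     # stands in for the per-segment deletion file


class Index:
    def __init__(self):
        self.segments = []

    def add_segment(self, docs):
        # An update is delete-plus-insert: mark older copies deleted first.
        for doc_id in docs:
            self.delete(doc_id)
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        # Deleting only records a marker; the stored document stays in place.
        for segment in self.segments:
            if doc_id in segment.docs:
                segment.deleted.add(doc_id)

    def get(self, doc_id):
        for segment in reversed(self.segments):
            if doc_id in segment.docs and doc_id not in segment.deleted:
                return segment.docs[doc_id]
        return None

    def merge(self):
        # Copy live documents into one new segment and drop the old ones,
        # which is the point at which deleted documents stop using space.
        live = {}
        for segment in self.segments:
            for doc_id, source in segment.docs.items():
                if doc_id not in segment.deleted:
                    live[doc_id] = source
        self.segments = [Segment(live)]


idx = Index()
idx.add_segment({"1": "first", "2": "second"})
idx.add_segment({"2": "second, updated", "3": "third"})  # updates doc 2
idx.delete("1")
segments_before_merge = len(idx.segments)
idx.merge()
print(segments_before_merge, "->", len(idx.segments))
print(idx.get("2"))
```

Until the merge runs, the old copy of document 2 and the deleted document 1 still occupy space in the first segment; they are only hidden by the deletion markers.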

Refresh operations (by default every second) make newly indexed documents visible to search by writing the in-memory buffer to a new segment in the file system cache and opening it for search; this is why Elasticsearch is described as near-real-time rather than real-time. Flush operations (triggered when the translog reaches 512 MB or after 30 minutes) fsync segments to disk, create a commit point, and clear the translog.
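The interplay of buffer, translog, refresh, and flush can be modeled in a few lines of Python. This is a deliberately simplified sketch: the `Shard` class and its attribute names are hypothetical, lists stand in for files, and `recover` only imitates what happens after a crash.

```python
class Shard:
    def __init__(self):
        self.buffer = []      # in-memory indexing buffer, not yet searchable
        self.translog = []    # operations since the last commit (kept on disk)
        self.searchable = []  # segments opened for search
        self.committed = []   # segments covered by the last commit point

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Turn the buffer into a new searchable segment.
        if self.buffer:
            self.searchable.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        # Persist all segments, record a commit point, clear the translog.
        self.refresh()
        self.committed = list(self.searchable)
        self.translog = []

    def search(self):
        return [doc for segment in self.searchable for doc in segment]

    def recover(self):
        # After a crash only committed segments and the translog survive;
        # replaying the translog restores the uncommitted operations.
        self.searchable = list(self.committed)
        self.buffer = list(self.translog)
        self.refresh()


shard = Shard()
shard.index("a")
visible_before_refresh = shard.search()   # nothing is searchable yet
shard.refresh()
visible_after_refresh = shard.search()    # "a" is now searchable
shard.flush()
shard.index("b")
shard.refresh()
shard.recover()                           # simulate a crash and restart
print(visible_before_refresh, visible_after_refresh, shard.search())
```

Document "b" was never flushed, yet it survives the simulated restart because it was still in the translog.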

PUT my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {"type": "text"},
        "name":  {"type": "text"},
        "age":   {"type": "integer"},
        "created": {"type": "date", "format": "strict_date_optional_time||epoch_millis"}
      }
    }
  }
}

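Performance optimization recommendations start with storage: use SSDs or RAID 0 (which improves throughput but provides no redundancy, so rely on replicas for fault tolerance), avoid remote mounts, and spread data across multiple path.data directories. At the index level, tune refresh_interval (a longer interval means fewer, larger segments), the number of replicas, and routing values so that a query touches only the shards it needs. Removing unnecessary fields, using keyword instead of text when full-text analysis is not needed, and disabling doc values for fields that are never used for sorting or aggregations can also improve speed.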
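As one way to apply these index-level recommendations, the sketch below builds request bodies as plain Python dictionaries. The pattern of relaxing refresh_interval and replicas during a bulk load and restoring them afterwards is a common practice rather than something prescribed by this article, and the helper names and the field names `status` and `session_id` are hypothetical. The bodies are only printed here; sending them (for example to the index's `_settings` endpoint) is left out.

```python
import json


def bulk_load_settings():
    """Settings often applied before a large one-off import (assumed pattern)."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def normal_settings(refresh_interval="1s", number_of_replicas=1):
    """Settings to restore once the import has finished."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": number_of_replicas,
        }
    }


def lean_properties():
    """Mapping properties that avoid unneeded analysis and doc values."""
    return {
        # Exact-match field: keyword skips full-text analysis.
        "status": {"type": "keyword"},
        # Only filtered on, never sorted or aggregated: doc values disabled.
        "session_id": {"type": "keyword", "doc_values": False},
    }


print(json.dumps(bulk_load_settings(), indent=2))
print(json.dumps(normal_settings(), indent=2))
print(json.dumps(lean_properties(), indent=2))
```

Setting refresh_interval to "-1" disables periodic refreshes, so it should be restored after the import or new documents will not become searchable on their own.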

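JVM tuning is crucial: set the heap size with identical -Xms and -Xmx values so the heap is never resized at runtime, keep it at no more than 50 % of physical RAM, and stay below roughly 32 GB so the JVM can continue to use compressed object pointers. Consider the G1 garbage collector, and leave the remaining memory to the OS file-system cache, which Lucene relies on for fast segment reads.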
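The heap sizing rule can be written as a small helper. The 31 GB cap is an assumption chosen to stay safely under the 32 GB limit mentioned above, and the function names `recommended_heap_gb` and `jvm_heap_options` are hypothetical.

```python
def recommended_heap_gb(physical_ram_gb: float, cap_gb: float = 31.0) -> int:
    """Half of physical RAM, capped below 32 GB, rounded down to whole GB."""
    heap = min(physical_ram_gb / 2, cap_gb)
    return max(1, int(heap))


def jvm_heap_options(heap_gb: int) -> list:
    """Identical minimum and maximum heap flags for the JVM."""
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


for ram_gb in (16, 64, 128):
    heap_gb = recommended_heap_gb(ram_gb)
    print(ram_gb, "GB RAM ->", " ".join(jvm_heap_options(heap_gb)))
```

On a 128 GB machine the helper stops at 31 GB, leaving the rest of the memory to the file-system cache rather than the heap.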

Overall, Elasticsearch combines Lucene’s powerful inverted‑index search with distributed architecture, providing scalable, fault‑tolerant search and analytics for both structured and unstructured data.
