Understanding Elasticsearch: Architecture, Core Concepts, and Performance Optimization

This article provides a comprehensive overview of Elasticsearch, covering its role in handling structured and unstructured data, core concepts such as Lucene, inverted indexes, clusters, shards, replicas, mapping, indexing processes, storage mechanisms, and practical performance tuning tips for deployment.

Big Data Technology Architecture

Aug 9, 2019

Understanding Elasticsearch

Elasticsearch is an open‑source, distributed search and analytics engine built on Apache Lucene. It enables fast full‑text search over both structured (e.g., relational tables) and unstructured data (documents, images, videos) by creating inverted indexes.

Core Concepts

Lucene provides the low‑level indexing and query capabilities; Elasticsearch wraps Lucene to expose a RESTful API. The fundamental data structures are the term dictionary and the inverted file, which together form the inverted index.

Each document is tokenized into terms, which are stored in the term dictionary. The inverted file maps each term to the list of documents (postings) where it appears.
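
The term dictionary and postings described above can be sketched in a few lines of Python. This is a toy illustration, not Lucene's implementation: the tokenizer is a simple lowercase regex split, and the names tokenize, build_inverted_index, and search are hypothetical helpers introduced here.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Sorting the keys mimics a term dictionary kept in sorted order.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def search(index, query):
    """Return IDs of documents that contain every term in the query."""
    term_lists = [set(index.get(term, [])) for term in tokenize(query)]
    if not term_lists:
        return []
    return sorted(set.intersection(*term_lists))


docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene builds an inverted index",
    3: "Shards are Lucene indexes",
}
index = build_inverted_index(docs)
```

A query is answered by intersecting postings lists rather than scanning documents, so search(index, "inverted index") only touches the lists for those two terms. Note that this sketch does no stemming, so "index" and "indexes" are distinct terms.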

Cluster and Nodes

A cluster consists of one or more nodes. A node can be master‑eligible, a data node, or a coordinating‑only node, depending on its node.master and node.data settings. The elected master manages cluster state and shard allocation; master‑eligible nodes elect it through Zen Discovery.

Sharding and Replication

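Each index is split into a fixed number of primary shards, and each primary shard can have one or more replica shards for high availability. A document is assigned to a shard by the formula:

shard = hash(routing) % number_of_primary_shards

Routing defaults to the document _id but can be customized. Because the result depends on number_of_primary_shards, the number of primary shards is fixed when the index is created.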
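
The routing formula can be tried out with a small Python sketch. The hash here is an assumption: CRC32 from the standard library stands in for Elasticsearch's own hash function, so the shard numbers it produces will not match a real cluster. pick_shard is a hypothetical helper name.

```python
import zlib


def pick_shard(routing, number_of_primary_shards):
    """Return the primary shard number for a routing value."""
    # Stand-in hash: CRC32 is deterministic across runs, unlike Python's hash().
    hashed = zlib.crc32(routing.encode("utf-8"))
    return hashed % number_of_primary_shards


doc_ids = ["user-1", "user-2", "user-3", "user-4"]
placement_5 = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
placement_6 = {doc_id: pick_shard(doc_id, 6) for doc_id in doc_ids}
```

Comparing placement_5 with placement_6 shows why the shard count matters: the same routing value can land on a different shard once the divisor changes, so existing documents would no longer be found where the formula points.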

Mapping

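Mappings define field types, analyzers, and indexing options. Elasticsearch supports dynamic mapping (automatic type detection) and explicit static mapping for precise control. The example below uses the typed mapping syntax (a _doc level under mappings) accepted by Elasticsearch 6.x; 7.x and later expect properties directly under mappings.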

PUT /my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {"type": "text"},
        "name": {"type": "text"},
        "age": {"type": "integer"},
        "created": {"type": "date", "format": "strict_date_optional_time||epoch_millis"}
      }
    }
  }
}
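
To see why explicit mappings give more control than dynamic mapping, here is a rough Python sketch of automatic type detection. It only loosely follows the dynamic mapping rules: guess_field_type is a hypothetical helper, and datetime.fromisoformat is an assumed stand-in for Elasticsearch's date detection, which uses its own date formats.

```python
from datetime import datetime


def guess_field_type(value):
    """Guess a field type from a JSON value, loosely following dynamic mapping."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)   # rough stand-in for date detection
            return "date"
        except ValueError:
            return "text"
    raise TypeError(f"unsupported value: {value!r}")


doc = {"title": "Hello", "age": 30, "created": "2019-08-09", "active": True}
guessed = {name: guess_field_type(value) for name, value in doc.items()}
```

In this sketch age is guessed as long, whereas the explicit mapping above declares it as integer. That kind of difference is the usual reason to define mappings up front.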

Indexing Process

Incoming documents are written to an in‑memory buffer and appended to the translog. A periodic refresh (every 1 s by default) turns the buffer into a new segment in the OS page cache, which makes the data searchable. When the translog reaches a size or time threshold, a flush persists the segments to disk and clears the translog.
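
The write path described above can be modeled with a small Python class. This is a toy model under simplifying assumptions, not Elasticsearch's code: ShardWriteModel and its attributes are hypothetical names, and plain lists stand in for the buffer, the translog, the page cache, and the disk.

```python
class ShardWriteModel:
    """Toy model of one shard's write path; not Elasticsearch's real code."""

    def __init__(self):
        self.buffer = []             # indexed but not yet searchable
        self.translog = []           # durability log, replayed after a crash
        self.cached_segments = []    # searchable, still only in the page cache
        self.disk_segments = []      # searchable and persisted

    def index(self, doc):
        # A write goes to the in-memory buffer and to the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Refresh turns the buffer into a new searchable segment.
        if self.buffer:
            self.cached_segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        # Flush refreshes, persists cached segments, then clears the translog.
        self.refresh()
        self.disk_segments.extend(self.cached_segments)
        self.cached_segments = []
        self.translog = []

    def searchable_docs(self):
        segments = self.disk_segments + self.cached_segments
        return [doc for segment in segments for doc in segment]


shard = ShardWriteModel()
shard.index({"title": "first"})
before_refresh = shard.searchable_docs()   # nothing visible yet
shard.refresh()
after_refresh = shard.searchable_docs()    # the document is now searchable
translog_before_flush = len(shard.translog)
shard.flush()
```

The gap between before_refresh and after_refresh is the "near real time" delay: a document is durable in the translog as soon as it is indexed, but it only becomes visible to searches after the next refresh.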

Segment Management

Segments are immutable; new data creates new segments. Deleted or updated documents are marked in .del files and later removed during background segment merges, which also consolidate small segments into larger ones to improve search performance.
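
A merge can be sketched as copying the live documents from several immutable segments into one new segment. The Python below is an illustrative model only: Segment, merge_segments, and the deleted set are hypothetical names, with the set playing the role of the per‑segment delete markers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    docs: tuple  # (doc_id, source) pairs; never modified after creation


def merge_segments(segments, deleted, merged_name):
    """Copy live documents into one new segment.

    `deleted` holds (segment name, doc_id) pairs, playing the role of the
    per-segment delete markers.
    """
    live = tuple(
        (doc_id, source)
        for segment in segments
        for doc_id, source in segment.docs
        if (segment.name, doc_id) not in deleted
    )
    return Segment(merged_name, live)


seg_a = Segment("seg_a", ((1, "v1 of doc 1"), (2, "doc 2")))
# An update writes the new version to a new segment and marks the old copy.
seg_b = Segment("seg_b", ((1, "v2 of doc 1"), (3, "doc 3")))
deleted = {("seg_a", 1), ("seg_b", 3)}  # doc 1 was updated, doc 3 was deleted

merged = merge_segments([seg_a, seg_b], deleted, "seg_c")
```

After the merge, the old version of document 1 and the deleted document 3 are gone, and searches need to consult one segment instead of two.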

Performance Tuning

Key optimizations include using SSDs, configuring multiple data paths, disabling unnecessary replicas during bulk loads, adjusting refresh_interval, choosing appropriate shard counts, using keyword fields when full‑text analysis is not needed, and tuning JVM heap and garbage collection.
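
Two of these optimizations, disabling replicas and relaxing refresh_interval during a bulk load, come down to an index settings change before the load and a second change afterwards. The Python sketch below only builds the JSON bodies, which would be sent to the index's _settings endpoint; bulk_load_settings and restore_settings are hypothetical helper names, and the restored values are assumed defaults that should be replaced with whatever the index used before.

```python
import json


def bulk_load_settings():
    """Settings to apply before a large initial load."""
    # "-1" disables periodic refresh; 0 replicas avoids indexing every document twice.
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(refresh_interval="1s", number_of_replicas=1):
    """Settings to apply once the load has finished."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": number_of_replicas,
        }
    }


before_load_body = json.dumps(bulk_load_settings())
after_load_body = json.dumps(restore_settings())
```

Running with zero replicas means a node failure during the load can lose data, so this trade‑off suits initial loads that can be repeated from the source.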

Cluster Health

Cluster health is reported as green (fully functional), yellow (primary shards OK but some replicas missing), or red (primary shards unavailable). Monitoring can be done via the GET /_cluster/health API.
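
The three colors follow directly from which shard copies are assigned. As a minimal sketch, assuming each shard copy is described by a hypothetical (is_primary, is_assigned) pair, the status could be derived like this:

```python
def cluster_status(shards):
    """Derive green/yellow/red from a list of shard copies.

    Each entry is an (is_primary, is_assigned) pair.
    """
    if any(is_primary and not assigned for is_primary, assigned in shards):
        return "red"      # at least one primary shard is unavailable
    if any(not is_primary and not assigned for is_primary, assigned in shards):
        return "yellow"   # all primaries are fine, some replica is missing
    return "green"        # every primary and replica is assigned


all_assigned = [(True, True), (False, True)]
replica_missing = [(True, True), (False, False)]
primary_missing = [(True, False), (False, False)]
```

A single‑node cluster with number_of_replicas set to 1 matches the replica_missing case: a replica is never placed on the same node as its primary, so the status stays yellow until a second node joins.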

Overall, Elasticsearch combines Lucene’s powerful indexing with distributed architecture, offering near‑real‑time search, scalability, and rich configuration for a wide range of data‑driven applications.
