
Elasticsearch Overview: Architecture, Core Concepts, and Performance Optimization

This article provides a comprehensive introduction to Elasticsearch, covering data types, the role of Lucene, cluster architecture, node roles, discovery mechanisms, shard and replica management, mapping, installation, health monitoring, indexing workflow, storage internals, refresh and translog processes, segment merging, and practical performance and JVM tuning tips.

Architecture Digest

Elasticsearch is an open‑source, distributed search and analytics engine built on Apache Lucene; it transforms unstructured data into searchable indexes using inverted indices.

Data in everyday life can be classified as structured (e.g., relational tables) or unstructured (e.g., documents, images, videos). Correspondingly, searches are either structured‑data searches or full‑text searches.

Full‑text search relies on an inverted index: each unique term is listed together with the documents that contain it. Lucene keeps the sorted set of unique terms in a term dictionary and stores, for each term, a posting list of the matching document IDs; together these make up the inverted file.
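The term-to-documents mapping described above can be sketched in a few lines of Python. This is only an illustration, not how Lucene stores its index: the whitespace tokenizer, the function names, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    # Naive analyzer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Sorted keys play the role of the term dictionary;
    # each value plays the role of a posting list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents.
docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene builds an inverted index",
    3: "Elasticsearch exposes a REST API",
}
index = build_inverted_index(docs)
```

A lookup such as `search(index, "elasticsearch lucene")` intersects two posting lists instead of scanning every document, which is the point of the structure.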

Elasticsearch wraps Lucene to provide a RESTful API, distributed capabilities, and easy installation. It uses a cluster of nodes, each identified by cluster.name and node.name. Nodes can be master‑eligible, data, or coordinating nodes, and roles are configured in elasticsearch.yml (e.g., node.master: true, node.data: true).

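Cluster discovery is handled by Zen Discovery, which contacts the hosts in a unicast ping list (e.g., discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]) to find the other nodes and elect a master. To avoid split‑brain scenarios, the discovery.zen.minimum_master_nodes setting defines the quorum required for a valid election; it should be a majority of the master‑eligible nodes. Zen Discovery and this setting apply to Elasticsearch versions before 7.0.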
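The quorum mentioned above is conventionally computed as a strict majority of master-eligible nodes. The sketch below shows that arithmetic; the function names are hypothetical and exist only for this example.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Strict majority of master-eligible nodes: (N // 2) + 1."""
    if master_eligible_nodes < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def can_elect_master(visible_master_eligible, total_master_eligible):
    """True when one side of a network partition still holds a quorum."""
    return visible_master_eligible >= minimum_master_nodes(total_master_eligible)
```

With three master-eligible nodes the quorum is two, so a partition that isolates a single node leaves that node unable to elect itself while the other two carry on.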

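Data is sharded for horizontal scalability; an index is divided into a fixed number of primary shards (e.g., "number_of_shards": 5) and each primary can have replica shards ("number_of_replicas": 1). Shard placement follows the formula shard = hash(routing) % number_of_primary_shards, where routing defaults to the document _id. Because the shard count appears in the formula, changing it would send existing documents to different shards, which is why the number of primary shards is fixed at index creation.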
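The routing formula can be tried out with a short sketch. Note the assumption: `zlib.crc32` is used here only as a stand-in for a stable hash, it is not the hash function Elasticsearch uses internally, and `pick_shard` is a hypothetical helper name.

```python
import zlib


def pick_shard(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards."""
    # Stand-in hash (assumption): CRC32 of the UTF-8 routing value.
    hashed = zlib.crc32(routing.encode("utf-8"))
    return hashed % number_of_primary_shards


# Routing defaults to the document _id; these ids are made up for the example.
placement = {doc_id: pick_shard(doc_id, 5) for doc_id in ("a1", "b2", "c3")}
```

The same routing value always lands on the same shard as long as the divisor stays the same, which is what lets any node locate a document by its _id.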

Mappings define field types (e.g., text, keyword, date, integer), similar to a database schema. An example index creation with settings and mappings is shown:

PUT my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {"type": "text"},
        "name":  {"type": "text"},
        "age":   {"type": "integer"},
        "created": {"type": "date", "format": "strict_date_optional_time||epoch_millis"}
      }
    }
  }
}

Installation is straightforward: download, unzip, and run bin/elasticsearch. By default the node listens for HTTP requests on port 9200, and a GET request to that port returns cluster information as JSON.

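Cluster health is reported as green, yellow, or red. Green means every primary and replica shard is allocated; yellow means all primaries are allocated but at least one replica is not; red means at least one primary shard is unallocated.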
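As a rough sketch, the health color can be derived from shard allocation state: red when any primary is unassigned, yellow when only replicas are unassigned, green otherwise. The shard records below use a made-up shape (`primary` and `assigned` flags) chosen for this example, not the format of any Elasticsearch API response.

```python
def cluster_health(shards):
    """Derive a health color from a list of shard records.

    Each record is a dict with two booleans (hypothetical shape):
    'primary'  - True for a primary shard, False for a replica
    'assigned' - True when the shard is allocated to a node
    """
    if any(s["primary"] and not s["assigned"] for s in shards):
        return "red"      # some data is unavailable
    if any(not s["assigned"] for s in shards):
        return "yellow"   # all data available, redundancy reduced
    return "green"        # every primary and replica is allocated


# A single-node cluster with one replica configured: the replica has nowhere to go.
single_node = [
    {"primary": True, "assigned": True},
    {"primary": False, "assigned": False},
]
```

The `single_node` case evaluates to yellow, which matches the familiar situation of a one-node cluster whose replicas cannot be placed.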

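When indexing, documents are first written to an in‑memory buffer and recorded in a transaction log (translog). A refresh (every second by default) turns the buffer into a new immutable segment that becomes searchable, but that segment is not yet safely persisted. A flush (triggered by translog size or elapsed time) commits the segments to disk and clears the translog; until then, the translog is what allows recent operations to be replayed after a crash.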
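The interplay of buffer, translog, refresh, and flush can be modeled with a toy class. This is a deliberately simplified sketch: `ToyShard` and its attributes are invented for illustration, everything lives in Python lists, and nothing is actually written to disk.

```python
class ToyShard:
    """Toy model of the write path: buffer -> segment (refresh) -> commit (flush)."""

    def __init__(self):
        self.buffer = []      # in-memory indexing buffer, not yet searchable
        self.translog = []    # operations kept for crash recovery
        self.segments = []    # immutable, searchable segments
        self.committed = 0    # number of segments covered by the last flush

    def index(self, doc):
        # Every write goes to both the buffer and the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Turn the buffer into a new immutable segment; the translog is untouched.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        # Make everything searchable, mark segments as committed, clear the translog.
        self.refresh()
        self.committed = len(self.segments)
        self.translog = []

    def search(self, predicate):
        # Only documents that already live in a segment are visible.
        return [doc for segment in self.segments for doc in segment if predicate(doc)]


shard = ToyShard()
shard.index({"title": "hello"})
visible_before_refresh = shard.search(lambda d: True)
shard.refresh()
visible_after_refresh = shard.search(lambda d: True)
```

The gap between `visible_before_refresh` and `visible_after_refresh` is the reason search in Elasticsearch is described as near real time rather than real time.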

Segments are immutable; deletions are recorded in a .del file, and updates are implemented as delete‑plus‑add. Background segment merging consolidates small segments, removes deleted documents, and reduces file‑handle and CPU overhead.
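A merge can be pictured as copying live documents from several segments into one and skipping anything marked as deleted. The sketch below assumes a simplified model: each segment is a dict of document id to source, and deletions are tracked as `(segment_number, doc_id)` pairs standing in for the per-segment .del file. Both the data shapes and the function name are hypothetical.

```python
def merge_segments(segments, deleted):
    """Merge segments into one, dropping documents marked as deleted.

    segments: list of dicts mapping doc_id -> source, oldest first
    deleted:  set of (segment_number, doc_id) pairs marking deleted copies
    """
    merged = {}
    for seg_no, segment in enumerate(segments):
        for doc_id, source in segment.items():
            if (seg_no, doc_id) in deleted:
                continue  # deleted copies are not carried into the new segment
            merged[doc_id] = source
    return merged


segments = [
    {"1": {"title": "old"}, "2": {"title": "keep"}},
    {"1": {"title": "new"}, "3": {"title": "gone"}},
]
# Updating doc "1" marked its old copy as deleted; doc "3" was deleted outright.
deleted = {(0, "1"), (1, "3")}
merged = merge_segments(segments, deleted)
```

After the merge there is a single segment holding only live documents, so the space taken by deleted copies is reclaimed and searches have fewer segments to visit.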

Performance tuning recommendations include using SSDs (preferably in RAID 0), avoiding remote mounts, configuring multiple path.data directories, optimizing shard and replica counts, adjusting index.refresh_interval, disabling unnecessary doc values, using keyword instead of text where appropriate, and employing scroll for deep pagination.

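JVM tuning advice: set -Xms and -Xmx to the same value so the heap is never resized at runtime; keep the heap at no more than 50% of physical RAM and below 32 GB, so that the JVM can continue to use compressed object pointers; consider G1GC over CMS; and leave the remaining memory to the filesystem cache, which Lucene relies on for fast search.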
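The heap-sizing rule can be expressed as a small calculation. This is a sketch under stated assumptions: the 31 GB ceiling is a conservative choice made here to stay under the 32 GB limit, the input is assumed to be whole gigabytes, and the helper names are hypothetical.

```python
def recommended_heap_gb(physical_ram_gb, ceiling_gb=31):
    """Half of physical RAM, capped below 32 GB, and at least 1 GB."""
    half = physical_ram_gb // 2
    return max(1, min(half, ceiling_gb))


def jvm_options(physical_ram_gb):
    """Return matching -Xms/-Xmx flags so the heap has one fixed size."""
    heap = recommended_heap_gb(physical_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]
```

On a 16 GB machine this yields an 8 GB heap, leaving the other half for the filesystem cache; on a 128 GB machine the cap applies rather than the 50% rule.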
