Mastering Elasticsearch: Core Concepts, Architecture, and Performance Tips

This comprehensive guide explains what Elasticsearch does, its underlying Lucene technology, core concepts such as clusters, shards, replicas, mapping, indexing and storage mechanisms, and provides practical performance‑tuning advice for building and operating a robust distributed search engine.

21CTO

Feb 11, 2023

Data in Everyday Life

Data can be divided into structured (row‑based, stored in relational databases) and unstructured (full‑text, documents, images, video, etc.). Correspondingly, searches are either structured‑data search or unstructured‑data (full‑text) search.

From Lucene to Elasticsearch

Lucene is an open‑source library that provides full‑text search based on an inverted index. Elasticsearch builds on Lucene, adding a RESTful API, distributed capabilities, and easy installation. Solr is another Lucene‑based engine; Elasticsearch was designed around clustering from the start.

Inverted Index Basics

An inverted index lists each unique term and the documents in which it appears. Example:

Term         Doc_1  Doc_2  Doc_3
--------------------------------
Java           X
is             X      X      X
the            X      X      X
best           X      X      X
programming    X      X      X
language       X      X      X
PHP                   X
Javascript                   X

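Key terms: Term (a unit produced by analyzing text), Term Dictionary (the set of all terms), Posting List (the documents that contain a given term), and Inverted File (the physical file that stores the posting lists).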
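As a minimal sketch of the idea, the following builds a term dictionary and posting lists for three small documents. The document texts are an assumption inferred from the table (each reads "<language> is the best programming language"), and the lowercase whitespace tokenizer is a stand-in for a real analyzer.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis: lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Assumed document texts, reconstructed from the example table.
docs = {
    "Doc_1": "Java is the best programming language",
    "Doc_2": "PHP is the best programming language",
    "Doc_3": "Javascript is the best programming language",
}

index = build_inverted_index(docs)
term_dictionary = sorted(index)  # all unique terms, sorted for fast lookup

print(term_dictionary)
print(index["java"])  # posting list for a term found in one document
print(index["best"])  # posting list for a term found in every document
```

A search for a term then becomes a dictionary lookup followed by reading its posting list, rather than a scan over every document.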

Elasticsearch Core Concepts

Elasticsearch is a distributed, near‑real‑time document store in which every field can be indexed and searched.

It scales to hundreds of nodes and petabytes of data.

Cluster

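A cluster consists of one or more nodes that share the same cluster.name. A node can act as a master‑eligible node, a data node, or a coordinating node. In the 6.x releases used in this article, Zen Discovery handles node discovery and master election; version 7.0 replaced it with a new cluster coordination subsystem.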

Discovery Mechanism

Zen Discovery uses unicast or file‑based discovery. The discovery.zen.ping.unicast.hosts setting lists seed hosts.

Node Roles

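Nodes can be master‑eligible (node.master: true) and/or data nodes (node.data: true). Running dedicated master nodes (node.master: true, node.data: false) keeps cluster management from competing with indexing and search load, which improves stability.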

Split‑Brain

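A network partition can leave each side of the cluster electing its own master, a condition known as split‑brain. Requiring a quorum (a majority) of master‑eligible nodes before a master can be elected, configured via discovery.zen.minimum_master_nodes, prevents the minority side from electing one.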
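The quorum is conventionally the smallest majority of master‑eligible nodes, floor(n / 2) + 1. The helper below is a small sketch of that arithmetic; the function name is hypothetical and not part of any Elasticsearch API.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Smallest majority of the master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


for n in (1, 2, 3, 5):
    print(f"{n} master-eligible node(s) -> quorum of {minimum_master_nodes(n)}")
```

With two master‑eligible nodes the quorum is also two, so losing either node blocks elections; this is why three master‑eligible nodes is the usual minimum for a fault‑tolerant cluster.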

Shards and Replicas

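An index is split horizontally into primary shards, and each primary can have one or more replica shards, which provide high availability and can serve read requests. The number of primary shards is fixed at index creation; the number of replicas can be changed at any time.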

Mapping

Mapping defines each field's type (e.g., text, keyword, date) and how it is analyzed. You can let Elasticsearch infer the mapping dynamically from the documents it receives, or declare it explicitly when creating the index.
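To make this concrete, here is a sketch of a request body for creating an index with explicit settings and mappings. The field names (title, status, published_at) and the shard and replica counts are illustrative assumptions. The body is only built and printed, not sent. Note that 7.x and later accept the typeless form, while 6.x normally expects the properties nested under a mapping type name such as _doc, which the optional argument covers.

```python
import json


def build_create_index_body(type_name=None):
    """Build a create-index body; nest the mapping under type_name for 6.x-style requests."""
    mapping = {
        "properties": {
            "title": {"type": "text"},         # analyzed for full-text search
            "status": {"type": "keyword"},     # stored as-is for exact matching
            "published_at": {"type": "date"},
        }
    }
    if type_name is not None:
        mapping = {type_name: mapping}
    return {
        "settings": {
            "number_of_shards": 3,    # primaries: fixed once the index exists
            "number_of_replicas": 1,  # replicas: can be changed later
        },
        "mappings": mapping,
    }


# Body for a request such as PUT /my-index (index name is hypothetical).
print(json.dumps(build_create_index_body(), indent=2))
```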

Basic Usage

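Download, unzip, and start Elasticsearch with bin/elasticsearch. The REST API listens on port 9200; a GET request to http://localhost:9200 returns basic node information such as the following (abridged):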

{
  "name": "node1",
  "cluster_name": "elasticsearch",
  "version": { "number": "6.8.1" },
  "tagline": "You Know, for Search"
}

Check cluster health via GET /_cluster/health, which returns green, yellow, or red.
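The sketch below shows one way to read that status from a script using only the standard library. It assumes a node reachable at http://localhost:9200 without authentication; the sample dictionary is an abridged, made‑up response so the interpretation step can run without a cluster.

```python
import json
from urllib.request import urlopen

STATUS_MEANING = {
    "green": "all primary and replica shards are allocated",
    "yellow": "all primaries are allocated, but at least one replica is not",
    "red": "at least one primary shard is not allocated",
}


def fetch_cluster_health(base_url="http://localhost:9200"):
    """Call GET /_cluster/health and return the decoded JSON body."""
    with urlopen(f"{base_url}/_cluster/health", timeout=5) as resp:
        return json.load(resp)


def describe_health(health):
    """Turn a health response into a one-line summary."""
    status = health["status"]
    return f"{status}: {STATUS_MEANING[status]}"


# Abridged sample response (invented values) so the example runs offline.
sample = {"cluster_name": "elasticsearch", "status": "yellow", "unassigned_shards": 5}
print(describe_health(sample))
```

A single‑node cluster with the default of one replica typically reports yellow, because a replica is never allocated to the same node as its primary.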

Write Path

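Documents are routed to a primary shard using shard = hash(routing) % number_of_primary_shards, where routing defaults to the document's _id. Because the result depends on the shard count, the number of primary shards cannot change after index creation. The coordinating node forwards the request to the node holding the primary shard, which indexes the document and then forwards the operation to its replicas.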
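The routing formula can be sketched in a few lines. This is an illustration only: zlib.crc32 is a stand‑in hash chosen because it is deterministic and in the standard library, whereas Elasticsearch uses its own Murmur3‑based hash, so real shard numbers will differ. The document ids are invented.

```python
import zlib


def route_to_shard(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards, with a stand-in hash."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# The same id always lands on the same shard for a given shard count...
before = {doc_id: route_to_shard(doc_id, 5) for doc_id in doc_ids}
# ...but changing the shard count changes the result for most ids.
after = {doc_id: route_to_shard(doc_id, 6) for doc_id in doc_ids}

moved = sum(1 for doc_id in doc_ids if before[doc_id] != after[doc_id])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

The second half shows why the primary shard count is locked: with a different modulus, lookups by id would be sent to shards that do not hold the document.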

Storage Mechanics

Data is stored on disk in immutable segments. A new document is first added to an in‑memory buffer and appended to the translog; a refresh (by default every second) turns the buffer into a new searchable segment, and a flush later persists the segments and writes a commit point.

Refresh and Flush

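Refresh writes the in‑memory buffer to a new segment in the file‑system cache, which makes the documents searchable without waiting for a disk sync. Flush syncs those segments to disk, writes a commit point, and clears the translog; it is triggered when the translog reaches 512 MB or 30 minutes have elapsed.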
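The difference between the two operations is easier to see in a toy model. The class below is a deliberately simplified sketch, not how Lucene or Elasticsearch is implemented: lists stand in for the buffer, translog, and segments, and all names are hypothetical.

```python
class ToyShard:
    """Toy model of a shard's write path: buffer -> cached segment -> committed segment."""

    def __init__(self):
        self.buffer = []              # in-memory buffer, not yet searchable
        self.translog = []            # operations recorded since the last flush
        self.cached_segments = []     # searchable, held in the file-system cache
        self.committed_segments = []  # searchable, synced to disk under a commit point

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        if self.buffer:
            # Segments are immutable, so freeze the buffer into a tuple.
            self.cached_segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        self.refresh()
        self.committed_segments.extend(self.cached_segments)
        self.cached_segments = []
        self.translog = []  # everything is now covered by the commit point

    def search(self):
        segments = self.committed_segments + self.cached_segments
        return [doc for segment in segments for doc in segment]


shard = ToyShard()
shard.index("doc-a")
print(shard.search())  # nothing yet: the document is only in the buffer
shard.refresh()
print(shard.search())  # searchable, while the translog still holds the operation
shard.flush()
print(len(shard.translog), len(shard.committed_segments))
```

Between a refresh and the next flush, the translog is what allows the operations to be replayed if the node crashes before the segments reach disk.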

Segment Merging

Because segments are immutable, a delete only marks the document as deleted. Background merges combine small segments into larger ones and drop the marked documents, reclaiming their space.
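As a rough sketch of what a merge accomplishes, the function below combines several small segments into one and leaves out documents marked as deleted. Dictionaries stand in for segments and the ids are invented; real merges operate on Lucene's on‑disk structures and follow a merge policy that this example ignores.

```python
def merge_segments(segments, deleted_ids):
    """Combine segments into one new segment, skipping documents marked as deleted."""
    merged = {}
    for segment in segments:
        for doc_id, doc in segment.items():
            if doc_id not in deleted_ids:
                merged[doc_id] = doc
    return merged


small_segments = [{1: "a", 2: "b"}, {3: "c"}, {4: "d"}]
merged = merge_segments(small_segments, deleted_ids={2})

print(f"{len(small_segments)} segments -> 1 segment with {len(merged)} live documents")
```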

Performance Optimizations

Use SSDs and avoid remote mounts.

Configure multiple path.data directories for striping.

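Rely on Lucene's FST (finite state transducer) compression of the term index, which keeps the lookup structure for the term dictionary small enough to hold in memory.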

During bulk indexing, raise index.refresh_interval (or disable refresh with -1) and set number_of_replicas to 0, then restore both once the load completes.

Prefer keyword over text when analysis isn’t needed.

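Supply a routing value on both index and search requests so that a query hits only the shard holding the relevant documents instead of every shard.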
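For the refresh_interval and number_of_replicas tip, the sketch below builds the two settings bodies one might send to the index settings endpoint before and after a bulk load. The helper names and the index name are hypothetical, and the restore defaults (1s, one replica) are assumptions that should match whatever the index used before.

```python
import json


def bulk_load_settings():
    """Settings to apply before a large bulk load: no refresh, no replicas."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(refresh_interval="1s", number_of_replicas=1):
    """Settings to apply once the bulk load has finished."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": number_of_replicas,
        }
    }


# Bodies for requests such as PUT /my-index/_settings (index name is hypothetical).
print(json.dumps(bulk_load_settings()))
print(json.dumps(restore_settings()))
```

Dropping replicas during the load means each document is indexed once rather than once per copy; the replicas are rebuilt from the primaries when the setting is restored.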

JVM Tuning

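Set Xms and Xmx to the same value, no more than 50% of RAM and below 32 GB so that the JVM can keep using compressed object pointers. Consider G1 GC, size the heap for Elasticsearch's own caches, and leave the rest of RAM to the file‑system cache that Lucene relies on.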
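The sizing rule reduces to a small calculation. The helper below is a sketch under stated assumptions: the function names are hypothetical, and the 31 GB ceiling is my own conservative choice for staying under the 32 GB limit mentioned above, not a value from the article.

```python
def recommended_heap_mb(total_ram_gb: float, ceiling_gb: float = 31.0) -> int:
    """Half of physical RAM, capped at a ceiling below 32 GB (assumed: 31 GB)."""
    return int(min(total_ram_gb / 2, ceiling_gb) * 1024)


def jvm_heap_options(total_ram_gb: float) -> list:
    """Return matching -Xms/-Xmx flags so the heap never resizes at runtime."""
    heap_mb = recommended_heap_mb(total_ram_gb)
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]


for ram_gb in (16, 64, 128):
    print(ram_gb, "GB RAM ->", " ".join(jvm_heap_options(ram_gb)))
```

On a 128 GB machine the cap applies, so most of the memory is left to the operating system rather than the heap.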
