
Unlocking Elasticsearch: Core Concepts, Architecture, and Performance Tips

This comprehensive guide explains Elasticsearch’s role in searching structured and unstructured data, covers Lucene’s inverted index, details cluster components, shard and replica mechanics, mapping types, installation steps, indexing workflow, storage strategies, and practical performance optimizations for real‑world deployments.

dbaplus Community

Jul 26, 2022


1. Data Types and Search Basics

Data in everyday life falls into two categories: structured data (rows stored in relational databases) and unstructured data (free‑form text, documents, images, videos, etc.). Correspondingly, search can be structured‑data search (using SQL or indexes) or full‑text search for unstructured data.

Full‑text search relies on building an inverted index that maps each term to the documents containing it, enabling fast lookup compared with sequential scanning.

2. Lucene and the Inverted Index

Apache Lucene is the open-source library that provides the core inverted-index functionality. An inverted index consists of a Term Dictionary (a sorted list of unique terms) and a Postings List (for each term, the documents, and optionally the positions, where it appears). The physical representation is an Inverted File stored on disk.

Term        | Doc_1 | Doc_2 | Doc_3
-------------------------------------
Java        |   X   |       |
is          |   X   |   X   |   X
the         |   X   |   X   |   X
best        |   X   |   X   |   X
programming |   X   |   X   |   X
language    |   X   |   X   |   X
PHP         |       |   X   |
Javascript  |       |       |   X
-------------------------------------
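As a rough illustration of the term table, the following Python sketch builds a term dictionary and postings lists for three sample sentences. The sentences, the whitespace tokenizer, and the name build_inverted_index are assumptions made for this example; Lucene's real analyzers and on-disk structures are considerably more elaborate.

```python
from collections import defaultdict

# Hypothetical sample documents that would produce the term table.
DOCS = {
    "Doc_1": "Java is the best programming language",
    "Doc_2": "PHP is the best programming language",
    "Doc_3": "Javascript is the best programming language",
}

def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Naive whitespace tokenization; real analyzers do much more.
        for position, term in enumerate(text.split()):
            index[term].append((doc_id, position))
    return dict(index)

index = build_inverted_index(DOCS)
term_dictionary = sorted(index)   # sorted list of unique terms
print(term_dictionary)
print(index["PHP"])               # postings list for a single term
```

Looking up a term is then a dictionary access followed by a walk over its postings list, instead of a scan over every document.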

3. Elasticsearch Core Concepts

Elasticsearch is a Java‑based distributed search engine built on top of Lucene. It provides a RESTful API that abstracts Lucene’s complexity.

It can be described as a distributed, real-time document store in which every field can be indexed and searched.

It is also a scalable search engine capable of handling petabytes of structured or unstructured data.

3.1 Cluster and Nodes

A cluster consists of one or more nodes sharing the same cluster.name. Each node runs an instance of Elasticsearch and can assume multiple roles:

Master-eligible node: can be elected master; the elected master maintains the cluster state and manages shard allocation.

Data node: stores shards and handles indexing, search, and aggregation.

Coordinating node: receives client requests, routes them to the appropriate shards, and merges the results (any node can act as a coordinator).

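In the 6.x line that this guide is based on, discovery is handled by the built-in Zen Discovery module, which uses a unicast host list (or a file-based list) to find other nodes. From 7.0 onward the discovery.zen.* settings were replaced by discovery.seed_hosts and cluster.initial_master_nodes. Example 6.x configuration: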

discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]

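To avoid split-brain scenarios in 6.x and earlier, set discovery.zen.minimum_master_nodes to a quorum of the master-eligible nodes, that is (master_eligible_nodes / 2) + 1 with integer division, so that a majority of master-eligible nodes must be reachable before a master is elected. With three master-eligible nodes the value is 2.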
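The quorum rule can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name minimum_master_nodes is chosen here to mirror the setting and is not an Elasticsearch API.

```python
def minimum_master_nodes(master_eligible_nodes: int) -> int:
    """Quorum size: floor(n / 2) + 1 (hypothetical helper, not an ES API)."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1

for n in (1, 2, 3, 5):
    print(f"{n} master-eligible node(s) -> quorum {minimum_master_nodes(n)}")
```

Note that two master-eligible nodes give a quorum of 2, so losing either node blocks master election. That is one reason an odd number of master-eligible nodes, typically three, is the usual choice.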

3.2 Node Roles

node.master: true   # master-eligible
node.data: true     # data node

Separating master‑eligible nodes from heavy data nodes improves cluster stability.

3.3 Split‑Brain Prevention

Increase discovery.zen.ping_timeout (default 3s, e.g., to 6s) so that a slow but healthy master is less likely to be declared dead by mistake.

Configure discovery.zen.minimum_master_nodes appropriately.

Separate master and data roles.

4. Shards and Replicas

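Elasticsearch splits an index into primary shards (default 5 in 6.x and earlier, 1 from 7.0 onward) and optional replica shards for high availability. The number of primary shards is fixed at index creation and cannot be changed in place afterwards (changing it requires reindexing or the shrink/split APIs); the number of replicas can be changed at any time.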

PUT /my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}

Writes are directed to the primary shard, then replicated to its replicas. Replicas are never placed on the same node as their primary.
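The placement rule can be illustrated with a toy allocator. This is a simplified sketch and not Elasticsearch's actual allocation algorithm: the function allocate, the round-robin strategy, and the node names are all assumptions made for the example.

```python
def allocate(number_of_shards, number_of_replicas, nodes):
    """Toy allocator: copies of the same shard never share a node."""
    assigned, unassigned = [], []
    for shard in range(number_of_shards):
        used = set()  # nodes already holding a copy of this shard
        for copy in range(1 + number_of_replicas):
            role = "primary" if copy == 0 else "replica"
            # Rotate the node list so shards spread across the cluster.
            start = (shard + copy) % len(nodes)
            candidates = nodes[start:] + nodes[:start]
            node = next((n for n in candidates if n not in used), None)
            if node is None:
                unassigned.append((shard, role))
            else:
                used.add(node)
                assigned.append((shard, role, node))
    return assigned, unassigned

single = allocate(5, 1, ["node-1"])
double = allocate(5, 1, ["node-1", "node-2"])
print(len(single[0]), len(single[1]))  # assigned vs unassigned on one node
print(len(double[0]), len(double[1]))  # assigned vs unassigned on two nodes
```

On a single node every replica stays unassigned because the only available node already holds the primary; adding a second node lets all ten shard copies be placed.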

5. Mapping and Field Types

Mapping defines how each field is stored, analyzed, and indexed—similar to a database schema. Elasticsearch supports dynamic mapping (automatic) and explicit mapping (user‑defined).

Common field types (ES 6.8):

text: full-text searchable, analyzed.

keyword: exact-match, suitable for filtering, sorting, and aggregations.

integer, date, etc.

PUT my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": { "type": "text" },
        "name":  { "type": "text" },
        "age":   { "type": "integer" },
        "created": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    }
  }
}
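The practical difference between text and keyword fields can be sketched in Python. The analyze_text function below is a toy stand-in for an analyzer (lowercasing and splitting on non-alphanumeric characters); it is an assumption made for illustration and does not reproduce Elasticsearch's standard analyzer.

```python
import re

def analyze_text(value: str) -> list:
    """Toy stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]

def index_keyword(value: str) -> list:
    """A keyword field keeps the whole value as one unmodified term."""
    return [value]

title = "Quick Brown Fox"
print(analyze_text(title))    # several lowercase terms
print(index_keyword(title))   # a single exact term
```

A search for the term brown would match the analyzed text terms but not the keyword term, which only matches the complete original value.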

6. Basic Usage

Download and extract Elasticsearch; no installation wizard is required.

Directory layout:

bin: executables (e.g., bin/elasticsearch)

config: configuration files

data: stored indices

logs: log files

plugins: optional extensions

Start a node with bin/elasticsearch. The default HTTP port is 9200; querying http://localhost:9200 returns node and cluster information such as the name, UUID, version, and tagline.

{
  "name" : "U7fp3O9",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "-Rj8jGQvRIelGd9ckicUOA",
  "version" : {
    "number" : "6.8.1",
    "build_flavor" : "default",
    "build_type" : "zip",
    "build_hash" : "1fad4e1",
    "build_date" : "2019-06-18T13:16:52.517138Z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Check cluster health via:

GET /_cluster/health

{
  "cluster_name" : "wujiajian",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 9,
  "active_shards" : 9,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 64.28571428571429
}

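Health colors: green (all primary and replica shards active), yellow (all primaries active, at least one replica unassigned), red (one or more primaries not active). A yellow status is expected on a single-node cluster, where replicas have no other node to be placed on.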
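The color rules can be written down as a small function. This is a simplified sketch: the function names are hypothetical, initializing and relocating shards are ignored, and the example call assumes that the five unassigned shards in the sample response are all replicas.

```python
def cluster_status(active_primaries, total_primaries,
                   active_replicas, total_replicas):
    """Return the health color from counts of active and configured shards."""
    if active_primaries < total_primaries:
        return "red"
    if active_replicas < total_replicas:
        return "yellow"
    return "green"

def active_shards_percent(active_shards, unassigned_shards):
    """Share of shard copies that are active; ignores initializing shards."""
    total = active_shards + unassigned_shards
    return 100.0 * active_shards / total if total else 100.0

# Figures from the sample response: 9 active primaries, 5 unassigned shards.
print(cluster_status(9, 9, 0, 5))              # yellow
print(round(active_shards_percent(9, 5), 2))   # 64.29
```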

7. Indexing Mechanics

7.1 Write Path

When a document is indexed, the routing value (default _id) is hashed and modulo‑ed by the number of primary shards to determine the target shard:

shard = hash(routing) % number_of_primary_shards
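A minimal Python sketch of this formula follows. Elasticsearch uses its own Murmur3-based hash; zlib.crc32 is used here purely as a stand-in, and the function name pick_shard and the document IDs are made up for the example.

```python
import zlib

def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards (CRC32 as stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

ids = [f"doc-{i}" for i in range(1000)]

# The same routing value always lands on the same shard.
print(pick_shard("doc-1", 5))

# Changing the shard count changes the result for many documents.
moved = sum(1 for i in ids if pick_shard(i, 5) != pick_shard(i, 6))
print(f"{moved} of {len(ids)} documents would map to a different shard with 6 primaries")
```

The second part shows why the number of primary shards is fixed: if the divisor changed, existing documents could no longer be found on the shard the formula points to.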

The coordinating node calculates the shard and forwards the request to the node holding the primary shard. The primary writes the document to the in-memory buffer and the transaction log, then forwards the operation to its replicas. Only after all in-sync replicas acknowledge does the primary report success.

7.2 Storage Model

Elasticsearch stores indexed data in immutable segments. A segment is a self-contained inverted index on disk. New documents are written to new segments. Because segments are never modified, a deletion only marks the document as deleted in a .del file, and an update marks the old version as deleted and indexes the new version into a new segment.

Segments are periodically committed (fsynced to disk) and merged into larger segments; merging is the point at which deleted documents are physically discarded.

7.3 Transaction Log (Translog)

Every write is also appended to the translog, so that operations that have not yet been committed to disk can be replayed after a crash. When a refresh occurs (by default every second), the in-memory buffer is written to a new segment and made searchable, but it is not yet fsynced. When the translog reaches 512 MB or 30 minutes have passed, a flush writes pending data to a segment, fsyncs, creates a commit point, and clears the translog.
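The interplay of buffer, translog, refresh, and flush can be modeled with a toy class. This is a conceptual sketch only: ToyShard and its methods are hypothetical, nothing is written to disk, and the commit point is reduced to a counter.

```python
class ToyShard:
    """Toy model of buffer, translog, and segments; not Elasticsearch's real code."""

    def __init__(self):
        self.buffer = []     # in-memory documents, not yet searchable
        self.translog = []   # operations kept for crash recovery
        self.segments = []   # searchable segments (lists of documents)
        self.committed = 0   # segments covered by the last commit point

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Turn the buffer into a new searchable segment; the translog is kept.
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Refresh, record a commit point (stands in for fsync), clear the translog.
        self.refresh()
        self.committed = len(self.segments)
        self.translog.clear()

    def search(self):
        return [doc for segment in self.segments for doc in segment]

shard = ToyShard()
shard.index({"id": 1})
print(shard.search())        # nothing visible before a refresh
shard.refresh()
print(shard.search())        # document is searchable, translog still holds it
shard.flush()
print(len(shard.translog))   # translog is empty after the flush
```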

7.4 Refresh and Flush

POST /_refresh          // refresh all indices
POST /my_index/_refresh // refresh a specific index

Refresh is lightweight but still incurs I/O; avoid manual refresh on every document in production.

8. Segment Merging

Background merge threads combine small segments into larger ones, reclaiming space from deleted documents. Merges are resource‑controlled to avoid starving search threads.
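The effect of a merge can be sketched as follows. This is an illustration under simplifying assumptions: segments are plain lists, delete markers are modeled as a set of document IDs, and merge_segments is a hypothetical name, not a Lucene or Elasticsearch function.

```python
def merge_segments(segments, deleted_ids):
    """Combine segments into one, skipping documents marked as deleted."""
    merged = []
    for segment in segments:
        for doc in segment:
            if doc["id"] not in deleted_ids:
                merged.append(doc)
    return merged

segments = [
    [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    [{"id": 3, "v": "c"}],
]
deleted_ids = {2}   # document 2 was marked deleted after it was indexed

merged = merge_segments(segments, deleted_ids)
print(merged)
```

Only after the merge does the deleted document stop occupying space; until then it is merely filtered out of search results.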

9. Performance Optimizations

9.1 Storage

Use SSDs, and consider RAID 0 for maximum I/O throughput. RAID 0 offers no redundancy, so rely on replica shards for fault tolerance.

Avoid remote mounts (NFS/SMB) and be cautious with cloud block storage (e.g., EBS).

9.2 Index Internals

Prefer sequential IDs over random UUIDs to improve term dictionary compression.

Disable doc_values on fields that never need sorting or aggregations.

Use keyword instead of text for exact‑match fields.

Adjust index.refresh_interval (e.g., 30s or -1 during bulk loads) and temporarily set number_of_replicas to 0 for faster indexing.

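Prefer scroll (or search_after) for deep pagination: with from + size, every shard must build a priority queue of from + size hits, and the coordinating node must merge all of them.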

Limit mapped fields to only those required for search, aggregation, or sorting.

Specify routing values when possible to target specific shards.
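To see why deep pagination with from + size is costly, the work at the coordinating node can be estimated with simple arithmetic. The function below is a back-of-the-envelope sketch with a hypothetical name; it counts candidate hits only and ignores everything else a real query does.

```python
def docs_sorted_at_coordinator(from_, size, number_of_shards):
    """Each shard returns its top (from + size) hits; the coordinator merges them all."""
    return (from_ + size) * number_of_shards

print(docs_sorted_at_coordinator(0, 10, 5))       # first page: 50
print(docs_sorted_at_coordinator(10000, 10, 5))   # deep page: 50050
```

Both requests return ten hits to the client, but the deep page forces the cluster to collect and sort about a thousand times more candidates.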

9.3 JVM Tuning

Set identical -Xms and -Xmx (typically ≤ 50 % of physical RAM, and below about 32 GB so the JVM can keep using compressed object pointers).

Consider G1GC instead of CMS, the default collector in the 6.x releases.

Ensure sufficient free RAM for the operating system’s filesystem cache.
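The heap sizing rule of thumb can be expressed as a small calculation. This is a sketch, not official guidance: recommended_heap_gb is a hypothetical helper, and the 31 GB ceiling is an assumed conservative value that stays below the 32 GB limit mentioned above.

```python
def recommended_heap_gb(physical_ram_gb: float, ceiling_gb: float = 31.0) -> float:
    """Half of physical RAM, capped just below 32 GB (ceiling_gb is an assumption)."""
    return min(physical_ram_gb / 2, ceiling_gb)

heap = recommended_heap_gb(64)
# Identical minimum and maximum heap, as recommended above.
print(f"-Xms{int(heap)}g -Xmx{int(heap)}g")
```

Whatever the heap does not use remains available to the operating system's filesystem cache, which Lucene relies on heavily for fast segment access.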

By applying these strategies, Elasticsearch can achieve high throughput, low latency, and robust fault tolerance.
