Mastering Elasticsearch: Core Concepts and Indexing Workflow Explained

This article introduces Elasticsearch's core concepts (clusters, node roles, documents, mappings, and shards) and walks through the indexing workflow from client request to replica synchronization, covering key settings, the routing calculation, and the role of refresh and flush operations.

MaGe Linux Operations

Dec 28, 2020

What is Elasticsearch?

Elasticsearch is a near‑real‑time distributed search engine built on top of the open‑source Lucene library. It wraps Lucene and exposes a RESTful API, enabling fast full‑text search, log and metric analysis, and machine‑learning capabilities. Together with Logstash and Kibana it forms the ELK stack.

Basic Concepts

1. ES Cluster

Elasticsearch is a distributed system designed for high availability and horizontal scalability. When replica shards are configured, an individual node can fail without interrupting the service, and capacity grows by adding nodes. Clusters are distinguished by name (the cluster.name setting); the default name is "elasticsearch".

2. ES Node

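A node is a single Elasticsearch instance (a Java process). A node can take one or more roles (master, data, ingest, machine learning). In production, dedicating each node to a single role is generally recommended, with the number of nodes per role sized by data volume and query load. The settings listed here are the per-role flags; from version 7.9 onward the same roles can also be declared through the node.roles list, which supersedes the individual flags.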

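Master Node – manages the cluster state, creates and deletes indices, and allocates shards. Setting: node.master: true.

Data Node – stores shards and handles CRUD, search, and aggregation operations, which are I/O-, memory-, and CPU-intensive. Setting: node.data: true.

Ingest Node – preprocesses documents through ingest pipelines before they are indexed; the ingest role is enabled on every node by default. Setting: node.ingest: true.

Coordinating Node – receives client requests, routes them to the relevant shards, and merges the responses. Every node performs this function; a node with all role flags set to false acts as a dedicated coordinating-only node, and there is no separate setting for it.

Machine-Learning Node – runs ML jobs and serves ML API requests. Requires X-Pack machine learning to be enabled; setting: node.ml: true.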
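As a rough illustration of how the role flags above combine, the following sketch models them in plain Java. The NodeRoleSketch class and its methods are hypothetical and are not part of any Elasticsearch API; the sketch only encodes the rule that a node with every role flag disabled serves as a coordinating-only node.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the role flags described above; not an Elasticsearch API.
final class NodeRoleSketch {
    enum Role { MASTER, DATA, INGEST, ML }

    // Collects the roles enabled by the node.master / node.data / node.ingest / node.ml flags.
    static Set<Role> rolesOf(boolean master, boolean data, boolean ingest, boolean ml) {
        Set<Role> roles = EnumSet.noneOf(Role.class);
        if (master) roles.add(Role.MASTER);
        if (data) roles.add(Role.DATA);
        if (ingest) roles.add(Role.INGEST);
        if (ml) roles.add(Role.ML);
        return roles;
    }

    // A node with no role flags enabled only routes requests and merges results.
    static boolean isCoordinatingOnly(Set<Role> roles) {
        return roles.isEmpty();
    }
}
```

For example, rolesOf(false, false, false, false) yields an empty set, which isCoordinatingOnly reports as true.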

3. ES Document

A document is the basic unit of data that can be indexed in Elasticsearch. It is stored as JSON and is roughly analogous to a row in a relational database. A document is self-contained, can be nested hierarchically, and does not require a predefined schema, because unknown fields can be mapped dynamically.

4. ES Type

Types were logical containers for documents, analogous to tables. Since 6.x an index can hold only one type, and in 7.x types are deprecated, with _doc as the default type name.

5. Mapping

Mapping defines field names, data types, and per-field index options, similar to a database schema. Dynamic mapping adds new fields automatically, but it is often restricted or disabled in production, because every new field enlarges the mapping held in the cluster state.
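To make the idea of an explicit mapping concrete, here is a minimal sketch that assembles a create-index request body whose mapping sets "dynamic" to "strict", so documents with unmapped fields are rejected. The StrictMappingSketch class and the field names title and published_at are assumptions for illustration only, and the builder does no JSON escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper; builds the JSON by hand and does not escape special characters.
final class StrictMappingSketch {
    // Builds a create-index body whose mapping rejects fields that are not declared.
    static String createIndexBody(Map<String, String> fieldTypes) {
        StringJoiner props = new StringJoiner(",");
        for (Map.Entry<String, String> e : fieldTypes.entrySet()) {
            props.add("\"" + e.getKey() + "\":{\"type\":\"" + e.getValue() + "\"}");
        }
        return "{\"mappings\":{\"dynamic\":\"strict\",\"properties\":{" + props + "}}}";
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "text");         // illustrative field
        fields.put("published_at", "date");  // illustrative field
        System.out.println(createIndexBody(fields));
    }
}
```

The resulting string would be sent as the body of a PUT request to the index name when the index is created.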

6. Index

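An index is a logical collection of documents that share a mapping, comparable to a database in the relational analogy. Physically, its data is spread across one or more shards.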

7. Shard

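A shard is a single Lucene index that holds the inverted index for its share of the documents. An index consists of one or more primary shards and optional replica shards. The number of primary shards (index.number_of_shards) is fixed at index creation, whereas the replica count (index.number_of_replicas) can be changed later; replicas provide high availability and improve read throughput.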
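The relationship between primaries and replicas can be sketched with a little arithmetic: each primary has number_of_replicas copies, so the cluster must place primaries × (1 + replicas) shard copies for the index. The ShardCountSketch class is hypothetical; only the two setting names come from Elasticsearch.

```java
// Hypothetical helper for reasoning about shard counts; not an Elasticsearch API.
final class ShardCountSketch {
    // Total shard copies (primaries plus replicas) the cluster allocates for one index.
    static int totalShardCopies(int numberOfShards, int numberOfReplicas) {
        if (numberOfShards < 1 || numberOfReplicas < 0) {
            throw new IllegalArgumentException("need at least 1 primary and 0 or more replicas");
        }
        return numberOfShards * (1 + numberOfReplicas);
    }

    // Index settings body carrying the two values.
    static String settingsBody(int numberOfShards, int numberOfReplicas) {
        return "{\"settings\":{\"number_of_shards\":" + numberOfShards
                + ",\"number_of_replicas\":" + numberOfReplicas + "}}";
    }
}
```

With 3 primaries and 1 replica, totalShardCopies returns 6.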

Indexing Process

The following steps illustrate how a document is indexed in Elasticsearch.

Client sends request – e.g., using the Java High Level REST client.

Parameter validation – the request is checked for validity.

Pre‑processing – if a pipeline is specified, an ingest node processes the data.

Index existence check – if the target index does not exist, Elasticsearch decides whether it may be created automatically (action.auto_create_index) and how dynamic mapping applies.

Create index – when the index has to be created, the master node creates it and publishes the updated cluster state.

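Routing calculation – the target shard is computed as shard_num = hash(_routing) % num_primary_shards, where _routing defaults to the document _id. When index.routing_partition_size is set, the formula becomes shard_num = (hash(_routing) + hash(_id) % routing_partition_size) % num_primary_shards.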

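Primary shard writes document – the request is forwarded to the node holding the primary shard, which adds the document to the in-memory index buffer and appends the operation to the transaction log (translog). The buffer becomes a searchable segment at the next refresh.

Refresh and flush – a refresh, which runs every second by default, turns the index buffer into a new segment and makes the document searchable; a flush commits the segments to disk and truncates the transaction log.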

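Replica shard writes – the primary forwards the operation to its replicas, and each replica performs the same write steps as the primary.

Response – once the primary and its replicas have acknowledged the write, the coordinating node returns the result to the client.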
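The routing step can be sketched in a few lines of Java. This is only an approximation under a loud assumption: Elasticsearch hashes the routing value with its own Murmur3-based function, whereas the sketch uses String.hashCode as a stand-in, so the shard numbers it produces will not match a real cluster. The RoutingSketch class and its method names are hypothetical.

```java
// Hypothetical sketch of the routing formulas; String.hashCode stands in for Elasticsearch's hash.
final class RoutingSketch {
    // Stand-in hash function (assumption): NOT the hash Elasticsearch actually uses.
    static int hash(String value) {
        return value.hashCode();
    }

    // shard_num = hash(_routing) % num_primary_shards
    static int shardFor(String routing, int numPrimaryShards) {
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(hash(routing), numPrimaryShards);
    }

    // shard_num = (hash(_routing) + hash(_id) % routing_partition_size) % num_primary_shards
    static int shardFor(String routing, String id, int partitionSize, int numPrimaryShards) {
        int offset = Math.floorMod(hash(id), partitionSize);
        return Math.floorMod(hash(routing) + offset, numPrimaryShards);
    }
}
```

Because the result depends on num_primary_shards, changing the primary shard count would send existing routing values to different shards, which is one way to see why that count is fixed when the index is created.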

Once the indexing steps complete and the next refresh has run, the document can be found through Elasticsearch's search API.
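The interplay of the index buffer, the transaction log, refresh, and flush can be modelled with a toy in-memory class. This is a deliberately simplified sketch, not how Lucene or Elasticsearch are implemented: the ShardWritePathSketch class, its lists, and its method names are all hypothetical, and "committing to disk" is represented by a counter.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one shard's write path; every name here is hypothetical.
final class ShardWritePathSketch {
    private final List<String> indexBuffer = new ArrayList<>(); // written but not yet searchable
    private final List<String> translog = new ArrayList<>();    // operations since the last flush
    private final List<String> searchable = new ArrayList<>();  // documents in refreshed segments
    private int committedDocs = 0;                              // documents covered by the last flush

    // Indexing adds the document to the buffer and records it in the translog.
    void index(String doc) {
        indexBuffer.add(doc);
        translog.add(doc);
    }

    // Refresh: the buffer becomes a searchable segment; the translog is kept.
    void refresh() {
        searchable.addAll(indexBuffer);
        indexBuffer.clear();
    }

    // Flush: refresh, commit the segments, then truncate the translog.
    void flush() {
        refresh();
        committedDocs = searchable.size();
        translog.clear();
    }

    boolean isSearchable(String doc) { return searchable.contains(doc); }
    int translogSize() { return translog.size(); }
    int committedDocs() { return committedDocs; }
}
```

In this model a document indexed but not yet refreshed is absent from search results while still present in the translog, which mirrors why Elasticsearch is described as near-real-time.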

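For the first step, the low-level RestClient, which the High Level REST client wraps, can be built as follows:

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

// Low-level REST client that spreads requests across the listed hosts.
RestClient restClient = RestClient.builder(
        new HttpHost("127.0.0.1", 9200, "http"),
        new HttpHost("127.0.0.2", 9200, "http")
    ).build();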
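As a dependency-free alternative, the index request itself can be sketched with the JDK's java.net.http API (Java 11 or later). The sketch only builds the request and does not send it. The IndexRequestSketch class, the articles index name, and the sample values are assumptions for illustration; the /_doc/ path and the routing query parameter follow the Elasticsearch index API.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds, but does not send, an index request.
final class IndexRequestSketch {
    // PUT {baseUrl}/{index}/_doc/{id}[?routing=...] with a JSON body.
    static HttpRequest indexRequest(String baseUrl, String index, String id,
                                    String routing, String json) {
        String url = baseUrl + "/" + index + "/_doc/" + id;
        if (routing != null) {
            // A custom routing value replaces the default of routing by _id.
            url += "?routing=" + URLEncoder.encode(routing, StandardCharsets.UTF_8);
        }
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }
}
```

Against a reachable cluster, the request could then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).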
