Databases · 28 min read

Mastering Elasticsearch: Core Concepts, Indexing, and Performance Tips

This article explains Elasticsearch fundamentals—including clusters, nodes, shards, replicas, segments, routing, indexing, updates, deletions, search flow, pagination methods, and visualization tools—while offering practical optimization recommendations for high‑performance and scalable deployments.


1. Basic Concepts

1.1 Terminology

Cluster: a group of one or more Elasticsearch nodes.

Node: a service unit of an Elasticsearch cluster; each node can host one or more shards, and node names must be unique within the cluster.

Shard: a horizontal slice of an index, used when the data volume is too large for a single node. Each shard is a self-contained Lucene index; shards come in two kinds, primary and replica.

When writing data to a multi-shard index, routing determines the target shard; the number of primary shards is fixed at index creation and cannot be changed afterwards (short of reindexing, or the shrink/split APIs).

Replica: a copy of a primary shard. Replicas provide high availability and spread read load, but increase write overhead.

Segment

Elasticsearch stores index data in immutable Lucene segments. New segments are created during a refresh, and deleted documents are removed during segment merging.

Translog

The transaction log records operations that have not yet been persisted in a Lucene commit. By default it is fsynced on every request; with asynchronous durability it is synced periodically instead (default every 5 seconds). During recovery, Elasticsearch replays the translog to avoid data loss.
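A minimal settings sketch that trades durability for write throughput (the index name my-index is illustrative):

```json
PUT /my-index/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5s"
}
```

With async durability, a crash can lose up to sync_interval worth of acknowledged writes, so this is only appropriate where some data loss is tolerable.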

Index

An index is uniquely identified by its name and consists of one or more shards.

Type

A logical partition inside an index (deprecated in 6.x and removed in 8.0).

Document

A document is a single record in an index, identified by _id.

Settings

Defines default number of shards, replicas, refresh interval, routing allocation, etc.

Mapping

Defines field types, analyzers, and storage options. Dynamic mapping is enabled by default but can be overridden with explicit mappings.

1.2 Node Roles

Master: elected from among the master-eligible nodes (only one active master at a time); maintains cluster state and decides primary and replica shard allocation.

Data: stores data, maintains inverted indexes, and handles search and aggregation; requires sufficient CPU, memory, and I/O resources.

Client (Coordinating): receives user requests, routes them to the appropriate data nodes, and merges the responses.

1.3 Cluster Health

Health can be green (all shards allocated), yellow (all primary shards allocated but some replicas missing), or red (one or more primary shards missing).
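Health can be checked with a single request; the status field carries the green/yellow/red value (the response values below are illustrative):

```json
GET /_cluster/health

{
  "cluster_name": "my-cluster",
  "status": "yellow",
  "number_of_nodes": 1,
  "active_primary_shards": 5,
  "unassigned_shards": 5
}
```

A single-node cluster with replicas configured is typically yellow, since replicas cannot be allocated on the same node as their primary.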

2. Basic Operations

2.1 Create Index

Indices can be created via the REST API: PUT /_template/... for template-based creation (legacy templates have been superseded by composable index templates, PUT /_index_template/..., since 7.8) or PUT /index_name for manual creation. A creation request specifies settings (number_of_shards, number_of_replicas, refresh_interval, routing allocation) and mappings (field types such as date, keyword, double, byte).
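A minimal sketch of a manual index creation request combining settings and mappings (the index and field names are illustrative):

```json
PUT /orders-2024
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval": "30s"
  },
  "mappings": {
    "properties": {
      "order_id":   { "type": "keyword" },
      "amount":     { "type": "double" },
      "status":     { "type": "byte" },
      "created_at": { "type": "date" }
    }
  }
}
```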

2.1.3 Mapping Recommendations

Avoid dynamic mapping; specify field types explicitly.

Use keyword for non‑analyzed strings and text for analyzed strings.

Limit total fields (default 1000, recommended ≤100).

Disable norms for fields that do not need scoring.

Disable doc_values for fields not used in sorting or aggregations.

Avoid nested and parent/child relationships unless required.
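Several of the recommendations above can be combined in one explicit mapping (index and field names are illustrative): dynamic mapping is disabled with "strict", norms are turned off on the text field because it is not used for relevance scoring, and doc_values are turned off on a keyword field that is only used for exact lookups, never for sorting or aggregation.

```json
PUT /events
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "message":  { "type": "text", "norms": false },
      "trace_id": { "type": "keyword", "doc_values": false },
      "level":    { "type": "keyword" }
    }
  }
}
```

With "dynamic": "strict", indexing a document containing an unmapped field is rejected rather than silently widening the mapping.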

2.2 Write Path

Document routing formula: shard = hash(routing) % number_of_primary_shards. The routing value defaults to _id but can be customized.

The client sends a request to a coordinating node, which routes it to the primary shard. The primary writes the document to its in-memory buffer and the translog, then replicates the operation to the replica shards. A refresh (default every 1 s) creates a new immutable segment that becomes searchable. A flush (triggered when the translog reaches 512 MB or after 30 minutes) writes segments to disk, performs an fsync, creates a commit point, and clears the translog.
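The routing formula can be sketched in Python. Elasticsearch actually hashes the routing value with murmur3; zlib.crc32 below is only a stand-in to illustrate the modulo step:

```python
import zlib


def route_to_shard(routing_value: str, number_of_primary_shards: int) -> int:
    """Pick a shard the way Elasticsearch does: hash(routing) % primaries.

    (Stand-in hash: crc32 instead of Elasticsearch's murmur3.)
    """
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards


# The routing value defaults to _id; the same value always maps to the same
# shard, which is also why the primary shard count cannot change after index
# creation without invalidating every previously computed route.
print(route_to_shard("doc-42", 5) == route_to_shard("doc-42", 5))  # True
```

Because the shard count appears in the modulo, changing number_of_primary_shards would send existing IDs to different shards, so resharding requires reindexing.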

2.3 Update & Delete

Delete: creates a .del file entry marking the document as deleted; the document is physically removed only during the next segment merge.

Update: implemented as delete + add. A new version of the document is indexed, the old version is marked deleted in .del, and the new version becomes searchable. The old version remains on disk until a segment merge occurs.
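A partial update, for example, is issued against the document's _id; internally Elasticsearch indexes a full new version of the document rather than modifying the immutable segment in place (index, ID, and field names are illustrative):

```json
POST /orders-2024/_update/42
{
  "doc": { "status": 2 }
}
```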

2.4 Search

2.4.1 Search by _id

Client queries any node; the coordinating node determines the shard containing the document (using the hash of _id) and forwards the request to a node that holds the primary or a replica shard. The node returns the document to the coordinating node, which then returns it to the client.

2.4.2 Regular Query Flow

1. The client sends a request to any node, which becomes the coordinating node.
2. The coordinating node parses the query, determines the relevant primary and replica shards, and forwards the request to each shard.
3. Each shard returns matching document IDs, sort values, and other metadata.
4. The coordinating node merges, sorts, and paginates the results, extracting the final list of document IDs.
5. The coordinating node fetches the full source of those documents from the appropriate shards.
6. The complete documents are returned to the client.
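A typical request driving this query-then-fetch flow (index and field names are illustrative); the sort values are what each shard returns in step 3, and from/size is applied during the merge in step 4:

```json
GET /orders-2024/_search
{
  "query": { "match": { "status_text": "shipped" } },
  "sort": [ { "created_at": "desc" } ],
  "from": 0,
  "size": 20
}
```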

2.4.3 Optimization Tips

Prefer filter clauses over query when only a boolean match is needed; filters are cacheable and do not compute relevance scores.

Increase index.refresh_interval to reduce the number of segments and lower I/O costs.

Use the bulk API to batch index, update, or delete operations, which reduces network overhead.

Use auto‑generated IDs to avoid the expensive existence check performed when a custom ID is supplied.

Adjust index.translog.sync_interval (default 5 s, minimum 100 ms) to balance durability and performance.

Avoid large documents that strain network, memory, and disk.

Explicitly define mappings (strict) to ensure optimal field types, storage, and performance.

Limit the number of replicas to the required level; more replicas improve availability and read parallelism but consume additional storage and CPU.

Retrieve only needed fields using stored_fields instead of fetching the entire source.

Avoid wildcard queries when possible; they are costly.

Enable query cache for frequently used filter contexts, while monitoring memory usage.
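The first tip looks like this in practice: clauses placed under filter in a bool query skip relevance scoring and are eligible for caching (index and field names are illustrative):

```json
GET /orders-2024/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "status": 1 } },
        { "range": { "created_at": { "gte": "now-7d/d" } } }
      ]
    }
  }
}
```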

2.4.4 Pagination

From + Size: simple offset-based pagination; limited by max_result_window (default 10 000). Deep pagination becomes slower.

search_after: uses the sort values from the previous page to retrieve the next page; not limited by max_result_window for total pages, but each request cannot exceed the window size. Supports only forward navigation.

Scroll: provides a snapshot-like cursor for deep traversal of large result sets. Each scroll request is limited by max_result_window and keeps a scroll context in heap memory, which must be cleared when no longer needed.
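A search_after sketch: the follow-up request passes the sort values of the last hit from the previous page, and a unique tie-breaker field (here order_id) keeps the ordering deterministic (index, field names, and values are illustrative):

```json
GET /orders-2024/_search
{
  "size": 20,
  "sort": [ { "created_at": "desc" }, { "order_id": "asc" } ],
  "search_after": [ "2024-05-01T00:00:00Z", "order-1017" ]
}
```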

3. Visualization Tools

elasticsearch-head: a web UI for cluster inspection (available on GitHub).

Kibana: the official Elasticsearch UI for data exploration and visualization.

Tags: Elasticsearch, Cluster, Pagination, Search
Written by

Architect's Alchemy Furnace

A comprehensive platform that combines Java development and architecture design, guaranteeing 100% original content. We explore the essence and philosophy of architecture and provide professional technical articles for aspiring architects.
