
Master Elasticsearch: Core Concepts, Architecture, and Performance Tips

This comprehensive guide explains Elasticsearch’s role in handling massive data, covering its architecture, core concepts like shards and replicas, indexing and search processes, performance considerations, common pitfalls, and practical interview insights for developers and engineers.

Intelligent Backend & Architecture

Elasticsearch Overview

Elasticsearch is an open‑source distributed search and analytics engine built on Apache Lucene. It is designed for real‑time full‑text search on large volumes of data, providing a simple RESTful JSON API.

Core Concepts

Data is stored in indices, each of which is divided into primary shards and optional replica shards. Shards are the basic units of storage and parallelism; replicas provide high availability and load‑balancing.

Each document is indexed into a primary shard determined by the routing formula:

shard = hash(routing) % number_of_primary_shards

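Routing defaults to the document _id but can be customized so that related documents land on the same shard, which lets a query target that shard instead of fanning out to all of them. Because the formula depends on the number of primary shards, that number is fixed when the index is created; changing it later requires reindexing or the split/shrink APIs.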
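The routing formula can be sketched in a few lines of Python. This is only an illustration: zlib.crc32 stands in for the hash function Elasticsearch uses internally, and the document IDs and the customer key are made up.

```python
import zlib


def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """Return the primary shard number for a routing value.

    zlib.crc32 is a stand-in hash chosen for this sketch only;
    Elasticsearch uses its own hash function internally.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


# By default the routing value is the document _id.
doc_ids = ["order-1", "order-2", "order-3", "order-4"]
placement = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}

# With custom routing (a hypothetical customer key), every document
# that shares the key maps to the same shard.
shards_for_customer = {pick_shard("customer-42", 5) for _ in doc_ids}
```

Since the shard number is taken modulo the primary shard count, a different count can yield a different shard for the same routing value, which is why the count cannot simply be edited on a live index.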

Cluster Architecture

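A cluster consists of master‑eligible nodes, one of which is elected master and manages the cluster state, and data nodes that hold shards. Nodes that share the same cluster.name setting discover each other and join the same cluster; before version 7.0 this mechanism was called Zen Discovery, and later versions replaced it with a new cluster coordination layer.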

Indexing Process

When a document is indexed, it is written to an in‑memory buffer and appended to the transaction log (translog). By default the buffer is refreshed every second: its contents become a new segment in the OS page cache, which makes the documents searchable (near‑real‑time) before they are durably committed. When the translog grows beyond a threshold (512 MB by default; older versions also flushed every 30 minutes), a flush occurs: any remaining buffer content is written to a new segment, the segments are fsynced to disk, a commit point is created, and the translog is cleared.
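The sequence of buffer, translog, refresh, and flush can be modeled with a toy class. This is a simplified sketch for building intuition, not Elasticsearch code; the ToyShard class and its fields are invented for the example.

```python
class ToyShard:
    """Toy model of the write path described above."""

    def __init__(self):
        self.buffer = []    # in-memory indexing buffer (not searchable)
        self.translog = []  # log of operations since the last flush
        self.segments = []  # searchable segments

    def index(self, doc):
        # A write goes to both the buffer and the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Turn the buffer into a new searchable, not yet committed segment.
        if self.buffer:
            self.segments.append({"docs": tuple(self.buffer), "committed": False})
            self.buffer.clear()

    def flush(self):
        # Commit all segments and clear the translog.
        self.refresh()
        for segment in self.segments:
            segment["committed"] = True
        self.translog.clear()

    def search(self):
        return [doc for segment in self.segments for doc in segment["docs"]]


shard = ToyShard()
shard.index({"id": 1, "title": "hello"})
visible_before_refresh = shard.search()   # nothing is searchable yet
shard.refresh()
visible_after_refresh = shard.search()    # the document is now searchable
translog_before_flush = len(shard.translog)
shard.flush()
translog_after_flush = len(shard.translog)
```

In this model, a document that has been refreshed but not flushed is searchable yet still depends on the translog for recovery, which matches the distinction the text draws between refresh and flush.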

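Segments are immutable; Elasticsearch periodically merges small segments into larger ones to reduce the number of open file handles and improve query performance. Because a segment cannot be changed in place, an update is a delete followed by the indexing of a new document. Deleted documents are only marked as deleted (in a .del file in older Lucene versions) and are physically removed during segment merges.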
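A merge can be pictured as copying only the live documents into a fresh segment. The sketch below assumes a toy segment layout (a dict with "docs" and a set of "deleted" IDs) that is invented for illustration and does not mirror Lucene's on-disk format.

```python
def merge_segments(segments):
    """Build one new segment from the live documents of several segments."""
    live_docs = tuple(
        doc
        for segment in segments
        for doc in segment["docs"]
        if doc["id"] not in segment["deleted"]
    )
    # The merged segment starts with no deletion marks.
    return {"docs": live_docs, "deleted": frozenset()}


# Document 1 was updated: the old copy is marked deleted in the old segment,
# and the new copy lives in a newer segment.
old_segment = {
    "docs": ({"id": 1, "title": "draft"}, {"id": 2, "title": "keep"}),
    "deleted": frozenset({1}),
}
new_segment = {"docs": ({"id": 1, "title": "final"},), "deleted": frozenset()}

merged = merge_segments([old_segment, new_segment])
merged_titles = [doc["title"] for doc in merged["docs"]]
```

The input segments are never modified; the merge produces a new segment, after which the old ones can be discarded.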

Search Process

A search request can be sent to any node, which then acts as the coordinating node. The coordinator forwards the query to one copy (primary or replica) of each relevant shard. In the query phase, each shard executes the query locally and returns a sorted list of matching document IDs with their sort values. The coordinator merges these lists, applies global sorting and pagination, and then, in the fetch phase, retrieves the full documents from the shards that hold them.
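The two phases can be sketched with in-memory dictionaries standing in for shards. This is an illustrative model only: the precomputed "score" field, the sample documents, and the function names are assumptions made for the example.

```python
import heapq


def query_phase(shard_docs, matches, from_, size):
    """Return (score, doc_id) for the shard's local top from_ + size hits."""
    hits = [
        (doc["score"], doc_id)
        for doc_id, doc in shard_docs.items()
        if matches(doc)
    ]
    return heapq.nlargest(from_ + size, hits)


def search(shards, matches, from_=0, size=10):
    # Query phase: gather lightweight (score, id, shard) entries from every shard.
    candidates = []
    for shard_no, shard_docs in enumerate(shards):
        for score, doc_id in query_phase(shard_docs, matches, from_, size):
            candidates.append((score, doc_id, shard_no))
    # Coordinator: global sort by descending score, then pagination.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    page = candidates[from_:from_ + size]
    # Fetch phase: load full documents only for the hits on the requested page.
    return [shards[shard_no][doc_id] for _, doc_id, shard_no in page]


shards = [
    {"a": {"title": "es intro", "score": 0.9}, "b": {"title": "lucene", "score": 0.4}},
    {"c": {"title": "es shards", "score": 0.7}, "d": {"title": "es replicas", "score": 0.2}},
]


def mentions_es(doc):
    return "es" in doc["title"].split()


first_page = [doc["title"] for doc in search(shards, mentions_es, from_=0, size=2)]
second_page = [doc["title"] for doc in search(shards, mentions_es, from_=2, size=2)]
```

Note that every shard has to return from_ + size entries even though only size documents are fetched, which is the root of the deep pagination cost.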

Performance Considerations

Refresh interval: defaults to 1 s; increase it during bulk indexing to reduce segment‑creation overhead.

Bulk API: use for high‑throughput writes, optionally setting number_of_replicas to 0 and refresh_interval to -1 during the load and restoring both afterwards.

Fielddata vs. doc values: avoid enabling fielddata to sort or aggregate on text fields, because it is loaded onto the JVM heap; use doc values (on by default for keyword, numeric, and date fields) to keep that data off the heap.

Deep pagination: avoid large from values; use scroll or search_after for efficient deep traversal.

Shard sizing: keep shard size reasonable (e.g., 30‑50 GB) and use the shrink API to reduce the shard count of cold indices.
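As a rough sketch of the bulk loading tip, the helpers below build the request bodies involved without sending anything. The settings body is meant for the index settings endpoint and the newline-delimited body for the _bulk endpoint; the index name, document fields, and function names are hypothetical.

```python
import json


def bulk_load_settings():
    """Settings to apply before a large load: no refreshes, no replicas."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(replicas=1, refresh_interval="1s"):
    """Settings to apply once the load has finished."""
    return {"index": {"refresh_interval": refresh_interval, "number_of_replicas": replicas}}


def bulk_body(index_name, docs):
    """Build a newline-delimited bulk body: one action line, then one source line."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


body = bulk_body(
    "articles",  # hypothetical index name
    [("1", {"title": "Shards"}), ("2", {"title": "Replicas"})],
)
```

Sending these bodies is left to whichever HTTP client or Elasticsearch client library is already in use.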

Common Pitfalls

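Real‑time requirements may necessitate querying the primary data store instead of Elasticsearch, because a newly indexed document is not searchable until the next refresh (1 second by default). Deep pagination can cause high memory and CPU usage, because every shard must collect from + size hits; by default the index.max_result_window setting rejects requests where from + size exceeds 10,000.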
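To make the pagination cost concrete, the sketch below estimates how many entries the coordinator has to gather for a from/size request and builds search_after request bodies as the alternative. The created_at and event_id sort fields are hypothetical; search_after needs a sort with a unique tiebreaker field of this kind.

```python
def entries_collected(from_, size, shard_count):
    """Entries the coordinator gathers when each shard returns from_ + size hits."""
    return (from_ + size) * shard_count


def first_page_body(size):
    """Search body for the first page, sorted on hypothetical fields."""
    return {
        "size": size,
        "query": {"match_all": {}},
        "sort": [{"created_at": "asc"}, {"event_id": "asc"}],
    }


def next_page_body(size, last_hit_sort):
    """Search body for the following page.

    last_hit_sort is the "sort" array of the last hit on the previous page.
    """
    body = first_page_body(size)
    body["search_after"] = list(last_hit_sort)
    return body


deep_page_cost = entries_collected(from_=9990, size=10, shard_count=5)
page_two = next_page_body(10, ["2024-01-01T00:00:00Z", "evt-0010"])
```

With search_after, each shard only needs to produce size hits past the given sort values, so the cost per page stays flat however deep the traversal goes.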

Interview Insights

Typical interview questions cover Elasticsearch’s write path (buffer → translog → refresh → segment → flush), search path (query phase → fetch phase), shard routing, cluster master election, and the underlying Lucene data structures such as inverted indexes, finite state transducers (FSTs) for the term dictionary, and block k‑d (BKD) trees for numeric range queries.
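Of the data structures listed, the inverted index is the easiest to sketch. The toy version below maps each term to a sorted list of document IDs; whitespace tokenization and the sample documents are simplifications chosen for the example, and Lucene's real implementation is far more elaborate.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all_terms(index, *terms):
    """Return IDs of documents that contain every given term."""
    if not terms:
        return []
    posting_sets = [set(index.get(term.lower(), [])) for term in terms]
    return sorted(set.intersection(*posting_sets))


docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene stores an inverted index",
    3: "Shards are Lucene indexes",
}
index = build_inverted_index(docs)
lucene_docs = index["lucene"]
both_terms = search_all_terms(index, "Lucene", "inverted")
```

A term lookup is a dictionary access followed by set operations on posting lists, rather than a scan over every document.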
