Elasticsearch Overview: Core Concepts, Architecture, and Performance Optimization
This article provides a comprehensive overview of Elasticsearch, covering its data types, Lucene-based inverted index, cluster architecture, sharding and replication mechanisms, mapping definitions, basic usage, health monitoring, storage internals, and practical performance tuning tips for large‑scale search deployments.
Data in Everyday Life
Data can be structured (tables stored in relational databases) or unstructured (documents, images, audio, video, etc.). Searching them divides accordingly into structured‑data search and, for unstructured data, full‑text search.
Lucene and Inverted Index
Apache Lucene provides the core inverted‑index technology used by Elasticsearch. An inverted index maps each term to the documents that contain it, enabling fast full‑text retrieval.
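Term        | Doc_1 | Doc_2 | Doc_3
-------------------------------------
Java        |   X   |       |
is          |   X   |   X   |   X
the         |   X   |   X   |   X
best        |   X   |   X   |   X
programming |   X   |   X   |   X
language    |   X   |   X   |   X
PHP         |       |   X   |
Javascript  |       |       |   X
Key terminology includes Term (a token produced by analysis), Term Dictionary (the set of all terms), Posting List (the list of documents that contain a given term), and Inverted File (the file in which the posting lists are stored).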
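The table above can be reproduced with a few lines of Python. This is a minimal sketch, not how Lucene stores its index: the three document texts are assumed from the table, the function names are made up for illustration, and terms are lowercased and split on whitespace as a rough stand-in for real text analysis.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of doc IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # The keys form the term dictionary; each value is a posting list.
    return {term: sorted(ids) for term, ids in postings.items()}

def search(index, query):
    """Return the doc IDs that contain every term of the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Document texts assumed from the table above.
docs = {
    "Doc_1": "Java is the best programming language",
    "Doc_2": "PHP is the best programming language",
    "Doc_3": "Javascript is the best programming language",
}
index = build_inverted_index(docs)
print(index["java"])                   # ['Doc_1']
print(search(index, "PHP language"))   # ['Doc_2']
```

A lookup touches only the posting lists of the query terms, which is why retrieval stays fast as the number of documents grows.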
Elasticsearch Core Concepts
Elasticsearch is a distributed, near‑real‑time search and analytics engine built on Lucene. It offers a simple RESTful API, automatic clustering, sharding, replication, and high availability.
Cluster
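A cluster consists of one or more nodes that share the same cluster.name. In Elasticsearch versions before 7.0, node discovery and master election are handled by Zen Discovery, which supports unicast and file‑based discovery (7.0 and later use discovery.seed_hosts instead of the Zen setting).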
discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]
Node roles are configured in elasticsearch.yml (e.g., node.master: true, node.data: true).
Sharding and Replicas
Indices are split into primary shards; each primary can have replica shards for fault tolerance. Document routing uses the formula:
shard = hash(routing) % number_of_primary_shards
Routing defaults to the document _id but can be customized.
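The routing formula can be sketched in Python as follows. This is an illustration under stated assumptions: zlib.crc32 is only a stand-in hash chosen because it is deterministic and in the standard library (Elasticsearch uses its own Murmur3-based hash), and pick_shard and the doc-N IDs are hypothetical names.

```python
import zlib
from collections import Counter

def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards"""
    # crc32 is a stand-in for this sketch, not the hash Elasticsearch uses.
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

doc_ids = [f"doc-{i}" for i in range(100)]  # hypothetical document _id values

# How the 100 documents spread over 5 primary shards.
distribution = Counter(pick_shard(doc_id, 5) for doc_id in doc_ids)

# Documents that would land on a different shard if the index had 6 primaries.
moved = [d for d in doc_ids if pick_shard(d, 5) != pick_shard(d, 6)]
print(sorted(distribution.items()))
print(len(moved), "of", len(doc_ids), "documents would change shard")
```

Because the divisor is part of the formula, changing the number of primary shards changes where existing documents are expected to live, which is why that number cannot simply be edited on a live index.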
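Both counts are set in the index settings. The number of primary shards cannot be changed in place after the index is created, while the number of replicas can be adjusted at any time. Index names must be lowercase:
PUT /my_index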
{
"settings": {
"number_of_shards": 5,
"number_of_replicas": 1
}
}
Mapping
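Mappings define field types (text, keyword, integer, date, etc.) and can be dynamic or explicit. Example of an explicit mapping (6.x syntax with the _doc mapping type; from 7.0 on, mapping types are removed and properties sits directly under mappings):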
PUT my_index
{
"settings": {
"number_of_shards": 5,
"number_of_replicas": 1
},
"mappings": {
"_doc": {
"properties": {
"title": {"type": "text"},
"name": {"type": "text"},
"age": {"type": "integer"},
"created": {"type": "date", "format": "strict_date_optional_time||epoch_millis"}
}
}
}
}
Basic Usage
Download and unzip Elasticsearch, then start it with bin/elasticsearch. The default HTTP port is 9200; accessing http://localhost:9200 returns a JSON object with cluster name, node name, version, and tagline.
Cluster Health
Health status is reported as green (all primary and replica shards active), yellow (all primaries active but some replicas missing), or red (one or more primary shards unavailable).
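The three colours follow mechanically from the shard states. The function below is a minimal sketch of that rule, not the Elasticsearch implementation; cluster_health and the (is_primary, is_active) tuple layout are assumptions made for the example.

```python
def cluster_health(shards):
    """Classify health from an iterable of (is_primary, is_active) pairs."""
    shards = list(shards)
    if any(is_primary and not is_active for is_primary, is_active in shards):
        return "red"      # at least one primary shard is unavailable
    if any(not is_active for _, is_active in shards):
        return "yellow"   # all primaries active, some replica is not
    return "green"        # every primary and replica is active

# One primary with one replica on a single-node cluster: the replica cannot be
# placed on the same node as its primary, so it stays unassigned.
single_node = [(True, True), (False, False)]
print(cluster_health(single_node))  # yellow
```

This is also why a fresh single-node installation with number_of_replicas set to 1 typically reports yellow rather than green.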
Write Path and Storage
When a document is indexed, it is first written to an in‑memory buffer and appended to the transaction log (translog). A refresh, which runs every 1 s by default or when memory thresholds are reached, turns the buffer into a new immutable segment in the file‑system cache, making the data searchable. When the translog reaches 512 MB or is 30 minutes old, a flush fsyncs the segments to disk, writes a commit point, and clears the translog.
Segments are immutable; deletions are recorded in a .del file and physically removed only during background segment merging, which also consolidates small segments into larger ones to reduce file‑handle and CPU overhead.
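The interplay of buffer, translog, refresh, flush, and merge can be illustrated with a toy model. This is only a sketch of the sequence described above: ToyShard and all of its attributes are hypothetical, segments are plain tuples, and nothing is actually written to disk.

```python
class ToyShard:
    """Toy model of the write path: buffer + translog -> refresh -> flush -> merge."""

    def __init__(self):
        self.buffer = []         # in-memory indexing buffer, not yet searchable
        self.translog = []       # operation log, cleared by flush
        self.segments = []       # immutable segments (tuples), searchable
        self.commit_point = ()   # segments covered by the last flush
        self.deleted = set()     # stand-in for the .del markers

    def index(self, doc_id, text):
        self.buffer.append((doc_id, text))
        self.translog.append(("index", doc_id, text))

    def delete(self, doc_id):
        # Segments are never modified; the document is only marked as deleted.
        self.deleted.add(doc_id)
        self.translog.append(("delete", doc_id))

    def refresh(self):
        # Turn the buffer into a new segment; the translog is left untouched.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        self.refresh()
        self.commit_point = tuple(self.segments)
        self.translog = []

    def merge(self):
        # Rewrite all segments into one, dropping documents marked as deleted.
        merged = tuple(doc for seg in self.segments for doc in seg)
        live = tuple(doc for doc in merged if doc[0] not in self.deleted)
        self.deleted -= {doc_id for doc_id, _ in merged}
        self.segments = [live] if live else []

    def search(self, word):
        return [doc_id for seg in self.segments for doc_id, text in seg
                if doc_id not in self.deleted and word in text.split()]

shard = ToyShard()
shard.index("1", "hello world")
print(shard.search("hello"))  # [] because the document is still in the buffer
shard.refresh()
print(shard.search("hello"))  # ['1']
```

The gap between index() and refresh() is the "near" in near-real-time search, and the translog covers documents that have been acknowledged but not yet flushed.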
Performance Optimizations
Use SSDs, RAID‑0, or multiple path.data directories to maximize I/O throughput.
Avoid remote mounts (NFS/SMB) and be cautious with cloud block storage such as AWS EBS.
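Lucene already compresses the term index with an FST (finite state transducer); on top of that, tune index.refresh_interval and disable replicas during bulk indexing (set index.number_of_replicas: 0), restoring both settings afterwards.
Allocate JVM heap (Xms = Xmx) to no more than 50 % of physical RAM, keep it below roughly 32 GB so that compressed object pointers stay enabled, leave the rest to the file‑system cache, and consider G1GC for better pause‑time behavior.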
Prefer keyword fields over text when sorting/aggregating, and disable doc values on fields that do not require them.
Use scroll APIs instead of deep pagination to avoid costly from+size queries.
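The cost of deep pagination mentioned in the last tip can be estimated with simple arithmetic. The sketch below is a simplified model that only counts sort entries; both function names are made up for illustration. With from+size, every shard has to collect from + size hits and the coordinating node has to merge all of them, whereas a cursor-style request such as scroll only needs size hits per shard for each page.

```python
def from_size_cost(from_, size, number_of_shards):
    """Sort entries handled for one from+size page (simplified model)."""
    per_shard = from_ + size
    return {
        "per_shard": per_shard,
        "coordinator": per_shard * number_of_shards,
        "returned": size,
    }

def cursor_page_cost(size, number_of_shards):
    """Sort entries handled for one cursor-style page (simplified model)."""
    return {
        "per_shard": size,
        "coordinator": size * number_of_shards,
        "returned": size,
    }

# Page 1000 with 10 hits per page on an index with 5 primary shards.
print(from_size_cost(9990, 10, 5))
# {'per_shard': 10000, 'coordinator': 50000, 'returned': 10}
print(cursor_page_cost(10, 5))
# {'per_shard': 10, 'coordinator': 50, 'returned': 10}
```

In this example 50,000 entries are sorted to return 10 hits, which is the kind of overhead the index.max_result_window limit (10,000 by default) is meant to guard against.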
