Elasticsearch Overview: Architecture, Core Concepts, and Performance Optimization
This article provides a comprehensive introduction to Elasticsearch, covering its underlying Lucene-based inverted index, data types, shard routing, cluster roles, discovery mechanisms, refresh and translog handling, segment merging, and practical performance and JVM tuning tips for building scalable, near‑real‑time search systems.
Elasticsearch is a Java‑based open‑source search engine built on top of Apache Lucene, providing distributed, near‑real‑time full‑text search and analytics capabilities.
Data Types : Elasticsearch distinguishes between text (analyzed for full‑text search) and keyword (exact value for filtering, sorting, and aggregations), as well as numeric, date, and other specialized types.
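The difference between the two string types can be illustrated with a toy inverted index. This is a minimal sketch, not Lucene's implementation: `analyze_text` is a hypothetical stand-in for an analyzer (it only lowercases and splits on non-alphanumeric characters), and all function names are made up for illustration.

```python
import re


def analyze_text(value: str) -> list[str]:
    """Toy analyzer (assumption): lowercase, then split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", value.lower()) if tok]


def index_keyword(value: str) -> list[str]:
    """A keyword field is indexed as a single, unmodified term."""
    return [value]


def build_inverted_index(docs: dict[int, str], to_terms) -> dict[str, set[int]]:
    """Map each term to the set of document ids that contain it."""
    index: dict[str, set[int]] = {}
    for doc_id, value in docs.items():
        for term in to_terms(value):
            index.setdefault(term, set()).add(doc_id)
    return index


docs = {1: "Quick Brown Fox", 2: "quick start guide"}
text_index = build_inverted_index(docs, analyze_text)
keyword_index = build_inverted_index(docs, index_keyword)

print(sorted(text_index["quick"]))               # [1, 2]: the token matches both
print(keyword_index.get("quick", set()))         # set(): no field equals "quick"
print(sorted(keyword_index["Quick Brown Fox"]))  # [1]: exact value only
```

A full-text query for "quick" finds both documents through the text-style index, while the keyword-style index only answers exact-value lookups, which is what filtering, sorting, and aggregations need.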
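Index Structure : An index is divided into primary shards, each of which can have one or more replica shards. Documents are routed to a specific primary shard using the formula shard = hash(routing) % number_of_primary_shards, where routing defaults to the document _id but can be customized. Because the result depends on number_of_primary_shards, that number is fixed when the index is created.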
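The routing formula can be sketched in a few lines. This is an illustration under stated assumptions: `zlib.crc32` is only a stand-in hash (not the hash Elasticsearch uses internally), and `route_to_shard` and the sample ids are hypothetical.

```python
import zlib


def route_to_shard(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards.

    zlib.crc32 is a stand-in hash chosen for illustration only.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = ["user-1", "user-2", "user-3", "user-4"]

# Default routing: the document _id decides the shard.
with_5_shards = {doc_id: route_to_shard(doc_id, 5) for doc_id in doc_ids}
# The same ids against a different shard count generally map elsewhere.
with_6_shards = {doc_id: route_to_shard(doc_id, 6) for doc_id in doc_ids}

print(with_5_shards)
print(with_6_shards)

# Custom routing: every document sharing a routing value lands on one shard.
print(route_to_shard("tenant-a", 5) == route_to_shard("tenant-a", 5))  # True
```

The same routing value always resolves to the same shard, so a lookup by _id only has to visit one shard. Changing the shard count changes the modulus, which is why existing documents could no longer be found if the number of primary shards were altered in place.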
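Cluster Components : Nodes can serve as master‑eligible, data, or coordinating nodes. The elected master node manages cluster state and shard allocation. In versions before 7.0, nodes find each other and elect a master through the built‑in Zen Discovery module, which uses a unicast host list and quorum‑based election rules to avoid split‑brain scenarios; version 7.0 and later replace it with a new cluster coordination layer.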
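The split-brain protection in Zen Discovery rests on a quorum of master-eligible nodes, configured through `discovery.zen.minimum_master_nodes` (a pre-7.0 setting that the article does not name). A minimal sketch of the usual quorum rule, with hypothetical function names:

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum of master-eligible nodes: (N / 2) + 1, using integer division."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect_master(visible: int, total_master_eligible: int) -> bool:
    """A partition may elect a master only if it sees a quorum."""
    return visible >= minimum_master_nodes(total_master_eligible)


# A 3-node cluster that splits into partitions of 2 nodes and 1 node:
print(minimum_master_nodes(3))  # 2
print(can_elect_master(2, 3))   # True: the majority side keeps a master
print(can_elect_master(1, 3))   # False: the isolated node cannot elect itself
```

Because at most one partition can hold a majority, two masters cannot be elected at the same time.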
Write Path : Incoming documents are first written to an in‑memory indexing buffer on the JVM heap and appended to the transaction log (translog) for durability. A refresh (by default every 1 s) makes newly indexed data searchable by writing the buffer out as a new immutable segment in the filesystem cache, without forcing it to disk. When the translog reaches 512 MB or 30 minutes have passed, a flush syncs the segments to disk, writes a commit point, and clears the translog.
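The ordering of buffer, translog, refresh, and flush can be modelled with a small in-memory class. This is a toy sketch, not how Elasticsearch is implemented: `ToyShard` and its attributes are hypothetical, and "search" is just a substring match.

```python
class ToyShard:
    """Toy model of the write path: buffer -> refresh -> flush."""

    def __init__(self) -> None:
        self.buffer: list[str] = []               # indexed but not yet searchable
        self.translog: list[str] = []             # durability log of operations
        self.segments: list[tuple[str, ...]] = [] # immutable, searchable segments
        self.committed_segments = 0               # segments covered by the commit point

    def index(self, doc: str) -> None:
        # A write goes to the in-memory buffer and to the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self) -> None:
        # Turn the buffer into a new immutable segment; translog is untouched.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer.clear()

    def flush(self) -> None:
        # Make everything durable: refresh, record a commit point, clear translog.
        self.refresh()
        self.committed_segments = len(self.segments)
        self.translog.clear()

    def search(self, term: str) -> list[str]:
        # Only documents inside segments are visible to search.
        return [doc for seg in self.segments for doc in seg if term in doc]


shard = ToyShard()
shard.index("hello world")
print(shard.search("hello"))   # []: not searchable before a refresh
shard.refresh()
print(shard.search("hello"))   # ['hello world']
print(len(shard.translog))     # 1: still needed for recovery until a flush
shard.flush()
print(len(shard.translog), shard.committed_segments)  # 0 1
```

The gap between `index` and `refresh` is what makes search near-real-time rather than real-time, and the translog covers the window between a refresh and the next flush.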
Segment Model : Segments are immutable on disk; deletions are recorded in a .del file, and updates are treated as delete + insert. Periodic background segment merging consolidates small segments, removes deleted documents, and reduces the number of file handles, improving search performance.
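A toy model makes the delete-plus-insert behaviour and the effect of merging concrete. The names below (`Doc`, `update`, `merge`, `tombstones`) are hypothetical, and the tombstone set merely plays the role of the deletion markers described above.

```python
from typing import NamedTuple


class Doc(NamedTuple):
    doc_id: str
    version: int
    source: str


def update(segments: list, tombstones: set, doc_id: str, new_source: str) -> None:
    """Update = mark the live copy as deleted, then add a new version
    in a fresh segment. Existing segments are never modified."""
    next_version = 1
    for segment in segments:
        for doc in segment:
            if doc.doc_id == doc_id and doc not in tombstones:
                tombstones.add(doc)
                next_version = doc.version + 1
    segments.append((Doc(doc_id, next_version, new_source),))


def merge(segments: list, tombstones: set) -> tuple[list, set]:
    """Rewrite all segments into one, dropping documents marked as deleted."""
    live = tuple(doc for seg in segments for doc in seg if doc not in tombstones)
    return [live], set()


segments = [(Doc("1", 1, "draft"), Doc("2", 1, "other"))]
tombstones: set = set()

update(segments, tombstones, "1", "final")
print(len(segments), sum(len(s) for s in segments))  # 2 3: old copy still stored

segments, tombstones = merge(segments, tombstones)
print(len(segments), sum(len(s) for s in segments))  # 1 2: old copy removed
```

Until the merge runs, the superseded version still occupies space and must be filtered out at search time, which is why merging improves both disk usage and query speed.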
Performance Tuning :
Use SSDs and RAID 0 or multiple path.data directories for higher I/O throughput.
Raise index.refresh_interval (e.g., to 30s, or to -1 to disable refresh during bulk loads) and temporarily lower number_of_replicas; restore both once the load has finished.
Prefer ordered, compressible document IDs over random UUIDs.
Disable doc_values on fields that are never aggregated or sorted.
Use keyword instead of text when full‑text analysis is unnecessary.
Use the scroll API for deep pagination instead of large from + size offsets.
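The refresh and replica tips above are usually applied as a pair of calls to the update index settings API (`PUT <index>/_settings`) around a bulk load. The sketch below only builds and prints the request bodies; the helper names are hypothetical, `my_index` is borrowed from the sample further down, and actually sending the requests needs an HTTP client and a running cluster, which are left out.

```python
import json


def bulk_load_settings() -> dict:
    """Body for the settings call made before a bulk load:
    disable refresh and drop replicas."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(refresh_interval: str = "1s", number_of_replicas: int = 1) -> dict:
    """Body for the settings call made after the load: put the values back.
    The defaults here are assumptions; use whatever the index had before."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": number_of_replicas,
        }
    }


for label, body in (("before", bulk_load_settings()), ("after", restore_settings())):
    print(f"# {label} bulk load")
    print("PUT my_index/_settings")
    print(json.dumps(body, indent=2))
```

Keeping the "before" and "after" bodies side by side makes it harder to forget the restore step, which would otherwise leave the index without replicas and without automatic refresh.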
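JVM Settings : Set -Xms and -Xmx to the same value so the heap is never resized at runtime. Keep that value at no more than 50 % of physical RAM and below roughly 32 GB, so the JVM can keep using compressed object pointers. Consider the G1 garbage collector, and allocate sufficient heap for indexing while leaving ample memory for the filesystem cache.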
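The sizing rule can be written as a small calculation. This is a sketch under stated assumptions: the function names are hypothetical, and the 31 GB cap is a conservative choice to stay under the 32 GB limit mentioned above, not a value taken from the article.

```python
def recommended_heap_gb(physical_ram_gb: float, cap_gb: float = 31.0) -> int:
    """Half of physical RAM, capped below 32 GB (cap_gb is an assumption),
    rounded down to whole gigabytes with a floor of 1 GB."""
    return max(1, int(min(physical_ram_gb * 0.5, cap_gb)))


def jvm_heap_flags(physical_ram_gb: float) -> list[str]:
    """Return matching -Xms/-Xmx flags so initial and maximum heap are equal."""
    heap = recommended_heap_gb(physical_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


print(jvm_heap_flags(16))   # ['-Xms8g', '-Xmx8g']
print(jvm_heap_flags(128))  # ['-Xms31g', '-Xmx31g']
```

On the 128 GB machine, the memory left over beyond the heap is not wasted: the operating system uses it as filesystem cache for segment files.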
Sample Index Creation :
<code style="padding:0.5em;line-height:18px;font-size:14px;letter-spacing:0px;font-family:Consolas,Inconsolata,Courier,monospace;color:#a9b7c6;background-color:#282b2e;display:-webkit-box !important">PUT my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {"type": "text"},
        "name": {"type": "text"},
        "age": {"type": "integer"},
        "created": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    }
  }
}
</code>
Discovery Configuration Example :
<code style="padding:0.5em;line-height:18px;font-size:14px;letter-spacing:0px;font-family:Consolas,Inconsolata,Courier,monospace;color:#a9b7c6;background-color:#282b2e;display:-webkit-box !important">discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]
</code>
By understanding these core concepts and applying the recommended configuration tweaks, developers can build robust, scalable Elasticsearch clusters that deliver fast search and analytics on large volumes of structured and unstructured data.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.