How Elasticsearch Powers Real-Time Search: Core Concepts and Best Practices
This article provides a comprehensive overview of Elasticsearch, explaining its underlying Lucene technology, data modeling, cluster architecture, shard and replica mechanisms, indexing workflow, storage strategies, refresh and translog processes, as well as practical performance and JVM tuning tips for building scalable, near‑real‑time search solutions.
1. Data in Everyday Life
Search engines retrieve data, which falls into two broad categories: structured data (tables stored in relational databases) and unstructured data (documents, emails, images, audio, video, etc.). Formats such as XML or HTML sit in between and are often described as semi‑structured.
2. Introduction to Lucene
Relational databases are poorly suited to searching unstructured text, so full‑text search engines are needed. Apache Lucene is an open‑source Java library that provides inverted‑index capabilities. Solr and Elasticsearch are both built on top of Lucene; Elasticsearch adds built‑in distribution and is easy to install.
Lucene creates an inverted index by tokenizing documents into terms and recording which documents contain each term.
Term        | Doc_1 | Doc_2 | Doc_3
------------------------------------
Java        |   X   |       |
is          |   X   |   X   |   X
the         |   X   |   X   |   X
best        |   X   |   X   |   X
programming |   X   |   X   |   X
language    |   X   |   X   |   X
PHP         |       |   X   |
Javascript  |       |       |   X
------------------------------------
Key terms include Term, Term Dictionary, Posting List (inverted list), and Inverted File.
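As a rough illustration of the table above, the following Python sketch builds an inverted index over three sample sentences. The sentences themselves and the whitespace‑plus‑lowercase tokenizer are assumptions made for the example; Lucene's real analyzers and on‑disk structures are far more elaborate.

```python
from collections import defaultdict

# Three sample documents keyed by document ID (assumed wording, matching the table).
docs = {
    1: "Java is the best programming language",
    2: "PHP is the best programming language",
    3: "Javascript is the best programming language",
}


def build_inverted_index(documents):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        # Whitespace tokenization plus lowercasing: a crude stand-in for an analyzer.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_inverted_index(docs)
print(index["java"])      # posting list for the term "java"
print(index["language"])  # posting list for the term "language"
```

The keys of index play the role of the term dictionary, and each value is the posting list for that term.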
3. Core Concepts of Elasticsearch (ES)
Elasticsearch is a Java‑based open‑source search engine that wraps Lucene and exposes a simple RESTful API. It is a distributed, near‑real‑time document store and analytics engine capable of handling petabyte‑scale structured and unstructured data.
Cluster
A cluster consists of one or more nodes that share the same cluster.name. In Elasticsearch 6.x, nodes discover each other via Zen Discovery (unicast by default). The elected master node manages index creation, shard allocation, and cluster state.
Node Roles
Master‑eligible node (node.master)
Data node (node.data)
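Master‑eligible nodes handle cluster coordination; data nodes store shards and execute indexing and search requests. A heavy data workload can starve a node that is also acting as master, so larger clusters should run dedicated master‑eligible nodes separate from busy data nodes.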
Split‑Brain Prevention
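To avoid electing multiple masters, Elasticsearch 6.x relies on a quorum set by discovery.zen.minimum_master_nodes, which should be (number of master‑eligible nodes / 2) + 1, using integer division. The cluster remains functional as long as a majority of master‑eligible nodes can reach each other. From 7.0 onward the setting is ignored because the cluster manages its voting quorum itself.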
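The majority rule can be sketched in a few lines of Python. The function names here are hypothetical helpers for illustration, not part of any Elasticsearch client.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum size: a strict majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect_master(master_eligible: int, reachable: int) -> bool:
    """True if the reachable master-eligible nodes still form a majority."""
    return reachable >= minimum_master_nodes(master_eligible)


for n in (1, 2, 3, 5):
    print(n, "master-eligible nodes -> quorum of", minimum_master_nodes(n))
```

With three master‑eligible nodes the quorum is two, so a network partition that isolates a single node cannot produce a second master. With only two master‑eligible nodes the quorum is also two, which is why an odd number of at least three is the usual choice.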
4. Shards and Replicas
Indexes are horizontally split into primary shards; each primary can have replica shards for high availability. The number of primary shards is fixed at index creation.
PUT /my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
Writes go to the primary shard and are then replicated to its replicas. Reads can be served by any shard copy.
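To see what these two settings mean for a cluster, the sketch below counts shard copies. It relies on one allocation rule, that copies of the same shard are never placed on the same node, and deliberately ignores every other allocation constraint (disk watermarks, awareness attributes, and so on). The function name is hypothetical.

```python
def shard_allocation(primaries: int, replicas: int, data_nodes: int):
    """Return (assigned, unassigned) shard copies under the same-node rule only."""
    copies_per_shard = 1 + replicas
    # Each copy of a given shard needs its own node.
    assignable_per_shard = min(copies_per_shard, data_nodes)
    assigned = primaries * assignable_per_shard
    unassigned = primaries * copies_per_shard - assigned
    return assigned, unassigned


print(shard_allocation(5, 1, 1))  # single node: replicas cannot be placed
print(shard_allocation(5, 1, 2))  # two nodes: every copy has a home
```

On a single node, the five replicas of a 5‑primary, 1‑replica index stay unassigned, which is why a one‑node development cluster typically reports a yellow health status.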
5. Mapping
Mapping defines field types, analyzers, and storage options, similar to a database schema. Types include text (full‑text searchable) and keyword (exact match, suitable for sorting and aggregations). Mapping can be dynamic or explicit.
PUT my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {"type": "text"},
        "name": {"type": "text"},
        "age": {"type": "integer"},
        "created": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    }
  }
}
6. Basic Usage
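Download and unzip the distribution, then start Elasticsearch with bin/elasticsearch. By default it listens for HTTP requests on port 9200; a GET request to http://localhost:9200 returns basic node and version information: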
{
  "name" : "U7fp3O9",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "-Rj8jGQvRIelGd9ckicUOA",
  "version" : {
    "number" : "6.8.1",
    "build_flavor" : "default",
    "build_type" : "zip",
    "build_hash" : "1fad4e1",
    "build_date" : "2019-06-18T13:16:52.517138Z",
    "lucene_version" : "7.7.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
Cluster Health
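The cluster health API (GET _cluster/health) reports a status of green, yellow, or red. Green means all primary and replica shards are allocated; yellow means all primaries are allocated but at least one replica is not; red means at least one primary shard is unassigned. A single‑node cluster that cannot place its replicas reports yellow: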
{
  "cluster_name" : "wujiajian",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 9,
  "active_shards" : 9,
  "unassigned_shards" : 5,
  "active_shards_percent_as_number" : 64.28571428571429
}
7. Internal Mechanisms
Write Path
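Documents are routed to a primary shard using shard = hash(routing) % number_of_primary_shards, where routing defaults to the document _id. Because the result depends on the shard count, the number of primary shards cannot be changed after index creation. The node that receives the request acts as the coordinating node and forwards it to the node holding that primary, which indexes the document and then forwards it to its replicas.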
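The routing formula can be tried out with a short Python sketch. zlib.crc32 is used here purely as a stand‑in hash (Elasticsearch uses its own Murmur3‑based function), and the document IDs are made up for the example.

```python
import zlib


def route(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards, with CRC32 as a stand-in hash."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


# Hypothetical document IDs; routing defaults to the _id.
doc_ids = [f"doc-{i}" for i in range(1000)]

before = {d: route(d, 5) for d in doc_ids}  # index created with 5 primaries
after = {d: route(d, 6) for d in doc_ids}   # what a 6th primary would do

moved = sum(1 for d in doc_ids if before[d] != after[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Most documents land on a different shard once the divisor changes, so lookups by ID would go to the wrong place. That is the reason the primary shard count is fixed when the index is created.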
Storage Model
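Each shard is a Lucene index made up of immutable segments on disk. A new document is first added to an in‑memory buffer and appended to the translog; a refresh turns the buffer into a new searchable segment, and a flush fsyncs the segments to disk. Because segments are immutable, deletions are recorded in .del files, and an update is a delete followed by an add.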
Refresh and Flush
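Refresh makes recent writes searchable (by default every second, controlled by index.refresh_interval). Flush commits segments to disk and clears the translog; by default it is triggered when the translog exceeds 512 MB (index.translog.flush_threshold_size), and older releases also flushed every 30 minutes.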
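The interplay of buffer, translog, refresh, and flush can be pictured with a toy model. The ToyShard class below is a hypothetical simplification for illustration only: real segments are Lucene files, and here a flush simply writes out whatever is still buffered before clearing the translog.

```python
class ToyShard:
    """Toy model of one shard's write path; not how Lucene is implemented."""

    def __init__(self):
        self.buffer = []    # indexed but not yet searchable
        self.translog = []  # durability log, replayed after a crash
        self.segments = []  # immutable, searchable segments
        self.committed = 0  # number of segments considered fsynced

    def index(self, doc):
        # Every write goes to the in-memory buffer and to the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Turn the buffer into a new immutable, searchable segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        # Write out anything still buffered, commit, then drop the translog.
        self.refresh()
        self.committed = len(self.segments)
        self.translog = []

    def search(self, predicate):
        # Only documents that have reached a segment are visible.
        return [d for seg in self.segments for d in seg if predicate(d)]


shard = ToyShard()
shard.index({"id": 1, "title": "hello"})
print(shard.search(lambda d: True))  # not visible before a refresh
shard.refresh()
print(shard.search(lambda d: True))  # visible after the refresh
shard.flush()
print(len(shard.translog))           # translog is empty after the flush
```

The gap between index() and refresh() is what makes search near‑real‑time rather than real‑time, while the translog covers the window between a write and the next flush.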
Segment Merging
Background merges combine small segments into larger ones, reclaiming space from deleted documents and reducing the number of file handles.
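A minimal sketch of the idea, assuming segments are plain tuples of documents and deletions are tracked as a set of IDs (both are simplifications invented for this example):

```python
def merge_segments(segments, deleted_ids):
    """Combine segments into one, dropping documents marked as deleted."""
    merged = [
        doc
        for seg in segments
        for doc in seg
        if doc["id"] not in deleted_ids
    ]
    return [tuple(merged)]


segments = [
    ({"id": 1}, {"id": 2}),
    ({"id": 3},),
    ({"id": 4}, {"id": 5}),
]
deleted = {2, 4}  # stands in for the entries recorded in .del files

merged = merge_segments(segments, deleted)
print(len(segments), "segments merged into", len(merged))
print([doc["id"] for doc in merged[0]])
```

Deleted documents only stop consuming space once a merge rewrites the segments that contained them.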
8. Performance Optimization
Hardware
Use SSDs and RAID 0 for high I/O throughput.
Avoid remote network mounts (NFS, SMB).
Prefer local instance storage over cloud block storage when possible.
Index Internals
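Lucene stores terms in a sorted term dictionary on disk and keeps a compressed Finite State Transducer (FST) in memory as an index over it. The FST maps term prefixes to the dictionary block that may contain the term, so a lookup touches only a small part of the dictionary while the in‑memory structure stays compact.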
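The two‑level lookup can be approximated with a sorted term list split into blocks plus a small in‑memory index of block boundaries. This is not an FST: a plain list searched with bisect stands in for it, and the block size and term list are assumptions chosen for the example.

```python
import bisect

# Sorted term dictionary, split into fixed-size "on-disk" blocks.
terms = sorted(
    ["best", "is", "java", "javascript", "language", "php", "programming", "the"]
)
BLOCK_SIZE = 3
blocks = [terms[i:i + BLOCK_SIZE] for i in range(0, len(terms), BLOCK_SIZE)]

# Small in-memory index: the first term of each block.
# This plays the role of the FST, in a greatly simplified form.
block_starts = [block[0] for block in blocks]


def lookup(term: str) -> bool:
    """Return True if the term exists, scanning only one block."""
    i = bisect.bisect_right(block_starts, term) - 1
    if i < 0:
        return False  # term sorts before the first block
    return term in blocks[i]


print(lookup("java"), lookup("php"), lookup("ruby"))
```

Only block_starts has to stay in memory; the blocks themselves could be read from disk on demand, which is the property the term index is designed to provide.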
Configuration Tweaks
Use sequential, compressible IDs instead of random UUIDs.
Disable doc values on fields that are not used for sorting or aggregations.
Prefer keyword over text when full‑text search is unnecessary.
Increase index.refresh_interval for bulk indexing (e.g., set it to -1 to disable refresh during the load), then restore it afterwards.
Set index.number_of_replicas to 0 during massive imports, then restore.
Use scroll APIs instead of deep pagination with from+size.
Limit mapping fields to those required for search, aggregation, or sorting.
Provide explicit routing values to target specific shards.
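As a sketch of the two bulk‑import tweaks above, the following Python builds the request bodies for the index settings API. It only prints the requests and does not contact a cluster; the index name my_index, the helper names, and the restored values (1s, one replica) are assumptions for the example.

```python
import json


def bulk_import_settings():
    """Settings body to apply before a large import."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(refresh_interval="1s", replicas=1):
    """Settings body to apply once the import has finished."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }


for body in (bulk_import_settings(), restore_settings()):
    print("PUT /my_index/_settings")
    print(json.dumps(body, indent=2))
```

Both settings are dynamic, so they can be changed on a live index without closing or recreating it.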
JVM Tuning
Set -Xms and -Xmx to the same value: no more than 50 % of physical RAM, and below about 32 GB so the JVM can keep using compressed object pointers.
Consider G1GC over CMS for reduced stop‑the‑world pauses.
Ensure ample free memory for the operating system’s file‑system cache.
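The heap sizing rule can be expressed as a small calculation. The 31 GB cap is an assumed safety margin below the 32 GB limit mentioned above, and the function names are hypothetical.

```python
def recommended_heap_gb(physical_ram_gb: float, cap_gb: float = 31.0) -> float:
    """Half of physical RAM, capped just below the 32 GB limit (assumed 31 GB)."""
    return min(physical_ram_gb / 2, cap_gb)


def jvm_options(physical_ram_gb: float):
    """Return matching -Xms/-Xmx flags so the heap never resizes at runtime."""
    heap_mb = int(recommended_heap_gb(physical_ram_gb) * 1024)
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]


for ram in (16, 64, 128):
    print(ram, "GB RAM ->", " ".join(jvm_options(ram)))
```

Whatever RAM is not given to the heap remains available to the operating system's file‑system cache, which Lucene relies on heavily for segment reads.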
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
