Master Elasticsearch: Core Concepts, APIs, Mapping, and Performance Tuning
This comprehensive guide explains Elasticsearch fundamentals—including documents, indices, nodes, clusters, REST and Document APIs, query DSL, relevance scoring, distributed architecture, real‑time indexing, search execution, pagination, scroll, aggregations, data modeling, mapping options, parent/child relationships, reindexing, and practical cluster and write/read performance optimizations.
Elasticsearch Overview
Elasticsearch is a distributed search and analytics engine built on Apache Lucene. It stores data as JSON documents, organizes them into indices, and runs on a cluster of nodes.
Basic Concepts
Document : A JSON object containing fields (e.g., text, keyword, long, boolean, date, binary, range types, etc.). Each document has a unique _id and metadata fields such as _index, _type, _uid, _source, and _all (disabled by default).
Index : A collection of documents with the same mapping, identified by its name. Before 6.0 an index could hold multiple types; 6.x allows only one type per index, and types are deprecated in 7.x.
Node : A running Elasticsearch instance that forms part of a cluster.
Cluster : A group of nodes that share the same cluster name and provide indexing and search services.
REST API
Elasticsearch exposes a RESTful HTTP API. Common methods include GET, POST, PUT, and DELETE. Two common ways to send requests are:
cURL : Direct command‑line requests.
Kibana DevTools : Interactive console for testing queries.
Index API
<code>PUT /test_index</code>
Creates an index with default settings (5 primary shards and 1 replica in versions before 7.0; 7.0 and later default to 1 primary shard).
Document API
Create a document with a specified ID:
<code>PUT /test_index/doc/1
{
"username": "alfred",
"age": 1
}</code>
Create a document without specifying an ID (auto-generated):
<code>POST /test_index/doc
{
"username": "tom",
"age": 20
}</code>
Get a document:
<code>GET /test_index/doc/1</code>
Search all documents:
<code>GET /test_index/doc/_search
{
"query": { "match_all": {} }
}</code>
Bulk operations (here one index action and one delete action in a single request):
<code>POST /_bulk
{ "index": { "_index": "test_index", "_type": "doc", "_id": "3" } }
{ "username": "alfred", "age": 10 }
{ "delete": { "_index": "test_index", "_type": "doc", "_id": "1" } }</code>
Bulk get documents:
<code>GET /_mget
{
"docs": [
{ "_index": "test_index", "_type": "doc", "_id": "1" },
{ "_index": "test_index", "_type": "doc", "_id": "2" }
]
}</code>
Search API
Two query contexts exist:
Query context : Calculates relevance scores and sorts results.
Filter context : Filters documents without scoring (cached for performance).
Typical search request:
<code>GET /test_index/_search
{
"query": { "match": { "remote_ip": "171.22.12.14" } }
}</code>
URI Search
Parameters include q (query string), df (default field), sort, from, size, etc.
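As a minimal sketch of how these parameters combine into one request line, the helper below assembles a URI search path with the standard library. The function name `build_uri_search` is hypothetical, and the index and field names are reused from the earlier examples; nothing is sent to a cluster.

```python
from urllib.parse import urlencode


def build_uri_search(index, q, df=None, sort=None, from_=0, size=10):
    """Return the path and query string for a URI search request."""
    params = {"q": q, "from": from_, "size": size}
    if df is not None:
        params["df"] = df  # default field used when q names no field
    if sort is not None:
        params["sort"] = sort  # e.g. "age:desc"
    return f"/{index}/_search?{urlencode(params)}"


url = build_uri_search("test_index", "alfred", df="username", sort="age:desc", size=5)
print(url)
```

The resulting path can be pasted after `GET` in Kibana DevTools or appended to the host in a cURL call.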
Query DSL
JSON‑based query language with two main families:
Term-level queries (e.g., term, range) that do not analyze the query text.
Full-text queries (e.g., match, match_phrase) that first analyze the query.
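The difference is easiest to see on a toy inverted index. The sketch below is not Elasticsearch code: `analyze` is a simplified stand-in for an analyzer (lowercasing and splitting only), and the sample documents are made up for illustration.

```python
import re


def analyze(text):
    """Toy analyzer: lowercase, then split on anything that is not a letter or digit."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


# Index a text field: only the analyzed tokens end up in the inverted index.
docs = {1: "Alfred Way", 2: "alfred", 3: "Tom"}
inverted_index = {}
for doc_id, text in docs.items():
    for token in analyze(text):
        inverted_index.setdefault(token, set()).add(doc_id)


def term_query(value):
    """Term-level lookup: the query text is used exactly as given."""
    return sorted(inverted_index.get(value, set()))


def match_query(text):
    """Full-text lookup: the query text is analyzed first, then tokens are OR-ed."""
    hits = set()
    for token in analyze(text):
        hits |= inverted_index.get(token, set())
    return sorted(hits)


print(term_query("Alfred"))      # the capitalized term is not in the index
print(term_query("alfred"))
print(match_query("Alfred Way"))
```

This is why a term query against a text field often returns nothing for mixed-case input, while the same input works with match.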
Common queries:
match: Full‑text search.
term: Exact term match.
range: Numeric or date ranges.
bool: Combines must, should, filter, and must_not clauses.
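A bool query might be assembled as in the sketch below. The helper `bool_query` is hypothetical, and `username.keyword` assumes the default dynamic mapping created a keyword sub-field; the output is only the JSON body, ready to send to `/test_index/_search`.

```python
import json


def bool_query(must=(), should=(), filter_=(), must_not=()):
    """Build a search body with a bool query, omitting empty clause lists."""
    clauses = {
        "must": list(must),          # scored, must match
        "should": list(should),      # scored, optional
        "filter": list(filter_),     # not scored, must match
        "must_not": list(must_not),  # not scored, must not match
    }
    return {"query": {"bool": {name: c for name, c in clauses.items() if c}}}


body = bool_query(
    must=[{"match": {"username": "alfred"}}],
    filter_=[{"range": {"age": {"gte": 10, "lte": 30}}}],
    must_not=[{"term": {"username.keyword": "tom"}}],
)
print(json.dumps(body, indent=2))
```

Only the must clause contributes to the score here; the range runs in filter context.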
Relevance Scoring
Elasticsearch scores with TF/IDF before 5.x and with BM25 by default from 5.x. Both weigh term frequency, inverse document frequency, and field length, among other factors. Because IDF is computed per shard by default, scores on small datasets can look inconsistent; set number_of_shards to 1 or use search_type=dfs_query_then_fetch to get global IDF values.
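To make the per-shard effect concrete, here is a sketch of the textbook BM25 formula with the common defaults k1 = 1.2 and b = 0.75. It is an approximation for illustration; the exact Lucene implementation differs in details between versions, and the document counts are invented.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """Inverse document frequency: rarer terms get a higher weight."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Score one term in one document using the classic BM25 formula."""
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return bm25_idf(doc_count, doc_freq) * tf * (k1 + 1) / (tf + length_norm)


# Six documents split over two shards; the term appears in four of them.
idf_shard_a = bm25_idf(doc_count=3, doc_freq=1)  # term is rare on shard A
idf_shard_b = bm25_idf(doc_count=3, doc_freq=3)  # term is common on shard B
idf_global = bm25_idf(doc_count=6, doc_freq=4)   # what a DFS phase would use

print(round(idf_shard_a, 3), round(idf_shard_b, 3), round(idf_global, 3))
```

The same term gets a very different weight on each shard, which is the inconsistency that a single shard or dfs_query_then_fetch removes.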
Sorting
Results can be sorted by relevance (_score), field values, or index order (_doc). To sort a text field lexically, sort on its keyword sub-field (e.g., field.keyword), because text fields are analyzed and are not sortable by default.
Pagination
From/Size : Simple pagination; deep pagination is limited by index.max_result_window (default 10,000).
Scroll : Efficiently iterates over large result sets using a server‑side snapshot; not real‑time.
search_after : Real‑time “next page” navigation using the sort values of the last hit; avoids deep pagination overhead.
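The search_after flow can be sketched without a cluster. In the example below, `fake_search` is a local stand-in for the search endpoint, and `user_id` is a hypothetical unique field used as a tiebreaker so that the sort order is total; both are assumptions for illustration.

```python
def next_page_body(last_hit, size=2):
    """Build the request body for the next page from the previous page's last hit."""
    body = {
        "size": size,
        "query": {"match_all": {}},
        "sort": [{"age": "asc"}, {"user_id": "asc"}],  # tiebreaker keeps order stable
    }
    if last_hit is not None:
        body["search_after"] = last_hit["sort"]
    return body


DOCS = [
    {"user_id": "a", "age": 10},
    {"user_id": "b", "age": 20},
    {"user_id": "c", "age": 20},
    {"user_id": "d", "age": 30},
    {"user_id": "e", "age": 5},
]


def fake_search(body):
    """Local stand-in for the search endpoint: sort, skip past search_after, cut to size."""
    ordered = sorted(DOCS, key=lambda d: (d["age"], d["user_id"]))
    after = body.get("search_after")
    if after is not None:
        ordered = [d for d in ordered if (d["age"], d["user_id"]) > tuple(after)]
    return [{"_source": d, "sort": [d["age"], d["user_id"]]} for d in ordered[: body["size"]]]


pages, last_hit = [], None
while True:
    hits = fake_search(next_page_body(last_hit))
    if not hits:
        break
    pages.append([hit["_source"]["user_id"] for hit in hits])
    last_hit = hits[-1]

print(pages)
```

Each request carries only the last sort values, so the cost per page stays flat no matter how deep the client goes.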
Aggregations
Aggregations provide analytics on indexed data. Four main types:
Metric : Calculations such as min, max, avg, sum, cardinality, stats, percentiles, top_hits.
Bucket : Group documents, e.g., terms, range, date_range, histogram, date_histogram.
Pipeline : Post-processing on aggregation results (e.g., derivative, moving_avg, max_bucket, min_bucket).
Matrix : Advanced multi‑dimensional analytics (not covered here).
Aggregations can be nested, allowing bucket‑plus‑metric combinations such as “average salary per job”.
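The "average salary per job" case might look like the sketch below: a terms bucket with an avg metric nested inside it. The field names `job.keyword` and `salary` and the sample employees are hypothetical; the second function computes the same result locally to show what the response represents.

```python
from collections import defaultdict


def avg_salary_per_job_body():
    """Request body: one bucket per job, with the average salary inside each bucket."""
    return {
        "size": 0,  # return aggregation results only, no hits
        "aggs": {
            "jobs": {
                "terms": {"field": "job.keyword"},
                "aggs": {"avg_salary": {"avg": {"field": "salary"}}},
            }
        },
    }


def avg_salary_per_job(docs):
    """Local equivalent of the aggregation above, for illustration only."""
    totals = defaultdict(lambda: [0.0, 0])
    for doc in docs:
        totals[doc["job"]][0] += doc["salary"]
        totals[doc["job"]][1] += 1
    return {job: total / count for job, (total, count) in totals.items()}


employees = [
    {"job": "engineer", "salary": 100},
    {"job": "engineer", "salary": 200},
    {"job": "designer", "salary": 90},
]
print(avg_salary_per_job(employees))
```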
Bucket Aggregations
terms: Groups by unique terms (or keywords).
range: Numeric intervals.
date_range: Date intervals with optional custom formats.
histogram: Fixed‑size numeric buckets.
date_histogram: Time‑based buckets (e.g., yearly).
Metric Aggregations
min, max, avg, sum, cardinality.
stats and extended_stats (the latter adds variance and standard deviation).
percentiles and percentile_ranks (approximate).
top_hits returns representative documents per bucket.
Pipeline Aggregations
Examples:
max_bucket / min_bucket: Finds the bucket with the highest/lowest metric.
derivative: Computes the derivative of a metric series.
moving_avg: Calculates a moving average.
avg_bucket: Computes the average of a metric across all buckets.
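What these pipeline aggregations compute can be shown on a plain list of bucket values. The sketch below works on invented monthly totals; `moving_avg` here is a simple trailing mean that includes the current bucket, and Elasticsearch's own windowing may differ in detail.

```python
# (bucket key, metric value) pairs, as a date_histogram with a sum metric might yield
buckets = [("2024-01", 100.0), ("2024-02", 150.0), ("2024-03", 120.0), ("2024-04", 180.0)]
values = [value for _, value in buckets]


def derivative(series):
    """Difference between each bucket and the previous one; the first has no value."""
    return [None] + [curr - prev for prev, curr in zip(series, series[1:])]


def moving_avg(series, window):
    """Trailing mean over up to `window` buckets, including the current one."""
    result = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


def max_bucket(pairs):
    """Bucket with the highest metric value."""
    return max(pairs, key=lambda pair: pair[1])


def avg_bucket(series):
    """Mean of the metric across all buckets."""
    return sum(series) / len(series)


print(derivative(values))
print(moving_avg(values, window=2))
print(max_bucket(buckets), avg_bucket(values))
```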
Aggregation Scope
By default, aggregations run on the query result set. Scope can be altered with:
filter: Applies a sub‑filter to a specific aggregation.
post_filter: Filters hits after aggregations have run.
global: Runs aggregation on all documents, ignoring the query.
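The three scope modifiers can appear in one request. The body below is a sketch with hypothetical field names (`job`, `job.keyword`, `salary`) and illustrative thresholds, built as a Python dictionary so it can be serialized and sent as-is.

```python
import json

body = {
    # The query defines the default scope for hits and aggregations.
    "query": {"match": {"job": "engineer"}},
    "aggs": {
        # global: ignores the query and runs over every document in the index.
        "all_docs": {
            "global": {},
            "aggs": {"avg_salary": {"avg": {"field": "salary"}}},
        },
        # filter: narrows only this aggregation to a subset of the query results.
        "high_paid": {
            "filter": {"range": {"salary": {"gte": 10000}}},
            "aggs": {"jobs": {"terms": {"field": "job.keyword"}}},
        },
    },
    # post_filter: trims the returned hits after aggregations were computed.
    "post_filter": {"term": {"job.keyword": "java engineer"}},
}

print(json.dumps(body, indent=2))
```

In this layout the hits are narrowed twice (query, then post_filter), while each aggregation sees a different slice of the data.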
Data Modeling
Effective Elasticsearch modeling follows three steps: conceptual, logical, and physical design. Key considerations include field types, indexing options, doc values, fielddata, and storage settings.
Mapping Field Settings
enabled: Whether the field is parsed and indexed at all; when false, the value is kept only in _source (true/false).
index: Build an inverted index (true/false).
index_options: Store docs, freqs, positions, or offsets.
norms: Store length normalization (true/false).
doc_values: Enable column-oriented storage for sorting/aggregations.
fielddata: Enable in-memory fielddata for text fields (true/false).
store: Store the original field value separately (true/false).
coerce: Auto‑convert types (true/false).
dynamic: Control automatic mapping (true, false, strict).
date_detection: Auto‑detect date strings (true/false).
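A mapping that exercises several of these settings might look like the sketch below. It follows the 6.x layout with a `doc` type, as in the other examples; the index purpose and the field names `raw_payload` and `session_token` are hypothetical, and the choices are illustrative, not recommendations.

```python
import json

mapping = {
    "mappings": {
        "doc": {
            "dynamic": "strict",      # reject documents with unmapped fields
            "date_detection": False,  # do not guess dates from strings
            "properties": {
                "username": {
                    "type": "text",
                    "norms": False,            # no length normalization for scoring
                    "index_options": "docs",   # record only which documents match
                    "fields": {"keyword": {"type": "keyword"}},  # for sorting/aggregations
                },
                "age": {"type": "long", "coerce": False},  # "20" as a string is rejected
                "session_token": {
                    "type": "keyword",
                    "index": False,       # not searchable
                    "doc_values": False,  # not sortable or aggregatable
                    "store": True,        # retrievable on its own, apart from _source
                },
                "raw_payload": {"type": "object", "enabled": False},  # kept in _source only
            },
        }
    }
}

print(json.dumps(mapping, indent=2))
```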
Handling Relationships
Elasticsearch does not support joins like relational databases. Two common approaches are:
Nested objects : Store related objects inside a single document; useful when parent and child are updated together.
Parent/Child : Separate documents linked via a join field; allows independent updates.
Example join mapping:
<code>PUT /blog_index_parent_child
{
"mappings": {
"doc": {
"properties": {
"join": {
"type": "join",
"relations": { "blog": "comment" }
}
}
}
}
}</code>
Parent document:
<code>PUT /blog_index_parent_child/doc/1
{
"title": "blog",
"join": "blog"
}</code>
Child document (routing set to parent ID):
<code>PUT /blog_index_parent_child/doc/comment-1?routing=1
{
"comment": "comment world",
"join": { "name": "comment", "parent": 1 }
}</code>
Queries:
parent_id: Finds children of a given parent.
has_child: Finds parents that have matching children.
has_parent: Finds children whose parent matches a query.
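The three query types can be sketched as request bodies against the blog/comment relation defined in the mapping example. The helper function names are hypothetical; the bodies are meant for `/blog_index_parent_child/_search`.

```python
def children_of(parent_id):
    """parent_id: all comments that belong to one blog document."""
    return {"query": {"parent_id": {"type": "comment", "id": parent_id}}}


def blogs_with_comment(text):
    """has_child: blogs that have at least one comment matching the text."""
    return {
        "query": {
            "has_child": {"type": "comment", "query": {"match": {"comment": text}}}
        }
    }


def comments_on_blog(title):
    """has_parent: comments whose parent blog matches the title."""
    return {
        "query": {
            "has_parent": {"parent_type": "blog", "query": {"match": {"title": title}}}
        }
    }


print(children_of("1"))
print(blogs_with_comment("world"))
print(comments_on_blog("blog"))
```

Note the asymmetry in parameter names: has_child takes `type`, while has_parent takes `parent_type`.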
Reindexing
Reindexing rebuilds data when mappings or settings change. Two APIs:
_update_by_query: Updates documents in place (e.g., increment a field).
_reindex: Copies data from a source index to a destination index, optionally filtering documents.
Both support asynchronous execution with wait_for_completion=false, returning a task ID that can be monitored via the _tasks API.
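As a rough sketch of both calls, the helpers below return the method, path, and body for each request without sending anything. The function names and the destination index `test_index_v2` are hypothetical; the script uses Painless to increment the `age` field from the earlier examples.

```python
def update_by_query(index, script_source, query):
    """Update matching documents in place, running as a background task."""
    path = f"/{index}/_update_by_query?wait_for_completion=false"
    body = {"script": {"lang": "painless", "source": script_source}, "query": query}
    return "POST", path, body


def reindex(source_index, dest_index, query=None):
    """Copy documents to another index, optionally only those matching a query."""
    source = {"index": source_index}
    if query is not None:
        source["query"] = query
    body = {"source": source, "dest": {"index": dest_index}}
    return "POST", "/_reindex?wait_for_completion=false", body


def task_status(task_id):
    """Poll a background task by the ID returned from either call above."""
    return "GET", f"/_tasks/{task_id}", None


print(update_by_query("test_index", "ctx._source.age++", {"term": {"username": "alfred"}}))
print(reindex("test_index", "test_index_v2", {"range": {"age": {"gte": 18}}}))
```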
Cluster Tuning Recommendations
Keep elasticsearch.yml minimal; use APIs for dynamic settings.
Set cluster.name, node.name, node.master / node.data, and bind network.host to a private IP.
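Configure discovery hosts and set discovery.zen.minimum_master_nodes to a quorum of master-eligible nodes, (N / 2) + 1 (for example, 2 with three master-eligible nodes), to avoid split-brain.
Keep the JVM heap at or below about 31 GB and at no more than roughly 50% of RAM, leaving the rest for the OS file cache.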
Size shards based on data volume (e.g., ≤15 GB for search workloads, ≤50 GB for log workloads).
Adjust refresh_interval (or disable with -1) and indices.memory.index_buffer_size to reduce refresh overhead.
Use async translog (index.translog.durability=async) and increase index.translog.flush_threshold_size to lower disk I/O, accepting that a crash can lose the most recent writes.
Set replicas to 0 during bulk ingestion, then add them afterward.
Balance shard allocation with index.routing.allocation.total_shards_per_node and monitor shard distribution.
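Two of the numbers in this list are simple arithmetic. The sketch below assumes the usual quorum rule for master-eligible nodes and the half-of-RAM, 31 GB cap for heap; the function names are hypothetical.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum of master-eligible nodes: more than half must agree."""
    return master_eligible_nodes // 2 + 1


def heap_size_gb(ram_gb, cap_gb=31):
    """Half of RAM for the JVM heap, capped at 31 GB; the rest is for file cache."""
    return min(ram_gb / 2, cap_gb)


for nodes in (3, 5):
    print(nodes, "master-eligible nodes -> quorum of", minimum_master_nodes(nodes))

for ram in (16, 64, 128):
    print(ram, "GB RAM -> heap of", heap_size_gb(ram), "GB")
```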
Write Performance Optimization
Client side: use multi‑threaded bulk requests.
Increase refresh_interval or disable refresh during heavy indexing.
Increase indices.memory.index_buffer_size to batch more documents before a refresh.
Set index.translog.durability=async and a larger index.translog.flush_threshold_size to reduce translog fsync frequency.
Temporarily set number_of_replicas=0 while loading data, then restore replicas.
Choose an appropriate number of primary shards; ensure even distribution across nodes.
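A bulk load along these lines might be prepared as in the sketch below. The settings bodies are meant for `PUT /test_index/_settings` before and after the load, and the restore values assume the defaults were in use beforehand; the helper names and the batch size are illustrative, and sending the payloads (for example from several worker threads) is left out.

```python
import json

# Applied before the load: no refresh, no replicas, relaxed translog fsync.
LOAD_SETTINGS = {
    "index.refresh_interval": "-1",
    "index.number_of_replicas": 0,
    "index.translog.durability": "async",
}

# Applied after the load; these values assume the defaults were in use before.
RESTORE_SETTINGS = {
    "index.refresh_interval": "1s",
    "index.number_of_replicas": 1,
    "index.translog.durability": "request",
}


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_payload(index, doc_type, docs):
    """Newline-delimited body for POST /_bulk: one action line, then one source line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


documents = [{"username": f"user{i}", "age": i} for i in range(5)]
payloads = [bulk_payload("test_index", "doc", batch) for batch in chunked(documents, 2)]
print(len(payloads), "bulk requests")
print(payloads[0], end="")
```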
Read Performance Optimization
Design data models that pre‑compute fields needed for scripts or aggregations.
Use filter context wherever possible; filters are cached and avoid scoring.
Avoid scripts in sorting or aggregations; store computed values as fields.
Profile slow queries with the profile API to identify bottlenecks.
Set an appropriate number of replicas to improve read throughput without over‑replicating.
Keep shard sizes reasonable (15 GB for search, 50 GB for logs) to maintain query speed.
Determining the Right Number of Shards
Measure the throughput of a single‑shard, single‑node index (e.g., 10 k writes/sec). Divide the required production throughput by this baseline, then add replicas and safety margins. Ensure each shard stays within the recommended size limits.
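The sizing steps can be turned into a small calculation. This sketch takes the larger of two estimates, one from throughput and one from shard size; the 20% headroom factor and the sample workload numbers are assumptions for illustration, while the 10 k writes/sec baseline and the 50 GB limit come from the text.

```python
import math


def primary_shard_count(target_writes_per_sec, single_shard_writes_per_sec,
                        total_data_gb, max_shard_gb, headroom=1.2):
    """Estimate primary shards from a single-shard benchmark and a shard size limit."""
    by_throughput = math.ceil(target_writes_per_sec * headroom / single_shard_writes_per_sec)
    by_size = math.ceil(total_data_gb / max_shard_gb)
    return max(by_throughput, by_size, 1)


# Log workload: 45k writes/sec needed, 10k writes/sec measured on one shard,
# 600 GB of data, shards kept at or below 50 GB.
shards = primary_shard_count(45_000, 10_000, total_data_gb=600, max_shard_gb=50)
print(shards)
```

Here the size limit dominates: throughput alone would need 6 shards, but 600 GB at 50 GB per shard needs 12. Replicas are then added on top of this primary count.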
Additional Resources
For further reading, consult the official Elasticsearch documentation, especially the sections on indexing, search, aggregations, and cluster management.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.