Big Data 58 min read

Master Elasticsearch: Core Concepts, APIs, Mapping, and Performance Tuning

This comprehensive guide explains Elasticsearch fundamentals—including documents, indices, nodes, clusters, REST and Document APIs, query DSL, relevance scoring, distributed architecture, real‑time indexing, search execution, pagination, scroll, aggregations, data modeling, mapping options, parent/child relationships, reindexing, and practical cluster and write/read performance optimizations.

Ops Development Stories

Elasticsearch Overview

Elasticsearch is a distributed search and analytics engine built on Apache Lucene. It stores data as JSON documents, organizes them into indices, and runs on a cluster of nodes.

Basic Concepts

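Document : A JSON object containing fields (e.g., <code>text</code>, <code>keyword</code>, <code>long</code>, <code>boolean</code>, <code>date</code>, <code>binary</code>, range types, etc.). Each document has a unique <code>_id</code> and metadata fields such as <code>_index</code>, <code>_type</code>, <code>_uid</code>, <code>_source</code>, and <code>_all</code> (disabled by default).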

Index : A collection of documents that share a mapping, identified by its name. Before 6.0 an index could contain multiple mapping types; indices created in 6.x allow only one, and types were removed in later releases.

Node : A running Elasticsearch instance that forms part of a cluster.

Cluster : A group of nodes that share the same cluster name and provide indexing and search services.

REST API

Elasticsearch exposes a RESTful HTTP API. Common methods include <code>GET</code>, <code>POST</code>, <code>PUT</code>, and <code>DELETE</code>. Two main interaction styles are:

cURL : Direct command‑line requests.

Kibana DevTools : Interactive console for testing queries.

Index API

<code>PUT /test_index</code>

Creates an index with default settings (5 primary shards and 1 replica in 6.x and earlier; from 7.0 the default is 1 primary shard and 1 replica).

Document API

Create a document with a specified ID:

<code>PUT /test_index/doc/1
{
  "username": "alfred",
  "age": 1
}</code>

Create a document without specifying an ID (auto‑generated):

<code>POST /test_index/doc
{
  "username": "tom",
  "age": 20
}</code>

Get a document:

<code>GET /test_index/doc/1</code>

Search all documents:

<code>GET /test_index/doc/_search
{
  "query": { "match_all": {} }
}</code>

Bulk operations (here, one index action and one delete action in a single request):

<code>POST /_bulk
{ "index": { "_index": "test_index", "_type": "doc", "_id": "3" } }
{ "username": "alfred", "age": 10 }
{ "delete": { "_index": "test_index", "_type": "doc", "_id": "1" } }</code>
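The bulk body is newline-delimited JSON: each action line is followed by a source line where the action needs one, and the body ends with a newline. The following is a minimal Python sketch of assembling such a body; <code>build_bulk_body</code> is a hypothetical helper written for illustration, not part of any client library.

```python
import json

def build_bulk_body(actions):
    """Serialize (action, metadata, source) tuples into a _bulk NDJSON body."""
    lines = []
    for action, meta, source in actions:
        lines.append(json.dumps({action: meta}))
        # A "delete" action has no source line; index/create/update do.
        if action != "delete":
            lines.append(json.dumps(source))
    # The bulk endpoint expects the body to end with a newline.
    return "\n".join(lines) + "\n"

body = build_bulk_body([
    ("index", {"_index": "test_index", "_type": "doc", "_id": "3"},
     {"username": "alfred", "age": 10}),
    ("delete", {"_index": "test_index", "_type": "doc", "_id": "1"}, None),
])
print(body)
```

The resulting string would be sent as the body of <code>POST /_bulk</code> with the content type <code>application/x-ndjson</code>.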

Bulk get documents:

<code>GET /_mget
{
  "docs": [
    { "_index": "test_index", "_type": "doc", "_id": "1" },
    { "_index": "test_index", "_type": "doc", "_id": "2" }
  ]
}</code>

Search API

Two query contexts exist:

Query context : Calculates relevance scores and sorts results.

Filter context : Filters documents without scoring (cached for performance).

Typical search request:

<code>GET /test_index/_search
{
  "query": { "match": { "remote_ip": "171.22.12.14" } }
}</code>

URI Search

Parameters include <code>q</code> (query string), <code>df</code> (default field), <code>sort</code>, <code>from</code>, <code>size</code>, etc.
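As a rough illustration of how these parameters combine into a request path, here is a small Python sketch. The helper name <code>uri_search_path</code> and the sample values are assumptions made for this example.

```python
from urllib.parse import urlencode

def uri_search_path(index, q, df=None, sort=None, from_=0, size=10):
    """Build the path and query string for a URI search request."""
    params = {"q": q, "from": from_, "size": size}
    if df:
        params["df"] = df      # field searched when q names no field
    if sort:
        params["sort"] = sort  # e.g. "age:desc"
    return f"/{index}/_search?" + urlencode(params)

path = uri_search_path("test_index", "alfred", df="username", sort="age:desc", size=5)
print(path)
```

URI search is handy for quick checks; anything beyond simple conditions is usually easier to express with the Query DSL.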

Query DSL

JSON‑based query language with two main families:

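Term-level queries (e.g., <code>term</code>, <code>range</code>) that do not analyze the query text; the value is matched against the indexed terms as given.

Full-text queries (e.g., <code>match</code>, <code>match_phrase</code>) that first run the query text through the field's analyzer.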

Common queries:

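<code>match</code> : Full-text search.

<code>term</code> : Exact term match.

<code>range</code> : Numeric or date ranges.

<code>bool</code> : Combines <code>must</code>, <code>should</code>, <code>filter</code>, and <code>must_not</code> clauses.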
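To show how these clauses fit together, the sketch below assembles a <code>bool</code> query body in Python. The helper <code>bool_query</code> and the field names (<code>username</code>, <code>username.keyword</code>, <code>age</code>) are illustrative assumptions.

```python
import json

def bool_query(must=None, should=None, filter_=None, must_not=None):
    """Assemble a bool query, omitting clause types that were not given."""
    clauses = {"must": must, "should": should, "filter": filter_, "must_not": must_not}
    return {"query": {"bool": {name: c for name, c in clauses.items() if c}}}

body = bool_query(
    # must: scored, full-text match (query context)
    must=[{"match": {"username": "alfred"}}],
    # filter: yes/no condition, not scored (filter context)
    filter_=[{"range": {"age": {"gte": 10, "lte": 30}}}],
    # must_not: excludes documents, also not scored
    must_not=[{"term": {"username.keyword": "tom"}}],
)
print(json.dumps(body, indent=2))
```

Putting the range condition under <code>filter</code> rather than <code>must</code> keeps it out of the relevance score.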

Relevance Scoring

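Elasticsearch uses TF/IDF (pre-5.x) and BM25 (default from 5.x) to compute scores based on term frequency, inverse document frequency, field length, and other factors. IDF is calculated per shard by default, so on small datasets the same document can score differently depending on which shard holds it. In that case, set <code>number_of_shards</code> to 1 or use <code>search_type=dfs_query_then_fetch</code> to get global IDF values.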
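The effect of shard-local statistics can be seen with a small calculation. This sketch uses the textbook Okapi BM25 formula with the default parameters k1 = 1.2 and b = 0.75; Lucene's implementation may differ in constant factors between versions, and the document counts here are invented for illustration.

```python
import math

K1, B = 1.2, 0.75  # default BM25 parameters

def bm25_idf(doc_count, doc_freq):
    """IDF part: terms that appear in few documents get a larger weight."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, doc_freq):
    """Score of one query term in one document (textbook BM25)."""
    length_norm = K1 * (1 - B + B * doc_len / avg_doc_len)
    return bm25_idf(doc_count, doc_freq) * freq * (K1 + 1) / (freq + length_norm)

# Same document and term, scored against two different sets of shard statistics.
shard_a = bm25_term_score(freq=2, doc_len=50, avg_doc_len=100, doc_count=10, doc_freq=1)
shard_b = bm25_term_score(freq=2, doc_len=50, avg_doc_len=100, doc_count=10, doc_freq=8)
print(round(shard_a, 3), round(shard_b, 3))
```

The term is rare on the first shard and common on the second, so the identical document scores higher on the first. With a single shard or <code>dfs_query_then_fetch</code>, both would use the same statistics.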

Sorting

Results can be sorted by relevance (<code>_score</code>), field values, or index order (<code>_doc</code>). Because <code>text</code> fields are analyzed into terms, sort them lexically through a keyword sub-field such as <code>field.keyword</code>.

Pagination

From/Size : Simple pagination; deep pagination is limited by <code>index.max_result_window</code> (default 10,000).

Scroll : Efficiently iterates over large result sets using a server‑side snapshot; not real‑time.

search_after : Real‑time “next page” navigation using the sort values of the last hit; avoids deep pagination overhead.
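The <code>search_after</code> flow can be sketched as follows: take the <code>sort</code> values of the last hit on the current page and pass them in the next request. The helper <code>next_page_body</code>, the sample hit, and the use of <code>_id</code> as a tiebreaker are assumptions for illustration.

```python
import copy

def next_page_body(base_body, last_hit):
    """Return a request body for the page that follows last_hit."""
    body = copy.deepcopy(base_body)
    # search_after takes the sort values of the last hit, in sort order.
    body["search_after"] = last_hit["sort"]
    # search_after replaces offset paging, so drop any "from" offset.
    body.pop("from", None)
    return body

base = {
    "size": 2,
    "query": {"match_all": {}},
    # A unique tiebreaker keeps the ordering stable between pages.
    "sort": [{"age": "desc"}, {"_id": "asc"}],
}
last_hit = {"_id": "7", "_source": {"age": 20}, "sort": [20, "7"]}
page2 = next_page_body(base, last_hit)
print(page2)
```

Each page costs about the same regardless of depth, because every shard only needs to collect <code>size</code> hits after the given sort values.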

Aggregations

Aggregations provide analytics on indexed data. Four main types:

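Metric : Calculations such as <code>min</code>, <code>max</code>, <code>avg</code>, <code>sum</code>, <code>cardinality</code>, <code>stats</code>, <code>percentiles</code>, <code>top_hits</code>.

Bucket : Group documents, e.g., <code>terms</code>, <code>range</code>, <code>date_range</code>, <code>histogram</code>, <code>date_histogram</code>.

Pipeline : Post-processing on aggregation results (e.g., <code>derivative</code>, <code>moving_avg</code>, <code>max_bucket</code>, <code>min_bucket</code>).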

Matrix : Advanced multi‑dimensional analytics (not covered here).

Aggregations can be nested, allowing bucket‑plus‑metric combinations such as “average salary per job”.
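A bucket-plus-metric combination such as "average salary per job" could be written along these lines. The helper name and the field names <code>job.keyword</code> and <code>salary</code> are assumptions for this sketch.

```python
import json

def avg_per_bucket(bucket_field, metric_field, size=10):
    """terms bucket on bucket_field with a nested avg metric on metric_field."""
    return {
        "size": 0,  # return aggregation results only, no hits
        "aggs": {
            "by_bucket": {
                "terms": {"field": bucket_field, "size": size},
                # Sub-aggregation: computed separately inside every bucket.
                "aggs": {"avg_metric": {"avg": {"field": metric_field}}},
            }
        },
    }

body = avg_per_bucket("job.keyword", "salary")
print(json.dumps(body, indent=2))
```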

Bucket Aggregations

<code>terms</code> : Groups by unique terms (or keywords).

<code>range</code> : Numeric intervals.

<code>date_range</code> : Date intervals with optional custom formats.

<code>histogram</code> : Fixed-size numeric buckets.

<code>date_histogram</code> : Time-based buckets (e.g., yearly).

Metric Aggregations

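<code>min</code>, <code>max</code>, <code>avg</code>, <code>sum</code>, <code>cardinality</code>.

<code>stats</code> and <code>extended_stats</code> (include variance, std-dev).

<code>percentiles</code> and <code>percentile_ranks</code> (approximate).

<code>top_hits</code> returns representative documents per bucket.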

Pipeline Aggregations

Examples:

<code>max_bucket</code> / <code>min_bucket</code> : Finds the bucket with the highest/lowest metric.

<code>derivative</code> : Computes the derivative of a metric series.

<code>moving_avg</code> : Calculates a moving average.

<code>avg_bucket</code> : Average of a metric across sibling buckets.
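To make the semantics concrete, the following Python sketch reproduces locally what three of these pipeline aggregations compute from bucket values. It is not Elasticsearch code, and the monthly figures are made up for illustration.

```python
def derivative(values):
    """Difference between consecutive buckets; the first has no predecessor."""
    return [None] + [cur - prev for prev, cur in zip(values, values[1:])]

def max_bucket(buckets):
    """Highest metric value and the key(s) of the bucket(s) holding it."""
    top = max(buckets.values())
    return {"value": top, "keys": [k for k, v in buckets.items() if v == top]}

def avg_bucket(buckets):
    """Mean of the metric across all sibling buckets."""
    return sum(buckets.values()) / len(buckets)

# Hypothetical monthly sums, as a date_histogram with a sum metric might return.
monthly = {"2024-01": 100.0, "2024-02": 150.0, "2024-03": 120.0}
print(derivative(list(monthly.values())))
print(max_bucket(monthly))
print(avg_bucket(monthly))
```

In a real request, each of these would reference the metric through a <code>buckets_path</code> rather than receive the values directly.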

Aggregation Scope

By default, aggregations run on the query result set. Scope can be altered with:

<code>filter</code> : Applies a sub-filter to a specific aggregation.

<code>post_filter</code> : Filters hits after aggregations have run.

<code>global</code> : Runs aggregation on all documents, ignoring the query.
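The three scopes can appear in one request. The sketch below builds such a body in Python; all field names (<code>job</code>, <code>age</code>, <code>salary</code>, <code>gender</code>) and aggregation names are hypothetical.

```python
import json

avg_salary = {"avg_salary": {"avg": {"field": "salary"}}}

body = {
    "query": {"match": {"job": "engineer"}},
    "aggs": {
        # filter: narrows the query results for this aggregation only
        "senior": {"filter": {"range": {"age": {"gte": 40}}}, "aggs": avg_salary},
        # global: ignores the query and sees every document in the index
        "everyone": {"global": {}, "aggs": avg_salary},
        # default scope: runs on the query results
        "jobs": {"terms": {"field": "job.keyword"}},
    },
    # post_filter: trims the returned hits, leaving the aggregations untouched
    "post_filter": {"term": {"gender": "female"}},
}
print(json.dumps(body, indent=2))
```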

Data Modeling

Effective Elasticsearch modeling follows three steps: conceptual, logical, and physical design. Key considerations include field types, indexing options, doc values, fielddata, and storage settings.

Mapping Field Settings

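<code>enabled</code> : Whether the field is parsed and indexed at all (true/false); when false, the value is kept only in <code>_source</code>.

<code>index</code> : Build an inverted index so the field is searchable (true/false).

<code>index_options</code> : What the inverted index records: docs, freqs, positions, or offsets.

<code>norms</code> : Store length normalization factors used in scoring (true/false).

<code>doc_values</code> : Enable column-oriented storage for sorting/aggregations.

<code>fielddata</code> : Enable in-memory fielddata for <code>text</code> fields (true/false).

<code>store</code> : Store the original field value separately from <code>_source</code> (true/false).

<code>coerce</code> : Auto-convert values to the mapped type, e.g., the string "10" to the number 10 (true/false).

<code>dynamic</code> : Control automatic mapping of new fields (<code>true</code>, <code>false</code>, <code>strict</code>).

<code>date_detection</code> : Auto-detect date strings (true/false).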
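A mapping that uses several of these settings might look like the sketch below, written as a Python dictionary that would be sent with <code>PUT /&lt;index&gt;</code>. The field names are invented, and the <code>doc</code> type level follows the 6.x-style examples in this article; later versions drop it.

```python
import json

mapping = {
    "mappings": {
        "doc": {  # 6.x-style type name, as in the other examples here
            "dynamic": "strict",      # reject documents with unmapped fields
            "date_detection": False,  # never guess dates from strings
            "properties": {
                "username": {"type": "keyword"},
                # Filter-only text: skip scoring norms, index only docs + freqs.
                "bio": {"type": "text", "norms": False, "index_options": "freqs"},
                # Reject "10" instead of silently converting it to 10.
                "age": {"type": "long", "coerce": False},
                # Returned with _source but never searched, sorted, or aggregated.
                "avatar_url": {"type": "keyword", "index": False, "doc_values": False},
                # Opaque JSON blob: stored in _source, not parsed at all.
                "raw_payload": {"type": "object", "enabled": False},
            },
        }
    }
}
print(json.dumps(mapping, indent=2))
```

Turning off features a field does not need reduces index size and indexing work; the trade-off is that enabling them later requires a reindex.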

Handling Relationships

Elasticsearch does not support joins like relational databases. Two common approaches are:

Nested objects : Store related objects inside a single document; useful when parent and child are updated together.

Parent/Child : Separate documents linked via a <code>join</code> field; allows independent updates.

Example <code>join</code> mapping:

<code>PUT /blog_index_parent_child
{
  "mappings": {
    "doc": {
      "properties": {
        "join": {
          "type": "join",
          "relations": { "blog": "comment" }
        }
      }
    }
  }
}</code>

Parent document:

<code>PUT /blog_index_parent_child/doc/1
{
  "title": "blog",
  "join": "blog"
}</code>

Child document (routing set to parent ID):

<code>PUT /blog_index_parent_child/doc/comment-1?routing=1
{
  "comment": "comment world",
  "join": { "name": "comment", "parent": 1 }
}</code>

Queries:

<code>parent_id</code> – find children of a given parent.

<code>has_child</code> – find parents that have matching children.

<code>has_parent</code> – find children whose parent matches a query.
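Against the <code>blog</code>/<code>comment</code> relation defined above, the three queries could be built as in this sketch. The Python helper names are hypothetical; the query keys (<code>type</code>, <code>parent_type</code>, <code>id</code>, <code>query</code>) follow the join-field query DSL.

```python
import json

def parent_id_query(child_type, parent_id):
    """Children of one specific parent document."""
    return {"query": {"parent_id": {"type": child_type, "id": parent_id}}}

def has_child_query(child_type, child_query):
    """Parents that have at least one child matching child_query."""
    return {"query": {"has_child": {"type": child_type, "query": child_query}}}

def has_parent_query(parent_type, parent_query):
    """Children whose parent matches parent_query."""
    return {"query": {"has_parent": {"parent_type": parent_type, "query": parent_query}}}

comments_of_blog_1 = parent_id_query("comment", "1")
blogs_with_world = has_child_query("comment", {"match": {"comment": "world"}})
comments_on_blog = has_parent_query("blog", {"match": {"title": "blog"}})
print(json.dumps(blogs_with_world, indent=2))
```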

Reindexing

Reindexing rebuilds data when mappings or settings change. Two APIs:

<code>_update_by_query</code> : Updates documents in place (e.g., increment a field).

<code>_reindex</code> : Copies data from a source index to a destination index, optionally filtering documents.

Both support asynchronous execution with <code>wait_for_completion=false</code>, returning a task ID that can be monitored via the <code>_tasks</code> API.
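A filtered, asynchronous reindex request could be assembled as in this sketch. The index names, the filter, and the helper <code>reindex_request</code> are assumptions for illustration.

```python
from urllib.parse import urlencode

def reindex_request(source_index, dest_index, query=None, wait=False):
    """Return (path, body) for a _reindex call."""
    source = {"index": source_index}
    if query is not None:
        source["query"] = query  # copy only the documents matching this query
    params = urlencode({"wait_for_completion": str(wait).lower()})
    return f"/_reindex?{params}", {"source": source, "dest": {"index": dest_index}}

path, body = reindex_request(
    "blog_index", "blog_new_index", query={"term": {"user": "tom"}}
)
print("POST", path)
print(body)
```

With <code>wait_for_completion=false</code> the response contains a task ID, which can then be polled with <code>GET /_tasks/&lt;task_id&gt;</code>.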

Cluster Tuning Recommendations

Keep <code>elasticsearch.yml</code> minimal; use APIs for dynamic settings.

Set <code>cluster.name</code>, <code>node.name</code>, <code>node.master</code> / <code>node.data</code>, and bind <code>network.host</code> to a private IP.

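Configure discovery hosts and set <code>discovery.zen.minimum_master_nodes</code> to a quorum of master-eligible nodes, (N / 2) + 1 (for example, 2 with three master-eligible nodes), to avoid split-brain. This setting applies to 6.x and earlier; from 7.0 the cluster manages the quorum itself.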

Keep the JVM heap at or below about 31 GB so that compressed object pointers stay enabled, and leave roughly 50 % of RAM to the OS file cache.

Size shards based on data volume (e.g., ≤15 GB for search workloads, ≤50 GB for log workloads).

Adjust <code>refresh_interval</code> (or disable with <code>-1</code>) and <code>indices.memory.index_buffer_size</code> to reduce refresh overhead.

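Use async translog (<code>index.translog.durability=async</code>) and increase <code>index.translog.flush_threshold_size</code> to lower disk I/O, accepting that the most recent writes can be lost if a node crashes before the next fsync.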

Set replicas to 0 during bulk ingestion, then add them afterward.

Balance shard allocation with <code>index.routing.allocation.total_shards_per_node</code> and monitor shard distribution.

Write Performance Optimization

Client side: use multi‑threaded bulk requests.

Increase <code>refresh_interval</code> or disable refresh during heavy indexing.

Increase <code>indices.memory.index_buffer_size</code> to batch more documents before a refresh.

Set <code>index.translog.durability=async</code> and a larger <code>index.translog.flush_threshold_size</code> to reduce translog fsync frequency.

Temporarily set <code>number_of_replicas=0</code> while loading data, then restore replicas.

Choose an appropriate number of primary shards; ensure even distribution across nodes.
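The refresh and replica changes are usually applied as a pair of settings updates around the load. The sketch below produces the two bodies for <code>PUT /&lt;index&gt;/_settings</code>; the helper name and the restore values (1 replica, <code>1s</code> refresh) are assumptions to be replaced with the index's normal configuration.

```python
def bulk_load_settings(replicas_after=1, refresh_after="1s"):
    """Return (before, after) settings bodies for a heavy bulk load."""
    # Before loading: no periodic refresh, no replica copies to write.
    before = {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}
    # After loading: restore normal search visibility and redundancy.
    after = {
        "index": {
            "refresh_interval": refresh_after,
            "number_of_replicas": replicas_after,
        }
    }
    return before, after

before, after = bulk_load_settings(replicas_after=1)
print("before load:", before)
print("after load: ", after)
```

While replicas are at 0 the index has no redundancy, so this pattern fits initial loads that can be rerun if a node fails.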

Read Performance Optimization

Design data models that pre‑compute fields needed for scripts or aggregations.

Use filter context wherever possible; filters are cached and avoid scoring.

Avoid scripts in sorting or aggregations; store computed values as fields.

Profile slow queries by adding <code>"profile": true</code> to the search request to identify bottlenecks.

Set an appropriate number of replicas to improve read throughput without over‑replicating.

Keep shard sizes reasonable (up to roughly 15 GB for search, 50 GB for logs) to maintain query speed.

Determining the Right Number of Shards

Measure the throughput of a single‑shard, single‑node index (e.g., 10 k writes/sec). Divide the required production throughput by this baseline, then add replicas and safety margins. Ensure each shard stays within the recommended size limits.
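The sizing procedure can be sketched as a small calculation. The function name, the 20 % safety margin, and the sample numbers are assumptions for this example; the baseline must come from your own benchmark.

```python
import math

def estimate_primary_shards(required_wps, baseline_wps, total_gb, max_shard_gb, margin=1.2):
    """Primary shards needed to meet both write throughput and shard-size limits."""
    # Throughput: required rate divided by the single-shard baseline, plus margin.
    by_throughput = math.ceil(required_wps / baseline_wps * margin)
    # Size: enough shards that none exceeds the recommended maximum.
    by_size = math.ceil(total_gb / max_shard_gb)
    return max(1, by_throughput, by_size)

# Log workload: 45k writes/sec needed, 10k/sec per shard, 600 GB, 50 GB per shard.
primaries = estimate_primary_shards(45_000, 10_000, total_gb=600, max_shard_gb=50)
replicas = 1
total_shards = primaries * (1 + replicas)
print(primaries, total_shards)
```

Here the size limit (12 shards) dominates the throughput requirement (6 shards), so the estimate is 12 primaries, or 24 shards in total with one replica each.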

Additional Resources

For further reading, consult the official Elasticsearch documentation, especially the sections on indexing, search, aggregations, and cluster management.

Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
