
Master Elasticsearch: Core Concepts, APIs, Mapping, and Performance Tuning

This comprehensive guide explains Elasticsearch fundamentals—including documents, indices, nodes, clusters, REST and Document APIs, query DSL, relevance scoring, distributed architecture, real‑time indexing, search execution, pagination, scroll, aggregations, data modeling, mapping options, parent/child relationships, reindexing, and practical cluster and write/read performance optimizations.

Ops Development Stories

Feb 24, 2022

Elasticsearch Overview

Elasticsearch is a distributed search and analytics engine built on Apache Lucene. It stores data as JSON documents, organizes them into indices, and runs on a cluster of nodes.

Basic Concepts

Document: A JSON object made up of typed fields (for example text, keyword, long, boolean, date, binary, and range types). Each document has a unique _id and carries metadata fields such as _index, _type, _uid, _source, and _all (disabled by default).

Index: A collection of documents that share a mapping, identified by its name. Before 6.0 an index could hold multiple types; 6.x allows only one type per index, and types are deprecated from 7.0.

Node: A running Elasticsearch instance that forms part of a cluster.

Cluster: A group of nodes that share the same cluster name and provide indexing and search services.

REST API

Elasticsearch exposes a RESTful HTTP API. Common methods include GET, POST, PUT, DELETE. Two main interaction styles are:

cURL: Direct command-line requests.

Kibana DevTools: Interactive console for testing queries.

Index API

PUT /test_index

Creates an index with default settings (5 primary shards and 1 replica before 7.0; 1 primary shard and 1 replica from 7.0 onward).

Document API

Create a document with a specified ID:

PUT /test_index/doc/1
{
  "username": "alfred",
  "age": 1
}

Create a document without specifying an ID (auto‑generated):

POST /test_index/doc
{
  "username": "tom",
  "age": 20
}

Get a document:

GET /test_index/doc/1

Search all documents:

GET /test_index/doc/_search
{
  "query": { "match_all": {} }
}

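Bulk operations (this request indexes one document and deletes another; each action and each document sits on its own line):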

POST /_bulk
{ "index": { "_index": "test_index", "_type": "doc", "_id": "3" } }
{ "username": "alfred", "age": 10 }
{ "delete": { "_index": "test_index", "_type": "doc", "_id": "1" } }

Bulk get documents:

GET /_mget
{
  "docs": [
    { "_index": "test_index", "_type": "doc", "_id": "1" },
    { "_index": "test_index", "_type": "doc", "_id": "2" }
  ]
}
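
The _bulk body is newline-delimited JSON rather than a single JSON object, which is easy to get wrong when building it by hand. The sketch below assembles the same request as the bulk example above; the helper name build_bulk_body is hypothetical, and nothing is sent to a cluster.

```python
import json

def build_bulk_body(operations):
    """Serialize (action, source) pairs into newline-delimited JSON for _bulk.

    `source` is None for actions such as delete that carry no document.
    """
    lines = []
    for action, source in operations:
        lines.append(json.dumps(action))
        if source is not None:
            lines.append(json.dumps(source))
    # The bulk body has to end with a newline character.
    return "\n".join(lines) + "\n"

operations = [
    ({"index": {"_index": "test_index", "_type": "doc", "_id": "3"}},
     {"username": "alfred", "age": 10}),
    ({"delete": {"_index": "test_index", "_type": "doc", "_id": "1"}}, None),
]
bulk_body = build_bulk_body(operations)
print(bulk_body, end="")
```

The resulting string would be sent as the body of POST /_bulk with the Content-Type header set to application/x-ndjson.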

Search API

Two query contexts exist:

Query context: Calculates a relevance score for each matching document and sorts results by it.

Filter context: Decides only whether a document matches, without scoring; frequently used filters are cached for performance.

Typical search request:

GET /test_index/_search
{
  "query": { "match": { "remote_ip": "171.22.12.14" } }
}

URI Search

Parameters include q (query string), df (default field), sort, from, size, etc.
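
As a rough illustration of how these parameters combine, the following sketch builds a URI search path with the standard library. The helper name uri_search_path and the field values are assumptions chosen to match the earlier test_index examples.

```python
from urllib.parse import urlencode

def uri_search_path(index, params):
    """Build a URI-search path of the form /<index>/_search?<query string>."""
    return "/" + index + "/_search?" + urlencode(params)

# q: query string, df: default field, sort: field:direction, from/size: paging.
params = {"q": "alfred", "df": "username", "sort": "age:asc", "from": 0, "size": 10}
path = uri_search_path("test_index", params)
print("GET " + path)
```

URI search is handy for quick checks; anything beyond simple conditions is usually easier to express in the JSON request body.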

Query DSL

JSON‑based query language with two main families:

Term-level queries (e.g., term, range) that do not analyze the query text.

Full-text queries (e.g., match, match_phrase) that first analyze the query text.

Common queries:

match: Full-text search.

term: Exact term match.

range: Numeric or date ranges.

bool: Combines must, should, filter, and must_not clauses.
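
A bool query typically puts scored full-text clauses in must and unscored conditions in filter. The sketch below assembles such a request body as a plain dictionary; the helper name bool_query is hypothetical, and the username and age fields are borrowed from the earlier examples.

```python
import json

def bool_query(must=None, filter_=None, should=None, must_not=None):
    """Assemble a bool query body, leaving out clause types that are empty."""
    clauses = {"must": must, "filter": filter_, "should": should, "must_not": must_not}
    return {"query": {"bool": {name: value for name, value in clauses.items() if value}}}

body = bool_query(
    must=[{"match": {"username": "alfred"}}],             # analyzed and scored
    filter_=[{"range": {"age": {"gte": 10, "lte": 30}}}],  # yes/no, not scored
)
print(json.dumps(body, indent=2))
```

The body would be sent with GET /test_index/_search.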

Relevance Scoring

Elasticsearch scored with TF/IDF before 5.x and uses BM25 by default from 5.x on. Both weigh term frequency, inverse document frequency, and field length. Each shard computes IDF from its own documents, so scores on a small index spread across several shards can look inconsistent. For small datasets, set number_of_shards to 1 or use search_type=dfs_query_then_fetch to score with global IDF values.
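
To make the effect of shard-local statistics concrete, here is a small sketch of the textbook BM25 formula with the commonly cited defaults k1 = 1.2 and b = 0.75. It is an illustration rather than Lucene's exact implementation, which may differ in constant factors, and the shard document counts are made up.

```python
import math

def bm25_idf(doc_count, doc_freq):
    """Inverse document frequency: rarer terms get a higher weight."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_score(freq, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Score one term in one document with textbook BM25."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    tf_part = freq * (k1 + 1) / (freq + k1 * length_norm)
    return bm25_idf(doc_count, doc_freq) * tf_part

# Hypothetical index split over two shards of 10 documents each.
idf_shard_a = bm25_idf(doc_count=10, doc_freq=1)  # term is rare on shard A
idf_shard_b = bm25_idf(doc_count=10, doc_freq=5)  # term is common on shard B
idf_global = bm25_idf(doc_count=20, doc_freq=6)   # statistics over both shards

print(idf_shard_a, idf_shard_b, idf_global)
```

The same document would receive a noticeably different score depending on which shard it lives on, which is the inconsistency that a single shard or dfs_query_then_fetch removes.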

Sorting

Results can be sorted by relevance (_score), field values, or index order (_doc). A text field cannot be sorted directly because fielddata is disabled on it by default; sort on its keyword sub-field (field.keyword) instead.

Pagination

From/Size: Simple offset pagination; from + size may not exceed index.max_result_window (default 10,000), which rules out deep pagination.

Scroll: Efficiently iterates over large result sets using a server-side snapshot; not real-time.

search_after: Real-time "next page" navigation using the sort values of the last hit; avoids deep pagination overhead.
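
The search_after idea can be modeled in memory: sort by a field plus a unique tiebreaker, then ask only for hits whose sort key comes after the last one seen. The sketch below is a simulation over a Python list, not a client call; in a real request the same values would go into the sort and search_after keys of the search body.

```python
def search_after_page(docs, size, after=None):
    """Return one page ordered by (age, _id), starting after the `after` sort key."""
    ordered = sorted(docs, key=lambda d: (d["age"], d["_id"]))
    if after is not None:
        ordered = [d for d in ordered if (d["age"], d["_id"]) > tuple(after)]
    return ordered[:size]

# Hypothetical documents; _id acts as the unique tiebreaker.
docs = [{"_id": str(i), "age": 20 + i % 3} for i in range(7)]

seen, after = [], None
while True:
    page = search_after_page(docs, size=3, after=after)
    if not page:
        break
    seen.extend(d["_id"] for d in page)
    last = page[-1]
    after = [last["age"], last["_id"]]  # sort values of the last hit

print(seen)
```

Because each page depends only on the last sort key, no server-side state is held between pages and the cost does not grow with page depth.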

Aggregations

Aggregations provide analytics on indexed data. Four main types:

Metric: Calculations such as min, max, avg, sum, cardinality, stats, percentiles, top_hits.

Bucket: Group documents, e.g., terms, range, date_range, histogram, date_histogram.

Pipeline: Post-processing on aggregation results (e.g., derivative, moving_avg, max_bucket, min_bucket).

Matrix: Advanced multi-dimensional analytics (not covered here).

Aggregations can be nested, allowing bucket‑plus‑metric combinations such as “average salary per job”.
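
As a sketch of the "average salary per job" case, the request body below nests an avg metric inside a terms bucket, and a pure-Python function computes the equivalent result on sample data. The field names job.keyword and salary and the sample documents are assumptions for illustration.

```python
from collections import defaultdict

# Request body: one bucket per job, with the average salary inside each bucket.
avg_salary_per_job_request = {
    "size": 0,  # return aggregation results only, no hits
    "aggs": {
        "jobs": {
            "terms": {"field": "job.keyword"},
            "aggs": {"avg_salary": {"avg": {"field": "salary"}}},
        }
    },
}

def average_salary_per_job(docs):
    """Local equivalent of the nested aggregation: group by job, then average."""
    totals = defaultdict(lambda: [0, 0])  # job -> [salary sum, document count]
    for doc in docs:
        totals[doc["job"]][0] += doc["salary"]
        totals[doc["job"]][1] += 1
    return {job: total / count for job, (total, count) in totals.items()}

sample = [
    {"job": "dev", "salary": 100},
    {"job": "dev", "salary": 200},
    {"job": "ops", "salary": 90},
]
averages = average_salary_per_job(sample)
print(averages)
```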

Bucket Aggregations

terms: Groups by unique terms (or keywords).

range: Numeric intervals.

date_range: Date intervals with optional custom formats.

histogram: Fixed-size numeric buckets.

date_histogram: Time-based buckets (e.g., yearly).

Metric Aggregations

min, max, avg, sum, cardinality: Single-value metrics.

stats and extended_stats: Several statistics at once; extended_stats adds variance and standard deviation.

percentiles and percentile_ranks: Approximate percentile calculations.

top_hits: Returns representative documents per bucket.

Pipeline Aggregations

Examples:

max_bucket / min_bucket: Find the bucket with the highest or lowest metric value.

derivative: Computes the derivative of a metric series.

moving_avg: Calculates a moving average.

avg_bucket: Computes the average of a metric across sibling buckets.

Aggregation Scope

By default, aggregations run on the query result set. Scope can be altered with:

filter: Applies a sub-filter to a specific aggregation.

post_filter: Filters hits after aggregations have run.

global: Runs an aggregation on all documents, ignoring the query.
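
The three scoping options can appear in a single request. The body below is a sketch with assumed field names (username, age) from the earlier examples: one aggregation narrowed by filter, one widened by global, and a post_filter that trims the hits without touching either aggregation.

```python
import json

scoped_request = {
    "query": {"match": {"username": "alfred"}},
    "aggs": {
        # Narrower than the query: only matching documents with age >= 18.
        "adults": {
            "filter": {"range": {"age": {"gte": 18}}},
            "aggs": {"avg_age": {"avg": {"field": "age"}}},
        },
        # Wider than the query: every document in the index.
        "everyone": {
            "global": {},
            "aggs": {"avg_age": {"avg": {"field": "age"}}},
        },
    },
    # Applied to the returned hits only, after aggregations are computed.
    "post_filter": {"range": {"age": {"lt": 30}}},
}
print(json.dumps(scoped_request, indent=2))
```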

Data Modeling

Effective Elasticsearch modeling follows three steps: conceptual, logical, and physical design. Key considerations include field types, indexing options, doc values, fielddata, and storage settings.

Mapping Field Settings

enabled: Whether the field is parsed and indexed at all (true/false); when false, the value is kept only in _source.

index: Build an inverted index so the field is searchable (true/false).

index_options: What the inverted index records: docs, freqs, positions, or offsets.

norms: Store length-normalization factors used in scoring (true/false).

doc_values: Enable column-oriented storage for sorting and aggregations (true/false).

fielddata: Enable in-memory fielddata for text fields (true/false).

store: Store the original field value separately from _source (true/false).

coerce: Auto-convert mismatched value types (true/false).

dynamic: Control automatic mapping of new fields (true, false, strict).

date_detection: Auto-detect date strings when mapping new fields (true/false).
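
A mapping that exercises several of these settings might look like the sketch below, written in the 6.x style with a doc type to match the rest of this article. The index and field names are hypothetical, and the chosen values are examples rather than recommendations.

```python
import json

mapping_body = {
    "mappings": {
        "doc": {
            "dynamic": "strict",       # reject documents with unmapped fields
            "date_detection": False,   # do not guess dates from strings
            "properties": {
                # Exact-value field used for sorting and aggregations.
                "username": {"type": "keyword", "doc_values": True},
                # Full-text field that skips length normalization and positions.
                "bio": {"type": "text", "norms": False, "index_options": "freqs"},
                # Reject values such as "20" instead of converting them.
                "age": {"type": "integer", "coerce": False},
                # Kept in _source only; never parsed or searchable.
                "raw_payload": {"type": "object", "enabled": False},
                # Returned with the document but neither searchable nor sortable.
                "avatar_url": {"type": "keyword", "index": False, "doc_values": False},
            },
        }
    }
}
print(json.dumps(mapping_body, indent=2))
```

The body would be sent with a request such as PUT /user_index, where user_index is a placeholder name.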

Handling Relationships

Elasticsearch does not support joins like relational databases. Two common approaches are:

Nested objects: Store related objects inside a single document; useful when parent and child are updated together.

Parent/Child: Separate documents linked via a join field; allows independent updates.

Example join mapping:

PUT /blog_index_parent_child
{
  "mappings": {
    "doc": {
      "properties": {
        "join": {
          "type": "join",
          "relations": { "blog": "comment" }
        }
      }
    }
  }
}

Parent document:

PUT /blog_index_parent_child/doc/1
{
  "title": "blog",
  "join": "blog"
}

Child document (routing set to parent ID):

PUT /blog_index_parent_child/doc/comment-1?routing=1
{
  "comment": "comment world",
  "join": { "name": "comment", "parent": 1 }
}

Queries:

parent_id: Find children of a given parent.

has_child: Find parents that have matching children.

has_parent: Find children whose parent matches a query.
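
The three join queries could be written against the blog_index_parent_child example roughly as follows. The bodies are plain dictionaries, and the match values are taken from the sample parent and child documents above.

```python
import json

join_queries = {
    # Children (comments) of the blog document whose _id is 1.
    "parent_id": {"query": {"parent_id": {"type": "comment", "id": "1"}}},
    # Parents (blogs) that have at least one comment matching "world".
    "has_child": {
        "query": {
            "has_child": {
                "type": "comment",
                "query": {"match": {"comment": "world"}},
            }
        }
    },
    # Children (comments) whose parent blog has a title matching "blog".
    "has_parent": {
        "query": {
            "has_parent": {
                "parent_type": "blog",
                "query": {"match": {"title": "blog"}},
            }
        }
    },
}

for name, body in join_queries.items():
    print(name, json.dumps(body))
```

Each body would be sent with GET /blog_index_parent_child/_search.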

Reindexing

Reindexing rebuilds data when mappings or settings change. Two APIs:

_update_by_query: Updates documents in place (e.g., increment a field).

_reindex: Copies data from a source index to a destination index, optionally filtering documents.

Both support asynchronous execution with wait_for_completion=false, returning a task ID that can be monitored via the _tasks API.
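
An asynchronous reindex with a filter could be described as in the sketch below, which only builds the request pieces and the task polling path without calling a cluster. The helper names, the index names, and the task ID are hypothetical; a real task ID comes back in the response to the reindex call.

```python
def reindex_request(source_index, dest_index, query=None):
    """Describe an asynchronous _reindex call, optionally filtered by a query."""
    source = {"index": source_index}
    if query is not None:
        source["query"] = query  # copy only the documents that match
    return {
        "method": "POST",
        "path": "/_reindex?wait_for_completion=false",
        "body": {"source": source, "dest": {"index": dest_index}},
    }

def task_status_path(task_id):
    """Path to poll for the state of a background task."""
    return "/_tasks/" + task_id

request = reindex_request(
    "test_index", "test_index_v2",
    query={"range": {"age": {"gte": 10}}},
)
status_path = task_status_path("node-1:42")  # placeholder task ID
print(request["method"], request["path"])
print("GET", status_path)
```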

Cluster Tuning Recommendations

Keep elasticsearch.yml minimal; use APIs for dynamic settings.

Set cluster.name, node.name, node.master / node.data, and bind network.host to a private IP.

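Configure the discovery hosts and, on versions before 7.0, set discovery.zen.minimum_master_nodes to a quorum of master-eligible nodes, (N / 2) + 1 (for example 2 with three master-eligible nodes), to avoid split-brain.

Keep the JVM heap at or below about 31 GB and at no more than roughly 50 % of RAM, leaving the rest for the OS file cache.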

Size shards based on data volume (e.g., ≤15 GB for search workloads, ≤50 GB for log workloads).

Adjust refresh_interval (or disable with -1) and indices.memory.index_buffer_size to reduce refresh overhead.

Use the async translog (index.translog.durability=async) and increase index.translog.flush_threshold_size to lower disk I/O, accepting that a crash can lose the most recent writes that were not yet fsynced.

Set replicas to 0 during bulk ingestion, then add them afterward.

Balance shard allocation with index.routing.allocation.total_shards_per_node and monitor shard distribution.
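
Two of the recommendations above reduce to simple arithmetic. The sketch below computes a master quorum and a heap size under the usual rules of thumb (a majority of master-eligible nodes; half of RAM, capped at 31 GB); the function names are hypothetical and the results are starting points rather than guarantees.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum size: a strict majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1

def recommended_heap_gb(ram_gb, heap_cap_gb=31):
    """Half of physical RAM, capped at 31 GB; the rest is left to the OS file cache."""
    return min(ram_gb / 2, heap_cap_gb)

for nodes in (3, 5):
    print(nodes, "master-eligible nodes -> quorum", minimum_master_nodes(nodes))
for ram in (16, 64, 128):
    print(ram, "GB RAM -> heap", recommended_heap_gb(ram), "GB")
```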

Write Performance Optimization

Client side: use multi‑threaded bulk requests.

Increase refresh_interval or disable refresh during heavy indexing.

Increase indices.memory.index_buffer_size to batch more documents before a refresh.

Set index.translog.durability=async and a larger index.translog.flush_threshold_size to reduce translog fsync frequency.

Temporarily set number_of_replicas=0 while loading data, then restore replicas.

Choose an appropriate number of primary shards; ensure even distribution across nodes.
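
The write-side tips above can be combined into a load routine: relax the index settings, send bulk batches from several threads, then restore the settings. The sketch below models that flow with a stub in place of the HTTP call; send_bulk, the batch size, and the worker count are assumptions to be tuned against a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Settings bodies for PUT /<index>/_settings before and after the load.
BULK_LOAD_SETTINGS = {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}
RESTORED_SETTINGS = {"index": {"refresh_interval": "1s", "number_of_replicas": 1}}

def chunked(items, size):
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def send_bulk(batch):
    """Stub standing in for one _bulk HTTP call; returns the documents 'indexed'."""
    return len(batch)

docs = [{"username": "user" + str(i), "age": i % 90} for i in range(1000)]

# Several bulk requests in flight at once, as a multi-threaded client would do.
with ThreadPoolExecutor(max_workers=4) as pool:
    indexed = sum(pool.map(send_bulk, chunked(docs, 250)))

print(indexed, "documents sent in", len(list(chunked(docs, 250))), "batches")
```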

Read Performance Optimization

Design data models that pre‑compute fields needed for scripts or aggregations.

Use filter context wherever possible; filters are cached and avoid scoring.

Avoid scripts in sorting or aggregations; store computed values as fields.

Profile slow queries by setting "profile": true in the search request to identify bottlenecks.

Set an appropriate number of replicas to improve read throughput without over‑replicating.

Keep shard sizes reasonable (15 GB for search, 50 GB for logs) to maintain query speed.

Determining the Right Number of Shards

Measure the throughput of a single‑shard, single‑node index (e.g., 10 k writes/sec). Divide the required production throughput by this baseline, then add replicas and safety margins. Ensure each shard stays within the recommended size limits.
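
The sizing procedure can be written down as a small calculation. The sketch below takes the larger of the throughput-based and size-based shard counts and then adds replicas; the function names, the 1.5 safety margin, and all input numbers are assumptions for illustration, with the 10 k writes/sec baseline taken from the example above.

```python
import math

def estimate_primary_shards(target_writes_per_sec, single_shard_writes_per_sec,
                            total_data_gb, max_shard_gb, safety_margin=1.5):
    """Primary shards needed to meet both the throughput and the size limits."""
    by_throughput = math.ceil(
        target_writes_per_sec * safety_margin / single_shard_writes_per_sec
    )
    by_size = math.ceil(total_data_gb / max_shard_gb)
    return max(by_throughput, by_size)

def total_shards(primaries, replicas):
    """Each replica adds one more full copy of every primary shard."""
    return primaries * (1 + replicas)

# 40 k writes/sec target, 10 k writes/sec measured on one shard,
# 600 GB of log data with a 50 GB per-shard limit.
primaries = estimate_primary_shards(40_000, 10_000, total_data_gb=600, max_shard_gb=50)
print(primaries, "primary shards,", total_shards(primaries, replicas=1), "shards in total")
```

In this example the size limit, not throughput, decides the shard count, so both constraints are worth checking.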

Additional Resources

For further reading, consult the official Elasticsearch documentation, especially the sections on indexing, search, aggregations, and cluster management.

