Introduction to ElasticSearch: Core Concepts, Architecture, and Common Operations

This article provides a comprehensive overview of ElasticSearch, covering its distributed architecture, fundamental components such as nodes, shards, and indices, as well as practical guidance on index design, mapping, bulk operations, query processing, scroll searches, alias management, and performance tuning tips.

360 Tech Engineering

Dec 27, 2019

ElasticSearch is an open‑source search engine built on Apache Lucene that offers distributed, near‑real‑time capabilities and a standard RESTful API. It can be deployed as a single node or as a cluster, allowing the system to handle data volumes beyond a single machine’s capacity while providing uninterrupted service.

Overall Architecture

Basic Concepts

Node : a single ElasticSearch service instance.

Master : supervises and controls other nodes.

Data : stores data and provides indexing capabilities.

Coordinating node : any node can act as a coordinator, gathering results from shards and returning them to the client; it requires sufficient CPU and memory.

Index : analogous to a database; stores documents and uses Lucene internally.

Shard : a physical Lucene index; the process of distributing data into shards is called sharding.

Primary shard : number of primary shards is fixed at index creation.

Replica shard : a copy of a primary shard; it provides fault tolerance and spreads read load across nodes, at the cost of extra work on every write.
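Why the number of primary shards is fixed at index creation becomes clearer from how a document is assigned to a shard. The following is a minimal sketch, assuming the simplified rule shard = hash(routing) % number_of_primary_shards. ElasticSearch itself uses a Murmur3 hash; zlib.crc32 is only a stand-in here, and the document IDs are hypothetical.

```python
import zlib


def pick_shard(routing_key: str, number_of_primary_shards: int) -> int:
    """Return the shard number for a routing key (by default the document ID).

    Illustrative only: crc32 replaces the Murmur3 hash used by ElasticSearch.
    """
    if number_of_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    digest = zlib.crc32(routing_key.encode("utf-8"))
    return digest % number_of_primary_shards


# Hypothetical document IDs, routed once with 5 and once with 6 primary shards.
doc_ids = [f"doc-{i}" for i in range(1000)]
with_5 = {d: pick_shard(d, 5) for d in doc_ids}
with_6 = {d: pick_shard(d, 6) for d in doc_ids}

# Documents whose shard changes could no longer be found where they were stored.
moved = sum(1 for d in doc_ids if with_5[d] != with_6[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because the modulus is part of the lookup, changing the primary shard count would send most existing documents to the wrong shard, which is why the value cannot simply be edited afterwards.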

Type (type)

In version 5.x, an index can contain multiple types (one-to-many); in 6.x an index can contain only one type (one-to-one); in 7.x types are deprecated, leaving a single data type per index, and the concept is removed entirely in 8.x.

Document (doc)

A document is the primary entity in ElasticSearch, composed of fields (name + value) and represented as a JSON object from the client’s perspective.

Cluster Health Status

Green : all primary and replica shards are active.

Yellow : all primary shards are active, but some replicas are not.

Red : some primary shards are unavailable.
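The three health states can be expressed as a small decision rule. This is a sketch of the logic described above, not the cluster's actual implementation; the ShardCopy type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ShardCopy:
    """One copy of a shard (hypothetical structure for illustration)."""
    index: str
    shard: int
    primary: bool
    active: bool


def cluster_health(copies: Iterable[ShardCopy]) -> str:
    """Derive green / yellow / red from the state of all shard copies."""
    copies = list(copies)
    if any(c.primary and not c.active for c in copies):
        return "red"      # at least one primary shard is unavailable
    if any(not c.primary and not c.active for c in copies):
        return "yellow"   # all primaries are active, some replica is not
    return "green"        # every primary and replica is active


example = [
    ShardCopy("logs", 0, primary=True, active=True),
    ShardCopy("logs", 0, primary=False, active=False),
]
print(cluster_health(example))
```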

ES Operations

Index Design

Index design consists of mapping and settings. Settings define the number of shards and replicas.

Mapping

Mappings define field types and properties. Dynamic mapping generates mappings automatically from indexed data, but it is discouraged because it can degrade performance, increase disk usage, and produce unexpected query results.

Typical mapping syntax is shown in the accompanying image.
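As an illustration of what an index definition with explicit settings and mappings can look like, the sketch below builds a request body in the typeless 7.x style. The field names and shard counts are hypothetical, and "dynamic": "strict" is one way to reject unmapped fields instead of relying on dynamic mapping.

```python
import json

# Body for creating an index (sent with PUT /<index-name>); all names are examples.
create_index_body = {
    "settings": {
        "number_of_shards": 3,      # primary shards, fixed once the index exists
        "number_of_replicas": 1,    # replica copies per primary shard
    },
    "mappings": {
        "dynamic": "strict",        # refuse documents containing unmapped fields
        "properties": {
            "title": {"type": "text"},         # analyzed, for full-text search
            "status": {"type": "keyword"},     # not analyzed, for exact matches
            "created_at": {"type": "date"},
            "views": {"type": "integer"},
        },
    },
}

print(json.dumps(create_index_body, indent=2))
```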

Templates

Logstash can generate index names based on the @timestamp field (e.g., logstash-2019.10.01) and apply predefined settings and mappings to those indices.

The template API can create a template named my_logs, apply it to all indices starting with logstash-, set the order, limit primary shards to 10, and disable the _all field.
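The template described above might look roughly like the following body, which would be sent with PUT /_template/my_logs. This is a sketch in the older template syntax that matches the article's description: the "template" key, the _default_ mapping, and the _all field exist only in older releases (newer ones use index_patterns and have no _all field). The order value of 1 is an assumption, since the text does not state it.

```python
import fnmatch
import json

my_logs_template = {
    "template": "logstash-*",               # apply to indices starting with logstash-
    "order": 1,                             # assumed value; higher orders override lower ones
    "settings": {"number_of_shards": 10},   # primary shard count from the text above
    "mappings": {
        "_default_": {"_all": {"enabled": False}},  # disable the _all field
    },
}


def template_applies(index_name: str, template: dict) -> bool:
    """Check locally whether an index name matches the template's pattern."""
    return fnmatch.fnmatch(index_name, template["template"])


print(json.dumps(my_logs_template, indent=2))
print(template_applies("logstash-2019.10.01", my_logs_template))
```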

Write Operations

While a single HTTP POST request can index one document, this is inefficient. ElasticSearch provides a bulk API that batches multiple operations into a single request, dramatically improving write throughput.

Recommended bulk size: the entire bulk request must be held in memory by the node that receives it, so overly large batches can hurt performance. A good batch size is typically 5-15 MB, corresponding to roughly 1,000-5,000 documents depending on document size.
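A bulk request body is newline-delimited JSON: one action line followed by one source line per document. The sketch below builds such bodies and splits them by byte size; the helper names, the index name, and the 10 MB default are assumptions for illustration, and the action lines use the typeless 7.x form. Each resulting string would be sent as the body of a POST to the _bulk endpoint.

```python
import json


def bulk_lines(index, docs):
    """Yield one NDJSON chunk (action line + source line) per (id, source) pair."""
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_id": doc_id}}
        yield json.dumps(action) + "\n" + json.dumps(source) + "\n"


def batched_bulk_bodies(index, docs, max_bytes=10 * 1024 * 1024):
    """Group chunks into request bodies that stay at or below max_bytes.

    A single document larger than max_bytes still gets a body of its own.
    """
    batch, size = [], 0
    for chunk in bulk_lines(index, docs):
        chunk_size = len(chunk.encode("utf-8"))
        if batch and size + chunk_size > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(chunk)
        size += chunk_size
    if batch:
        yield "".join(batch)


# Tiny limit so the split is visible; real limits would be in the megabyte range.
docs = [(str(i), {"message": f"log line {i}"}) for i in range(5)]
bodies = list(batched_bulk_bodies("logstash-2019.10.01", docs, max_bytes=120))
print(len(bodies), "request bodies")
```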

Data Retrieval Process

Search consists of two phases: query and fetch.

Query Phase

The client sends a search request to a coordinating node (e.g., Node 3), which creates a priority queue of size from + size.

The coordinating node forwards the request to one copy (primary or replica) of every shard in the index; each shard executes the query locally and adds its results to its own priority queue of size from + size.

Shards return document IDs and sort values to the coordinating node, which merges them into a global sorted list.

Fetch Phase

The coordinating node identifies the documents to retrieve and sends a multi‑get request to the relevant shards.

Each shard loads the requested documents and returns them.

Once all documents are gathered, the coordinating node returns the final result set to the client.
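The two phases can be simulated in memory to make the data flow concrete. This is a toy sketch, not how ElasticSearch is implemented: the shards are plain dictionaries with made-up scores, and the function names are hypothetical.

```python
import heapq

# Hypothetical data: two shards, each mapping doc ID -> (score, stored source).
SHARDS = [
    {"a": (0.9, {"title": "alpha"}), "b": (0.2, {"title": "beta"}), "c": (0.5, {"title": "gamma"})},
    {"d": (0.7, {"title": "delta"}), "e": (0.4, {"title": "epsilon"})},
]


def shard_query(shard, top_n):
    """Query phase on one shard: return only (score, doc_id) for its best top_n hits."""
    scored = ((score, doc_id) for doc_id, (score, _source) in shard.items())
    return heapq.nlargest(top_n, scored)


def search(shards, from_=0, size=2):
    # Query phase: every shard reports from + size lightweight hits.
    merged = []
    for shard_no, shard in enumerate(shards):
        for score, doc_id in shard_query(shard, from_ + size):
            merged.append((score, doc_id, shard_no))
    merged.sort(reverse=True)            # global ordering on the coordinating node
    page = merged[from_:from_ + size]    # only this slice is returned to the client

    # Fetch phase: load full documents only for the hits on the requested page.
    return [
        {"_id": doc_id, "_score": score, "_source": shards[shard_no][doc_id][1]}
        for score, doc_id, shard_no in page
    ]


print([hit["_id"] for hit in search(SHARDS, from_=0, size=2)])
print([hit["_id"] for hit in search(SHARDS, from_=2, size=2)])
```

Note that every shard has to produce from + size candidates even though only size documents are fetched, which is why deep pagination becomes expensive.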

Scroll (Cursor) Search

Scroll enables efficient retrieval of large result sets without the cost of deep pagination, similar to a database cursor. Each scroll response returns a _scroll_id, which may differ from the previous one; the most recent ID must be supplied in the subsequent request.

Many language clients (e.g., Python, Perl) provide convenient wrappers for scroll operations.
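The scroll loop itself is short. The sketch below keeps the HTTP layer out of the picture: start_search, continue_scroll, and clear_scroll are hypothetical callables that would wrap the initial search, the follow-up scroll request, and the scroll cleanup call, and FakeCluster is an in-memory stand-in so the loop can run without a real cluster.

```python
def scroll_all(start_search, continue_scroll, clear_scroll):
    """Yield every hit from a scroll search, one page at a time."""
    response = start_search()
    scroll_id = response["_scroll_id"]
    try:
        while response["hits"]["hits"]:        # an empty page means we are done
            yield from response["hits"]["hits"]
            response = continue_scroll(scroll_id)
            scroll_id = response["_scroll_id"]  # always carry the latest ID forward
    finally:
        clear_scroll(scroll_id)                 # release the server-side cursor


class FakeCluster:
    """In-memory stand-in that mimics the shape of scroll responses."""

    def __init__(self, docs, page_size):
        self.docs, self.page_size = docs, page_size
        self.pos, self.cleared = 0, False

    def _page(self):
        hits = self.docs[self.pos:self.pos + self.page_size]
        self.pos += len(hits)
        return {"_scroll_id": f"cursor-{self.pos}", "hits": {"hits": hits}}

    def start(self):
        return self._page()

    def next(self, scroll_id):
        return self._page()

    def clear(self, scroll_id):
        self.cleared = True


cluster = FakeCluster([{"_id": str(i)} for i in range(7)], page_size=3)
hits = list(scroll_all(cluster.start, cluster.next, cluster.clear))
print(len(hits), "hits retrieved")
```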

Index Aliases

Aliases act like symbolic links to one or more indices, allowing seamless index re‑creation and zero‑downtime index switching. An alias cannot share the same name as an index.

Renaming an alias follows a similar pattern: a single _aliases request combines a remove action and an add action, and the two are applied atomically.
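Zero-downtime switching relies on moving the alias in one request. The sketch below builds the body for POST /_aliases; the helper name and the index and alias names are hypothetical.

```python
import json


def switch_alias_body(alias, old_index, new_index):
    """Build an _aliases body that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Clients keep querying "my_index" while the data behind it is swapped.
body = switch_alias_body("my_index", "my_index_v1", "my_index_v2")
print(json.dumps(body, indent=2))
```

Because both actions travel in the same request, clients never observe a moment where the alias points to no index.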

DSL Query Optimization Tips

Use the appropriate query type (e.g., match, match_phrase, term) and combine clauses correctly with must, must_not, should.

Prefer filter clauses whenever scoring is not required.

Avoid relevance scoring operations if they are unnecessary.

Choose suitable field types (e.g., use keyword for exact matches in mappings).
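The tips above can be combined in a single bool query. This is a sketch with hypothetical field names: only the full-text clause sits in must and contributes to the score, while the exact-match and range conditions sit in filter, where no score is computed.

```python
import json

search_body = {
    "query": {
        "bool": {
            # Scored: full-text relevance matters for this clause.
            "must": [{"match": {"title": "distributed search"}}],
            # Not scored: yes/no conditions on keyword and date fields.
            "filter": [
                {"term": {"status": "published"}},
                {"range": {"created_at": {"gte": "2019-01-01"}}},
            ],
            # Exclusions are also evaluated without scoring.
            "must_not": [{"term": {"tags": "spam"}}],
        }
    }
}

print(json.dumps(search_body, indent=2))
```

The term clauses assume that status and tags are mapped as keyword, in line with the last tip.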

Conclusion

This article introduced ElasticSearch’s basic concepts and common practical methods. It does not cover deep internal mechanisms or advanced optimizations; further study is possible in areas such as DSL tuning, read/write performance, and new features in version 7.x and beyond.
