Mastering Elasticsearch Index Design: From Basics to Shard Planning

This article provides a comprehensive guide to Elasticsearch index concepts, covering index definitions, alias usage, mapping and field types, shard architecture, and practical recommendations for planning shards and optimizing resource consumption to ensure stable and efficient ES clusters.

Architect

Apr 23, 2025

Background

As Elasticsearch (ES) usage grows in business scenarios, the platform faces increasing pressure on cluster stability, management, and operations. Users often create indexes with inconsistent structures or copy scripts without understanding index fundamentals, leading to stability issues. To address this, the platform introduced template approval flows and dynamic shard expansion without downtime.

What Is an Index?

An index in ES is a collection of documents with similar characteristics, comparable to a table in a relational database. Each index has a unique name, and each document within it is identified by an _id. Documents are JSON objects that may be structured, semi‑structured, or unstructured. Indexes store, retrieve, and analyze data, and support search and aggregation operations.

Official description: "The index is the fundamental unit of storage in Elasticsearch, a logical namespace for storing data that share similar characteristics."

Index Structure Details

The index structure consists of three main components:

Alias: A logical name that can point to one or more indices or data streams, allowing queries, real‑time index switching, and zero‑downtime reindexing.

Mapping: Defines the data schema for documents. The type of an existing field cannot be changed after it is created, although new fields can be added later.

Settings: Includes shard count, replica count, refresh interval, etc.

Alias

Aliases are managed in the cluster state by the master node and add negligible overhead. They enable:

Querying multiple indices with a single name.

Changing the target index in real time.

Reindexing without downtime.

The platform recommends adding an alias to every index to support dynamic shard expansion.

Ways to Add an Alias

# Option 1: define the alias when creating the index
PUT /test_index
{
  "settings": {"number_of_shards": 1, "number_of_replicas": 1},
  "aliases": {"test_alias": {}},
  "mappings": {"properties": {"field1": {"type": "text"}, "createdAt": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"}}}
}

# Option 2: add an alias to an existing index
POST /_aliases
{
  "actions": [{"add": {"index": "test_index", "alias": "test_alias"}}]
}

# Option 3: add and remove aliases in a single atomic request
POST /_aliases
{
  "actions": [
    {"add": {"index": "existing_index", "alias": "test_alias"}},
    {"remove": {"index": "old_index", "alias": "old_test_alias"}}
  ]
}
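
The zero‑downtime switch mentioned above depends on sending the remove and add actions in one `_aliases` request, which Elasticsearch applies atomically. The following is a minimal Python sketch that only builds such a request body; the helper name `build_alias_swap` and the index names `test_index_v1` and `test_index_v2` are hypothetical, and sending the body with an HTTP client is left out.

```python
import json


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build a POST /_aliases body that moves `alias` from old_index to new_index.

    Both actions travel in one request, so clients querying the alias never
    see a moment where it points at no index.
    """
    if old_index == new_index:
        raise ValueError("old_index and new_index must differ")
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical index names, used only for illustration.
    body = build_alias_swap("test_alias", "test_index_v1", "test_index_v2")
    print(json.dumps(body, indent=2))
```

Because applications address only `test_alias`, the index behind it can be rebuilt with a different shard count and swapped in without a client change.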

Mapping

Mapping defines the data structure of documents. Once a field's type is set, it cannot be changed in place, because ES builds type‑specific index structures for it; changing the type means creating a new index with the corrected mapping and reindexing the data. ES also supports dynamic mapping, which infers field types from incoming data.

Field Types

The most common field types are text, keyword, and numeric (integer, long, double, etc.).

Text

Used for full‑text search; analyzed into tokens.

Not suitable for sorting or aggregations; fielddata is disabled on text fields by default.

Can be combined with a keyword sub‑field for exact matching.

Keyword

Stored as a whole without analysis; ideal for exact matches, sorting, and aggregations.

Supports case‑insensitive term queries via the case_insensitive parameter.
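
As an illustration of the case_insensitive parameter, here is a minimal Python sketch that builds a term query body. The helper name and the `status` field are hypothetical; the parameter itself is part of the term query in Elasticsearch 7.10 and later, so check your cluster version before relying on it.

```python
import json


def case_insensitive_term_query(field: str, value: str) -> dict:
    """Build a search body for an exact but case-insensitive match on a keyword field."""
    return {
        "query": {
            "term": {
                field: {
                    "value": value,
                    "case_insensitive": True,
                }
            }
        }
    }


if __name__ == "__main__":
    # "status" is a hypothetical keyword field.
    print(json.dumps(case_insensitive_term_query("status", "active"), indent=2))
```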

Numeric

Includes long, integer, float, double, etc.

Best for range queries, sorting, and aggregations.
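
The integer types differ only in the range they can hold, so the narrowest type that covers the expected values is usually enough. The sketch below is a hypothetical helper, not part of any Elasticsearch client; it also considers byte and short, which are valid Elasticsearch types although the list above does not name them.

```python
# Inclusive value ranges of Elasticsearch's signed integer field types,
# ordered from narrowest to widest.
INTEGER_TYPES = [
    ("byte", -2**7, 2**7 - 1),
    ("short", -2**15, 2**15 - 1),
    ("integer", -2**31, 2**31 - 1),
    ("long", -2**63, 2**63 - 1),
]


def smallest_integer_type(min_value: int, max_value: int) -> str:
    """Return the narrowest integer field type that can hold [min_value, max_value]."""
    if min_value > max_value:
        raise ValueError("min_value must not exceed max_value")
    for name, low, high in INTEGER_TYPES:
        if low <= min_value and max_value <= high:
            return name
    raise ValueError("range does not fit in a 64-bit signed integer")


if __name__ == "__main__":
    print(smallest_integer_type(0, 120))      # an age-like field
    print(smallest_integer_type(0, 5 * 10**9))  # a large counter
```

Leave headroom for growth when estimating the range, since widening a field later means reindexing.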

Recommendations for Field Types

Prefer keyword for fields that do not require full‑text search.

Use multi‑fields to store both text and keyword versions when needed.

For numeric data, choose the smallest suitable type to save space.

Avoid using text for aggregations; use keyword instead.
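
The multi‑field recommendation can be sketched as follows. The Python snippet only builds a mapping body and a matching aggregation body; the field names `title` and `status`, the helper name, and the `ignore_above` value of 256 (the value dynamic mapping applies to its generated keyword sub‑fields) are illustrative assumptions.

```python
import json


def text_with_keyword(ignore_above: int = 256) -> dict:
    """Map a field as analyzed text plus an un-analyzed `keyword` sub-field."""
    return {
        "type": "text",
        "fields": {
            "keyword": {"type": "keyword", "ignore_above": ignore_above},
        },
    }


# Hypothetical fields: `title` needs full-text search and exact matching,
# `status` only needs exact matching, so it is a plain keyword.
MAPPINGS = {
    "properties": {
        "title": text_with_keyword(),
        "status": {"type": "keyword"},
    }
}

# Aggregate on the keyword sub-field, never on the text field itself.
TERMS_AGG = {
    "size": 0,
    "aggs": {"by_title": {"terms": {"field": "title.keyword"}}},
}

if __name__ == "__main__":
    print(json.dumps({"mappings": MAPPINGS}, indent=2))
    print(json.dumps(TERMS_AGG, indent=2))
```

Full‑text queries then target `title`, while sorting, exact matching, and aggregations target `title.keyword`.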

Shard Structure (Shards and Replicas)

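ES splits an index into primary shards and replica shards. Primary shards hold the actual data; replicas provide redundancy and increase read throughput. The number of primary shards is set at index creation via number_of_shards and cannot be changed in place afterwards; changing it means building a new index, for example with the split or shrink APIs or a reindex. The replica count (number_of_replicas) can be adjusted at any time.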

Primary vs. Replica

Primary shard: Stores a portion of the index data and its segment files. Can be moved across nodes for load balancing.

Replica shard: A full copy of a primary shard, always allocated on a different node from its primary; it is promoted to primary if the original fails.

Important note: a single shard cannot hold more than 2,147,483,519 documents. This is a hard limit inherited from Lucene, so plan shard counts to stay well below it.
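
To see how the per‑shard document limit constrains planning, the following Python sketch computes the minimum number of primary shards for a given document count. The function name and the `headroom` parameter (the fraction of the limit you are willing to fill) are assumptions made for illustration, and the calculation assumes documents are spread evenly across shards.

```python
MAX_DOCS_PER_SHARD = 2_147_483_519  # per-shard document limit quoted above


def min_primaries_for_docs(total_docs: int, headroom: float = 0.5) -> int:
    """Minimum primary shards so no shard exceeds `headroom` of the doc limit.

    Assumes an even distribution of documents across shards.
    """
    if total_docs < 0:
        raise ValueError("total_docs must be non-negative")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    usable = int(MAX_DOCS_PER_SHARD * headroom)
    # Integer ceiling division avoids floating-point rounding on large counts.
    return max(1, -(-total_docs // usable))


if __name__ == "__main__":
    print(min_primaries_for_docs(10_000_000_000))
```

In practice the size‑based target usually demands more shards than this bound, so treat the result as a floor rather than a recommendation.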

Shard Planning

Choosing the right shard count is critical. Too many shards increase memory usage, file handles, and CPU overhead; too few can cause hotspots.

Estimate primary shards as total_data_size / ideal_shard_size (10‑50 GB per shard is recommended).

Set replicas based on read‑heavy scenarios, but remember they add write overhead.

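Avoid overly frequent refreshes; each refresh writes a new segment, which raises I/O and CPU load. For write‑heavy indexes, consider raising refresh_interval above its 1s default.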

For time‑series data, consider index lifecycle management (ILM) and rolling indexes (daily/weekly/monthly).
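
The size‑based estimate above is simple arithmetic. A minimal Python sketch follows; the function name and the 30 GB default are assumptions (30 GB is simply the midpoint of the 10‑50 GB range), and the input should be the expected primary data size including projected growth.

```python
import math


def estimate_primary_shards(total_data_gb: float, ideal_shard_gb: float = 30.0) -> int:
    """Estimate primary shards as total_data_size / ideal_shard_size, rounded up."""
    if total_data_gb < 0:
        raise ValueError("total_data_gb must be non-negative")
    if ideal_shard_gb <= 0:
        raise ValueError("ideal_shard_gb must be positive")
    return max(1, math.ceil(total_data_gb / ideal_shard_gb))


if __name__ == "__main__":
    total_gb = 600  # hypothetical index size
    # Show how the answer moves across the recommended 10-50 GB range.
    for target in (10, 30, 50):
        print(f"{target} GB per shard: {estimate_primary_shards(total_gb, target)} primaries")
```

The spread between the 10 GB and 50 GB results shows how much latitude the guideline leaves; node count and query patterns decide where in that range to land.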

Resource Impact

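Each shard consumes roughly 10‑30 MB of heap for metadata; treat this as a rough figure, since the actual cost varies with mappings, segment count, and ES version.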

Excessive shards can exhaust file descriptors.

Segment fragmentation increases I/O and memory pressure.
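
To get a feel for how shard count translates into heap, this Python sketch multiplies the total shard count (primaries plus replicas) by the per‑shard range quoted above. The function name is hypothetical, the 10‑30 MB figure is the article's rough estimate rather than a measured value, and the result is cluster‑wide, spread across all data nodes.

```python
def shard_heap_range_mb(primaries: int, replicas: int,
                        per_shard_mb: tuple = (10, 30)) -> tuple:
    """Return (low, high) heap overhead in MB for one index across the cluster.

    `replicas` is the number_of_replicas setting, so every primary has that
    many additional copies.
    """
    if primaries < 1 or replicas < 0:
        raise ValueError("primaries must be >= 1 and replicas >= 0")
    total_shards = primaries * (1 + replicas)
    low, high = per_shard_mb
    return total_shards * low, total_shards * high


if __name__ == "__main__":
    low, high = shard_heap_range_mb(primaries=20, replicas=1)
    print(f"about {low}-{high} MB of heap across the cluster")
```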

Relationship Between Index and Resource Consumption

More shards mean more independent Lucene indexes, each with its own caches and segments. This leads to higher memory consumption, increased garbage‑collection pressure, and potential “too many open files” errors.

Summary

Creating an Elasticsearch index requires careful consideration of field types, mapping, alias usage, and shard planning. Properly designed indexes improve cluster stability, query performance, and resource efficiency. Continual monitoring and adjustment of shard counts are essential as data volume and access patterns evolve.

