
Master Elasticsearch Data Modeling: From Business Needs to Advanced Index Strategies

This guide walks you through the full Elasticsearch data‑modeling workflow, covering business‑driven design, handling large data volumes, optimal settings, mapping choices, and complex parent‑child relationships, while providing practical code examples and visual diagrams for immediate application.

dbaplus Community

Nov 15, 2022


Why Data Modeling Matters

Elasticsearch’s flexible schema can lead to inefficient storage and slow queries if the data model is not designed deliberately. Proper modeling reduces storage waste, improves query performance, and avoids costly re‑indexing.

Business‑Driven Modeling

Group related business entities and decide whether each group needs a separate index or can share one. Use consistent field names across similar sources (e.g., social‑media platforms) to simplify DSL queries. Apply index templates and aliases to manage groups of indices with a common prefix.
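As a rough sketch of the template-and-alias approach, the Python snippet below builds a request body for a composable index template (PUT _index_template/<name>). The prefix social, the alias social_all, and the three field names are hypothetical and only illustrate shared naming across sources.

```python
import json


def build_index_template(prefix: str, alias: str, shards: int = 1) -> dict:
    """Build a body for PUT _index_template/<name> (composable templates)."""
    return {
        # Every index whose name starts with "<prefix>-" picks up this template.
        "index_patterns": [f"{prefix}-*"],
        "template": {
            "settings": {"number_of_shards": shards, "number_of_replicas": 1},
            "mappings": {
                "properties": {
                    # Hypothetical fields shared by all sources in the group.
                    "author": {"type": "keyword"},
                    "content": {"type": "text"},
                    "published_at": {"type": "date"},
                }
            },
            # Matching indices are added to this alias when they are created.
            "aliases": {alias: {}},
        },
    }


body = build_index_template("social", "social_all")
print(json.dumps(body, indent=2))
```

With a template like this in place, a query sent to the alias would cover every index in the group without listing them one by one.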

Modeling for Data Volume

For time‑series or rapidly growing data, split indices by time (daily, monthly) to enable efficient deletion, archiving, and hot‑cold node architectures.
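One way to picture time-based splitting is as a naming convention plus a retention rule. The sketch below is only an illustration: the prefix-date naming format and the helper names are assumptions, and in practice index lifecycle management or a scheduled job would do the deleting.

```python
from datetime import date, datetime, timedelta


def daily_index(prefix: str, day: date) -> str:
    """Name of the index that holds documents for one day."""
    return f"{prefix}-{day:%Y.%m.%d}"


def monthly_index(prefix: str, day: date) -> str:
    """Name of the index that holds documents for one month."""
    return f"{prefix}-{day:%Y.%m}"


def expired_daily_indices(prefix, existing, today, keep_days):
    """Return the daily indices that are older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in existing:
        if not name.startswith(prefix + "-"):
            continue  # belongs to another index group
        try:
            day = datetime.strptime(name[len(prefix) + 1:], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a daily date, leave the index alone
        if day < cutoff:
            expired.append(name)
    return expired
```

Dropping a whole expired index is much cheaper than deleting its documents one by one, which is the main reason for splitting by time.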

Index‑Level Settings

number_of_shards : choose based on expected data size and cluster scale; it cannot be changed in place after creation (changing it requires the split or shrink API, or a reindex).

number_of_replicas : at least one for high availability.

refresh_interval : default 1s; increase (e.g., 30s) when near‑real‑time visibility is not required.

max_result_window : default 10,000; keep the default, and for deep pagination use search_after or scroll rather than raising it.
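To make the search_after suggestion concrete, here is a minimal sketch of how the request bodies for consecutive pages could be built. The sort fields published_at and doc_id are hypothetical; the only real requirement is a deterministic sort with a unique tiebreaker.

```python
def first_page(query: dict, size: int) -> dict:
    """Request body for the first page, sorted deterministically."""
    return {
        "size": size,
        "query": query,
        # doc_id is an assumed unique keyword field used as a tiebreaker.
        "sort": [{"published_at": "desc"}, {"doc_id": "asc"}],
    }


def next_page(previous_body: dict, last_hit: dict) -> dict:
    """Request body for the page that follows the last hit of the previous one."""
    body = dict(previous_body)
    # Each hit carries the sort values it was ranked by; they become the cursor.
    body["search_after"] = last_hit["sort"]
    return body


page1 = first_page({"match_all": {}}, 100)
page2 = next_page(page1, {"_id": "a", "sort": [1668470400000, "a"]})
```

Because each page starts from a cursor rather than an offset, the request never needs from + size to exceed max_result_window.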

Ingest pipelines can add timestamps or perform other preprocessing before indexing:

PUT _ingest/pipeline/indexed_at
{
  "description": "Adds indexed_at timestamp to documents",
  "processors": [
    { "set": { "field": "_source.indexed_at", "value": "{{_ingest.timestamp}}" } }
  ]
}
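Before attaching a pipeline to an index, it can be tried against sample documents with the simulate API. The helper below is a sketch that only assembles the request; the sample document and the function name are made up for illustration.

```python
def simulate_request(pipeline_id: str, sample_docs: list) -> tuple:
    """Build method, path, and body for the ingest simulate API."""
    path = f"/_ingest/pipeline/{pipeline_id}/_simulate"
    # Each sample document is wrapped in the _source envelope the API expects.
    body = {"docs": [{"_source": doc} for doc in sample_docs]}
    return "POST", path, body


method, path, body = simulate_request("indexed_at", [{"cont": "hello"}])
```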

Creating an index that uses the pipeline:

PUT my_index_0001
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval": "30s",
    "index": { "default_pipeline": "indexed_at" }
  },
  "mappings": {
    "properties": {
      "cont": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": { "keyword": { "type": "keyword" } }
      }
    }
  }
}

Mapping‑Level Modeling

Use keyword for exact matches, sorting, and aggregations; use text with appropriate analyzers for full‑text search.

Select the smallest numeric type that fits the data (e.g., integer instead of long).

For Chinese text, choose ik_max_word (fine‑grained) or ik_smart (coarse‑grained).

Define multi-fields (the fields mapping parameter) to index one field as both text and keyword.

Example of a mixed‑field mapping:

PUT mix_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "standard": { "type": "text", "analyzer": "standard" },
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}
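The advice on picking the smallest numeric type can be expressed as a small lookup over the signed ranges of the integer field types. This is an illustrative helper, not part of any Elasticsearch client.

```python
# Signed ranges of the Elasticsearch integer field types, smallest first.
INTEGER_TYPES = [
    ("byte", -2**7, 2**7 - 1),
    ("short", -2**15, 2**15 - 1),
    ("integer", -2**31, 2**31 - 1),
    ("long", -2**63, 2**63 - 1),
]


def smallest_integer_type(min_value: int, max_value: int) -> str:
    """Return the narrowest integer field type that covers the value range."""
    for name, low, high in INTEGER_TYPES:
        if low <= min_value and max_value <= high:
            return name
    raise ValueError("range does not fit in a 64-bit signed integer")
```

For example, an age field bounded by 0 and 150 would fit in short, while a view counter that may pass two billion needs long.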

Complex Index Relationships

Elasticsearch does not support relational joins; instead use one of the following patterns:

Wide‑table (denormalization) : duplicate parent data in child documents for single‑index queries.

Nested type : store arrays of objects that need independent querying; suitable for low‑update, high‑read scenarios.

Join type (parent‑child) : use has_child / has_parent queries for 1‑to‑N relationships where child documents are updated frequently.

Application‑level joins : perform multiple queries and combine results in code when index‑level solutions are impractical.
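For the join-type pattern in the list above, the following sketch shows roughly what the mapping, the documents, and a has_child query look like. The relation field and the question/answer relation names are hypothetical.

```python
def join_mapping() -> dict:
    """Index body with a join field: one question has many answers."""
    return {
        "mappings": {
            "properties": {
                "relation": {"type": "join", "relations": {"question": "answer"}},
                "body": {"type": "text"},
            }
        }
    }


def parent_doc(body: str) -> dict:
    """A parent document only names its side of the relation."""
    return {"body": body, "relation": {"name": "question"}}


def child_doc(body: str, parent_id: str) -> dict:
    """A child document names its parent; it must also be indexed with
    routing=<parent_id> so that it lands on the same shard as the parent."""
    return {"body": body, "relation": {"name": "answer", "parent": parent_id}}


def has_child_query(text: str) -> dict:
    """Find parents that have at least one child matching the text."""
    return {
        "query": {
            "has_child": {"type": "answer", "query": {"match": {"body": text}}}
        }
    }
```

Since parent and child are separate documents, a child can be updated without reindexing the parent, which is the trade-off against nested fields.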

Wide‑table example:

PUT user/_doc/1
{ "name": "John Smith", "email": "[email protected]", "dob": "1970/10/24" }

PUT blogpost/_doc/2
{ "title": "Relationships", "body": "It's complicated...", "user": { "id": 1, "name": "John Smith" } }

Query against the denormalized index (no join is needed, because the user name is stored in each blog post):

GET /blogpost/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "relationships" } },
        { "match": { "user.name": "John" } }
      ]
    }
  }
}
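The same question could instead be answered with an application-level join: look up the users first, then filter blog posts by their ids. The sketch below only builds the two request bodies and assumes that user.id in blogpost holds the _id of the matching user document; the helper names and the size limit are illustrative.

```python
def find_users_body(name: str) -> dict:
    """Step 1: search the user index; only the ids are needed."""
    return {"query": {"match": {"name": name}}, "_source": False, "size": 100}


def hit_ids(response: dict) -> list:
    """Extract document ids from a search response."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


def posts_for_users_body(title: str, user_ids: list) -> dict:
    """Step 2: search blogpost, restricted to the users found in step 1."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": title}}],
                "filter": [{"terms": {"user.id": user_ids}}],
            }
        }
    }


# A trimmed-down response of the shape a search returns, for illustration.
fake_response = {"hits": {"hits": [{"_id": "1"}, {"_id": "7"}]}}
second_query = posts_for_users_body("relationships", hit_ids(fake_response))
```

This costs two round trips, but a change to a user's name then touches only one document.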

Key Recommendations

Prefer denormalization, which trades storage space for query time, over runtime scripts.

Use ingest pipelines for preprocessing rather than post‑ingest scripts.

Leverage routing, index sorting, and appropriate field types to keep queries fast and storage efficient.

Disable dynamic mapping in production (set dynamic to strict or false) and enforce an explicit schema to prevent field explosion.
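A strict schema can be sketched as a mapping with dynamic set to strict, under which Elasticsearch rejects documents that contain unmapped fields. The field names and the client-side check below are hypothetical, and the check only inspects top-level fields.

```python
STRICT_MAPPING = {
    "mappings": {
        # With "strict", a document containing an unmapped field is rejected.
        "dynamic": "strict",
        "properties": {
            "title": {"type": "text"},
            "status": {"type": "keyword"},
            "created_at": {"type": "date"},
        },
    }
}


def unknown_fields(doc: dict, index_body: dict) -> list:
    """List top-level fields of doc that the mapping does not declare."""
    declared = set(index_body["mappings"]["properties"])
    return sorted(set(doc) - declared)
```

Running such a check in the indexing application surfaces schema drift before the cluster starts rejecting writes.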
