Master Elasticsearch Data Modeling: From Business Needs to Advanced Index Strategies
This guide walks you through the full Elasticsearch data‑modeling workflow, covering business‑driven design, handling large data volumes, optimal settings, mapping choices, and complex parent‑child relationships, while providing practical code examples and visual diagrams for immediate application.
Why Data Modeling Matters
Elasticsearch’s flexible schema can lead to inefficient storage and slow queries if the data model is not designed deliberately. Proper modeling reduces storage waste, improves query performance, and avoids costly re‑indexing.
Business‑Driven Modeling
Group related business entities and decide whether each group needs a separate index or can share one. Use consistent field names across similar sources (e.g., social‑media platforms) to simplify DSL queries. Apply index templates and aliases to manage groups of indices with a common prefix.
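As a rough sketch of the template-plus-alias idea, the request below defines a composable index template (available in Elasticsearch 7.8 and later). The names social_template, social-*, and social_all, along with the fields platform, author, and posted_at, are illustrative assumptions rather than names from the original design.

```console
# Hypothetical names: social_template, social-*, social_all, and all field names.
PUT _index_template/social_template
{
  "index_patterns": ["social-*"],
  "template": {
    "settings": { "number_of_shards": 1, "number_of_replicas": 1 },
    "mappings": {
      "properties": {
        "platform":  { "type": "keyword" },
        "author":    { "type": "keyword" },
        "posted_at": { "type": "date" }
      }
    },
    "aliases": { "social_all": {} }
  }
}
```

Under these assumptions, any index created later with a name matching social-* would inherit the shared field names and join the social_all alias, so one query against the alias can cover every platform.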
Modeling for Data Volume
For time‑series or rapidly growing data, split indices by time (daily, monthly) to enable efficient deletion, archiving, and hot‑cold node architectures.
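One possible way to automate time-based splitting and deletion is an index lifecycle management (ILM) policy. The sketch below is only an illustration: the policy name logs_policy and the 1-day and 30-day thresholds are assumptions, not values recommended by the source.

```console
# Hypothetical policy name and thresholds, chosen for illustration only.
PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

A policy like this takes effect only once an index or index template refers to it through the index.lifecycle.name setting. Rolling over daily keeps each index small enough to delete or move as a whole, which is the point of splitting by time.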
Index‑Level Settings
number_of_shards : choose based on expected data size and cluster scale; it cannot be changed in place after creation (only through the split or shrink APIs, or by reindexing).
number_of_replicas : at least one for high availability.
refresh_interval : default 1s; increase (e.g., 30s) when near‑real‑time visibility is not required.
max_result_window : default 10,000; keep the default and, when deep pagination is needed, use search_after or scroll rather than raising it.
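The settings above can be illustrated with a short sketch. The index my_index_0001 and the indexed_at field come from the examples in this section; doc_id is a hypothetical keyword field assumed to hold a unique value per document, and the search_after values are placeholders.

```console
# Dynamic settings can be changed on a live index; number_of_shards cannot.
PUT my_index_0001/_settings
{
  "index": { "refresh_interval": "30s", "number_of_replicas": 1 }
}

# First page: sort on a timestamp plus a unique tiebreaker.
# doc_id is a hypothetical keyword field, not part of the original mapping.
GET my_index_0001/_search
{
  "size": 100,
  "sort": [ { "indexed_at": "asc" }, { "doc_id": "asc" } ]
}

# Next page: pass the sort values of the last hit from the previous page.
# The two values below are placeholders.
GET my_index_0001/_search
{
  "size": 100,
  "sort": [ { "indexed_at": "asc" }, { "doc_id": "asc" } ],
  "search_after": [ 1705305600000, "a1b2" ]
}
```

Because each page starts from the previous page's last sort values instead of an offset, this pattern can go past the 10,000-hit window without changing max_result_window.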
Ingest pipelines can add timestamps or perform other preprocessing before indexing:
PUT _ingest/pipeline/indexed_at
{
  "description": "Adds indexed_at timestamp to documents",
  "processors": [
    { "set": { "field": "_source.indexed_at", "value": "{{_ingest.timestamp}}" } }
  ]
}
Creating an index that uses the pipeline:
PUT my_index_0001
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval": "30s",
    "index": { "default_pipeline": "indexed_at" }
  },
  "mappings": {
    "properties": {
      "cont": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": { "keyword": { "type": "keyword" } }
      }
    }
  }
}
Mapping‑Level Modeling
Use keyword for exact matches, sorting, and aggregations; use text with appropriate analyzers for full‑text search.
Select the smallest numeric type that fits the data (e.g., integer instead of long).
For Chinese text, choose ik_max_word (fine‑grained) or ik_smart (coarse‑grained).
Define multi‑fields (the fields mapping parameter) to index both text and keyword versions of a field.
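The guidelines above might translate into a mapping like the following sketch. The index name product_index and all of its fields are hypothetical, and ik_smart assumes the IK analysis plugin is installed.

```console
# Hypothetical index and field names, for illustration only.
PUT product_index
{
  "mappings": {
    "properties": {
      "sku":      { "type": "keyword" },
      "title":    { "type": "text", "analyzer": "ik_smart" },
      "quantity": { "type": "integer" }
    }
  }
}
```

Here sku is matched exactly and can be sorted or aggregated on, title is analyzed for full-text search, and quantity uses integer on the assumption that its values never exceed the 32-bit range.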
Example of a mixed‑field mapping:
PUT mix_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "standard": { "type": "text", "analyzer": "standard" },
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}
Complex Index Relationships
Elasticsearch does not support relational joins; instead use one of the following patterns:
Wide‑table (denormalization) : duplicate parent data in child documents for single‑index queries.
Nested type : store arrays of objects that need independent querying; suitable for low‑update, high‑read scenarios.
Join type (parent‑child) : use has_child / has_parent queries for 1‑to‑N relationships where child documents are updated frequently.
Application‑level joins : perform multiple queries and combine results in code when index‑level solutions are impractical.
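For comparison, the nested and join patterns could look roughly like the sketch below. The index names blog_nested and blog_join, the field names, and the sample values are all assumptions made for illustration.

```console
# Nested: each comment is indexed as its own hidden document, so both
# conditions must hold within the same comment object.
PUT blog_nested
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "comments": {
        "type": "nested",
        "properties": {
          "author": { "type": "keyword" },
          "stars":  { "type": "integer" }
        }
      }
    }
  }
}

GET blog_nested/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "bool": {
          "must": [
            { "term":  { "comments.author": "alice" } },
            { "range": { "comments.stars": { "gte": 4 } } }
          ]
        }
      }
    }
  }
}

# Join: parent and child are separate documents in the same index.
PUT blog_join
{
  "mappings": {
    "properties": {
      "title":    { "type": "text" },
      "author":   { "type": "keyword" },
      "relation": { "type": "join", "relations": { "post": "comment" } }
    }
  }
}

PUT blog_join/_doc/1
{ "title": "Relationships", "relation": "post" }

# A child must be routed to its parent's shard, hence routing=1.
PUT blog_join/_doc/2?routing=1
{ "author": "alice", "relation": { "name": "comment", "parent": "1" } }

# Find posts that have at least one comment by alice.
GET blog_join/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "query": { "term": { "author": "alice" } }
    }
  }
}
```

The trade-off shows up in the write path: changing one nested comment means reindexing the whole parent document, while a join child can be indexed or updated on its own, at the cost of slower has_child and has_parent queries.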
Wide‑table example:
PUT user/_doc/1
{ "name": "John Smith", "email": "[email protected]", "dob": "1970/10/24" }
PUT blogpost/_doc/2
{ "title": "Relationships", "body": "It's complicated...", "user": { "id": 1, "name": "John Smith" } }
Query against the denormalized index (no join needed):
GET /blogpost/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "relationships" } },
        { "match": { "user.name": "John" } }
      ]
    }
  }
}
Key Recommendations
Prefer denormalization (space‑for‑time) over runtime scripts.
Use ingest pipelines for preprocessing rather than post‑ingest scripts.
Leverage routing, index sorting, and appropriate field types to keep queries fast and storage efficient.
Disable dynamic mapping in production; enforce a strict schema to prevent field explosion.
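Several of these recommendations can be combined in one index definition. The following is a minimal sketch, assuming a hypothetical orders_strict index with made-up fields and a made-up routing value.

```console
# Hypothetical index, fields, and routing value, for illustration only.
PUT orders_strict
{
  "settings": {
    "index": {
      "sort.field": "created_at",
      "sort.order": "desc"
    }
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "order_id":    { "type": "keyword" },
      "customer_id": { "type": "keyword" },
      "created_at":  { "type": "date" }
    }
  }
}

# Custom routing keeps all documents for one customer on one shard.
PUT orders_strict/_doc/1?routing=customer_42
{ "order_id": "o-1", "customer_id": "customer_42", "created_at": "2024-01-15T08:00:00Z" }

# Supplying the same routing value lets the search hit only that shard.
GET orders_strict/_search?routing=customer_42
{
  "query": { "term": { "customer_id": "customer_42" } }
}
```

With "dynamic": "strict", a document containing a field that is not in the mapping is rejected instead of silently adding a new field, which is what prevents field explosion.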
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
