Master Elasticsearch: Index Design, Field Types, and Cluster Management Tips
An experienced engineer shares practical Elasticsearch insights covering index design with aliases and routing, field type choices, query optimization techniques, pagination strategies, real‑time refresh settings, memory limits, and cluster management, offering concrete examples and actionable recommendations for robust search implementations.
Introduction
“When I first touched Elasticsearch, it felt like a black box—just dump data, write a query, and get results. Working on the core search module of my company revealed many hidden details.”
Below is a compilation of practical Elasticsearch experience, focusing on index design, field types, query optimization, cluster management, and architecture design.
Index Design: From Basics to Advanced
1. Index Alias – A Safety Net for Changes
Referencing an index name directly makes schema changes painful: Elasticsearch does not let you change the type of an existing field, and the number of primary shards cannot be changed in place. The solution is to always reference an alias in application code; when you rebuild an index, switch the alias to the new index and users never notice the change.
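As a rough illustration of the switch, the sketch below only builds the request body for the `_aliases` endpoint, where the remove and add actions are applied in a single atomic request. The names `orders`, `orders_v1`, and `orders_v2` are hypothetical, and sending the request is left to whichever client you use.

```python
import json


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build a body for POST /_aliases that moves `alias` in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names, for illustration only.
body = build_alias_swap("orders", "orders_v1", "orders_v2")
print(json.dumps(body, indent=2))
```

Because application code only ever sees the alias, the rebuild and the cut-over can happen without a deployment.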
2. Routing – Precise Queries for SaaS E‑commerce
In a SaaS e‑commerce system, querying a single merchant’s orders was slow because the default hash‑based routing scattered a merchant’s data across multiple shards. By using the merchant ID as the routing key during indexing and searching, all documents for that merchant reside in the same shard.
Before: query scans all shards (e.g., 3 shards).
After: query scans only 1 shard.
Result: in this case, query speed roughly doubled, with lower resource consumption.
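The effect can be sketched with a simplified shard-selection rule, hash(routing) % number_of_primary_shards. This is only an approximation: `zlib.crc32` stands in for Elasticsearch's own Murmur3-based hash, and the document and merchant IDs are made up. In a real cluster you would pass the merchant ID through the `routing` parameter on both index and search requests.

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # matches the 3-shard example above


def pick_shard(routing_value: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Simplified shard selection: hash(routing) % number_of_primary_shards.

    crc32 is a stand-in; Elasticsearch uses its own Murmur3-based hash.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


# Default routing uses the document _id, so one merchant's orders spread out.
order_ids = [f"order-{i}" for i in range(100)]
shards_by_id = {pick_shard(doc_id) for doc_id in order_ids}

# Custom routing uses the merchant ID, so every order maps to the same shard.
shards_by_merchant = {pick_shard("merchant-42") for _ in order_ids}

print("shards touched with _id routing:", sorted(shards_by_id))
print("shards touched with merchant routing:", sorted(shards_by_merchant))
```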
3. Shard Splitting – Handling Data Growth
When an index grows, simply adding shards is not optimal. Recommended shard sizes:
Business index: 10–30 GB per shard.
Search index: ≤10 GB per shard.
Log index: 20–50 GB per shard.
For SaaS systems with “super‑large merchants”, split indices by merchant_id % 64. The remainder ranges from 0 to 63, so it maps onto 64 indices such as orders_001 … orders_064 (remainder + 1), each holding a subset of merchants’ data. Continue to use the merchant ID as the routing key for queries.
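A small sketch of the index-selection rule follows. The `+ 1` offset is my assumption about how the 0..63 remainder lines up with the 1-based names orders_001 … orders_064; the function name is hypothetical.

```python
INDEX_COUNT = 64


def orders_index_for(merchant_id: int) -> str:
    """Map a merchant to one of orders_001 .. orders_064.

    The +1 shifts the 0..63 remainder onto the 1-based suffixes (assumption).
    """
    return f"orders_{merchant_id % INDEX_COUNT + 1:03d}"


for merchant_id in (0, 63, 64, 1000):
    print(merchant_id, "->", orders_index_for(merchant_id))
```

Whatever rule you pick, keep it in one shared function so that writers and readers can never disagree about where a merchant lives.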
“Choose shard‑splitting rules and routing algorithms based on business data volume and requirements, and avoid creating excessive shards that burden the cluster.”
Field Types: Choose What Matters
4. Text vs. Keyword – Core Differences
A pitfall: phone numbers stored as text were run through the analyzer and, with the analyzer in use, split into tokens (e.g., 13800138000 → 138, 0013, 8000), so exact searches failed. Use keyword for exact matches (order numbers, phone numbers) and text for full‑text search.
Use text when you need analysis (e.g., product descriptions).
Use keyword for exact matching; term and terms queries against it are faster, and the field uses less storage.
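To make the split concrete, here is a minimal mapping and query sketch. The field names `phone`, `order_no`, and `description` are hypothetical; only the type choices come from the discussion above. The dicts are request bodies, not calls to a live cluster.

```python
import json

# Hypothetical field names; the types follow the text/keyword rule above.
mapping = {
    "mappings": {
        "properties": {
            "phone": {"type": "keyword"},      # exact match only
            "order_no": {"type": "keyword"},   # exact match only
            "description": {"type": "text"},   # analyzed for full-text search
        }
    }
}

# Exact lookup: a term query compares the stored value as-is.
exact_phone_query = {"query": {"term": {"phone": "13800138000"}}}

# Full-text lookup: a match query analyzes the input before searching.
full_text_query = {"query": {"match": {"description": "wireless headphones"}}}

print(json.dumps(mapping, indent=2))
```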
5. Multi‑Fields – Use When Needed
With dynamic mapping, Elasticsearch automatically adds a keyword sub‑field to every string it maps as text. If you only need full‑text search, define the mapping explicitly and leave the sub‑field out. Enable multi‑fields when you require both exact matching/aggregation and analyzed search.
Enable multi‑fields for precise matching and aggregations.
Disable for pure full‑text search to save storage and improve write speed.
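The two options can be sketched as a tiny helper that builds the field definition either way. The helper and field names are hypothetical; the `ignore_above: 256` value mirrors what dynamic mapping generates for strings.

```python
def text_field(with_keyword: bool) -> dict:
    """Build a text field definition, optionally with a keyword sub-field."""
    field = {"type": "text"}
    if with_keyword:
        # Same shape dynamic mapping produces for string values.
        field["fields"] = {"keyword": {"type": "keyword", "ignore_above": 256}}
    return field


properties = {
    "title": text_field(with_keyword=True),         # search + aggregate on title.keyword
    "description": text_field(with_keyword=False),  # full-text search only
}
print(properties)
```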
6. Sorting Fields – Pick the Right Type
Sorting numeric values stored in a keyword field produces lexical ordering (e.g., 100 before 99). Use numeric types (long, integer) for numbers and date for timestamps.
Numeric sorting: use long or integer.
Time sorting: use date.
Result: faster sorting and lower memory usage.
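The ordering problem is easy to reproduce outside Elasticsearch. The snippet below compares string and integer sorting in plain Python, then shows a mapping and sort clause using the recommended types; the field names `price` and `created_at` are hypothetical.

```python
prices_as_strings = ["99", "100", "25"]

# Strings compare character by character, so "100" sorts before "25" and "99".
lexical = sorted(prices_as_strings)
numeric = sorted(int(p) for p in prices_as_strings)

print(lexical)  # ['100', '25', '99']
print(numeric)  # [25, 99, 100]

# Hypothetical fields mapped with the recommended types.
mapping = {
    "mappings": {
        "properties": {
            "price": {"type": "long"},
            "created_at": {"type": "date"},
        }
    }
}
sort_clause = {"sort": [{"price": "asc"}, {"created_at": "desc"}]}
```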
Query Optimization: Balancing Speed and Accuracy
7. Wildcard (Fuzzy‑Match) Queries – Proper Usage
Before Elasticsearch 7.9, wildcard queries were a performance trap: they are evaluated like regular expressions, and a pattern with a leading wildcard has to scan every term in the field. From 7.9 onward, map such fields with the dedicated wildcard field type, which uses optimized n‑gram and binary doc‑value mechanisms for much better performance.
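A minimal sketch of what that looks like, assuming a hypothetical `sku` field on a cluster running 7.9 or later. The dicts are the mapping and search bodies only.

```python
import json

# Hypothetical field name; the "wildcard" type requires Elasticsearch 7.9+.
mapping = {
    "mappings": {
        "properties": {
            "sku": {"type": "wildcard"},
        }
    }
}

# A pattern with leading and trailing wildcards, the case that used to hurt.
query = {"query": {"wildcard": {"sku": {"value": "*ABC*"}}}}

print(json.dumps(query))
```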
“For a detailed comparison of wildcard before and after ES 7.9, see my previous article.”
8. Pagination – Avoid Deep Pagination Pitfalls
Deep pagination hurts performance. Preferred approaches:
Shallow pagination: use from/size for the first few pages (by default, from + size cannot exceed 10,000, the index.max_result_window setting).
Scroll: suitable for large data exports, but requires managing scroll_id and consumes more resources.
search_after: paginate based on the sort values of the last hit on the previous page; it cannot jump to arbitrary pages and adds server load if used frequently.
Business‑level design to avoid deep pagination is usually the best solution.
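The search_after flow can be sketched as a request builder. The sort fields `created_at` and `order_no` are hypothetical; the second one acts as a tiebreaker so that the sort values identify a unique position. The "last hit" below is hand-written sample data, not real output.

```python
from typing import Optional


def build_page_request(page_size: int, last_sort: Optional[list] = None) -> dict:
    """Build a search body; pass the previous page's last sort values to continue."""
    body = {
        "size": page_size,
        # Hypothetical fields; the second key is a tiebreaker.
        "sort": [{"created_at": "desc"}, {"order_no": "asc"}],
    }
    if last_sort is not None:
        body["search_after"] = last_sort
    return body


first_page = build_page_request(20)

# Pretend the last hit of page one carried these values in its "sort" array.
last_hit = {"_id": "a1", "sort": [1701388800000, "ORD-0020"]}
second_page = build_page_request(20, last_hit["sort"])

print(second_page)
```

Each page request must reuse the same sort clause, otherwise the cursor values no longer line up.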
Cluster Management: Ensuring Stable Operation
9. Index Lifecycle – Automated Maintenance
Log data grows continuously. Recommended practice:
Create daily indices (e.g., log_20231201).
Set retention policies (e.g., keep 7 days or 30 days).
Combine with index templates for automated management.
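The naming and retention rules above can be sketched in a few lines. This is a hand-rolled illustration with hypothetical helper names, not Elasticsearch's built-in lifecycle tooling; it only computes which index names fall outside the retention window.

```python
from datetime import date, timedelta


def daily_index_name(day: date, prefix: str = "log_") -> str:
    """Name a daily index, e.g. log_20231201."""
    return f"{prefix}{day:%Y%m%d}"


def expired_indices(existing: list, today: date, keep_days: int,
                    prefix: str = "log_") -> list:
    """Return the daily indices that are older than the retention window."""
    keep = {daily_index_name(today - timedelta(days=n), prefix)
            for n in range(keep_days)}
    return sorted(name for name in existing
                  if name.startswith(prefix) and name not in keep)


today = date(2023, 12, 1)
existing = [daily_index_name(today - timedelta(days=n)) for n in range(10)]
to_delete = expired_indices(existing, today, keep_days=7)
print(to_delete)  # ['log_20231122', 'log_20231123', 'log_20231124']
```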
10. Near‑Real‑Time Refresh – Understanding the Mechanism
Elasticsearch refreshes an index every second by default, balancing real‑time searchability and write performance. Adjust refresh_interval based on workload: keep 1 s for high‑freshness needs, increase the interval for heavy write loads.
If immediate visibility after a write is required, either let the front‑end display the newly submitted data and query later, or delay the query by about 1.5 seconds.
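Changing the interval is a one-line index setting. The sketch below builds the body for an index settings update; the `30s` value for write-heavy periods is an illustrative assumption, not a recommendation from the text.

```python
def refresh_settings(interval: str) -> dict:
    """Build a body for PUT /<index>/_settings that sets refresh_interval."""
    return {"index": {"refresh_interval": interval}}


near_real_time = refresh_settings("1s")  # the default behaviour
bulk_load = refresh_settings("30s")      # illustrative value for heavy writes

print(near_real_time)
print(bulk_load)
```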
11. Memory Configuration – The 32 GB Truth
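Java’s compressed ordinary object pointers (Compressed OOPs) only work for heaps below roughly 32 GB; above that threshold the JVM falls back to full‑width pointers and memory waste increases. Recommended node memory allocation: give the JVM heap roughly 50 % of physical RAM, but keep it below that threshold, and leave the rest to the OS and its file system cache.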
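The sizing rule reduces to a small calculation. The 31 GB cap below is my assumption for a conservative margin under the compressed-pointer threshold; the exact cutoff depends on the JVM.

```python
HEAP_CAP_GB = 31  # assumption: a safe margin under the ~32 GB threshold


def recommended_heap_gb(physical_ram_gb: int) -> int:
    """Half of physical RAM, capped below the compressed-oops limit."""
    return min(physical_ram_gb // 2, HEAP_CAP_GB)


for ram in (16, 64, 128):
    print(f"{ram} GB RAM -> {recommended_heap_gb(ram)} GB heap")
```

On large machines the cap wins, which is why adding RAM beyond about 64 GB per node mostly benefits the file system cache rather than the heap.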
12. Elasticsearch vs. Database – Clear Separation of Duties
Store searchable fields and document IDs in Elasticsearch, while keeping the full business data in a relational database. Query flow: ES returns IDs, the application fetches full details from the database, achieving both fast search and strong consistency.
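The query flow can be sketched end to end with stand-ins: `search_ids` is a hypothetical placeholder for the Elasticsearch call, and an in-memory SQLite table plays the relational database. The table, columns, and sample rows are invented for illustration.

```python
import sqlite3


def search_ids(keyword: str) -> list:
    """Stand-in for an Elasticsearch query that returns matching IDs
    in relevance order. A real implementation would call the search API."""
    fake_index = {"headphones": [3, 1]}
    return fake_index.get(keyword, [])


def fetch_orders(conn: sqlite3.Connection, ids: list) -> list:
    """Load full rows from the database, keeping the search ranking."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title, price FROM orders WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # SQL does not preserve the IN-list order, so re-apply the search order.
    return [by_id[i] for i in ids if i in by_id]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Wired headphones", 19.9), (2, "Phone case", 5.0),
     (3, "Wireless headphones", 59.0)],
)
results = fetch_orders(conn, search_ids("headphones"))
print(results)
```

Skipping IDs that are missing from the database is deliberate: a document can linger in the search index briefly after its row is deleted.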
13. Nested Objects – Preserve Data Relationships
When indexing array‑like data (e.g., product specifications), using a plain object type flattens the structure and breaks relationships. Use the nested type to keep each array element independent, ensuring accurate queries.
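The lost relationship is easy to demonstrate in plain Python. The sketch below mimics how an array of objects is stored as one value list per key, then compares that with per-element matching, which is what nested preserves. The `specs` data and function names are hypothetical.

```python
specs = [
    {"color": "red", "size": "L"},
    {"color": "blue", "size": "M"},
]


def flatten(objects: list) -> dict:
    """Mimic a plain object field: one list of values per key."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(key, []).append(value)
    return flat


def matches_flattened(objects: list, color: str, size: str) -> bool:
    """Each condition is checked against the whole array, not one element."""
    flat = flatten(objects)
    return color in flat.get("color", []) and size in flat.get("size", [])


def matches_nested(objects: list, color: str, size: str) -> bool:
    """Both conditions must hold inside the same element."""
    return any(o.get("color") == color and o.get("size") == size for o in objects)


# No product variant is red AND size M, yet the flattened form matches.
print(matches_flattened(specs, "red", "M"))  # True (false positive)
print(matches_nested(specs, "red", "M"))     # False
```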
14. Replica Configuration – Balancing Read/Write
Replicas improve search capacity but increase write load. Typical recommendation: one replica for most scenarios; increase only under heavy query pressure, being aware that more replicas raise write overhead.
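The write cost follows from simple arithmetic: every document is indexed once on the primary and once on each replica. A small sketch, with hypothetical helper names and a 3-primary index as the example:

```python
def total_shard_copies(primaries: int, replicas: int) -> int:
    """Each primary shard plus its replicas must index every write."""
    return primaries * (1 + replicas)


def replica_settings(replicas: int) -> dict:
    """Build a body for PUT /<index>/_settings that sets the replica count."""
    return {"index": {"number_of_replicas": replicas}}


for replicas in (1, 2):
    print(replicas, "replica(s) ->", total_shard_copies(3, replicas), "shard copies")
```

Unlike the primary shard count, the replica count can be changed on a live index, so it is reasonable to start with one and raise it only when query load demands it.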
Conclusion
The key takeaway is that understanding the underlying principles of Elasticsearch—its indexing model, routing, shard management, and refresh mechanics—is far more valuable than memorizing commands. With solid fundamentals, you can adapt the platform to any new business challenge.
Rare Earth Juejin Tech Community
Juejin, a tech community that helps developers grow.
