Elasticsearch Optimization Practices for Large-Scale Data Platforms
This article explains the architecture of Elasticsearch and Lucene, outlines common performance bottlenecks, and presents concrete indexing and query optimizations (shard routing, refresh intervals, doc values, and hardware choices) that keep query responses within seconds on billions of records.
1. Introduction
Our data platform has gone through three major versions and ran into many typical problems along the way. This article distills the resulting notes, with a focus on Elasticsearch (ES) optimization.
2. Requirements
Project Background
In a business system, some tables generate over a hundred million rows per day; data is partitioned by day, retained for three months, and cross‑month queries are needed.
Improvement Goals
1) Support queries that span months, and export of more than one year of historical data. 2) Return results for conditional queries within seconds.
3. ES Retrieval Principles
3.1 ES and Lucene Basics
Understanding component fundamentals is essential for pinpointing bottlenecks. ES relies on Lucene for storage and search.
Basic Concepts
Cluster: a group of nodes.
Node: a single service instance within the cluster.
Index: a logical namespace that maps to one or more physical shards.
Type: a document category within an index (from ES 6.x an index may contain only one type; types are deprecated in 7.x).
Document: the smallest indexable unit, a JSON object.
Shard: a Lucene index holding a subset of an index's data.
Replica: a copy of a shard, used for fault tolerance and to spread search load.
3.2 Lucene Index Implementation
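A Lucene index consists of the term dictionary, the inverted lists (postings), the forward files (stored fields), and DocValues.
DocValues store field values column by column, which is the structure that sorting, faceting, and aggregation read. They cost disk space and indexing time, so disabling DocValues on fields that are never sorted or aggregated improves performance.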
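A toy model may make the distinction concrete. The sketch below is an illustration only, not Lucene's on-disk format, and the fields state and b and their values are hypothetical. The inverted index maps a term to the documents that contain it, which is what a term query reads; DocValues map each document to its value, which is what sorting reads. A field that is only ever filtered on never touches the second structure, which is why its DocValues can be dropped.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical documents; the doc id is the list position.
docs = [
    {"state": "active", "b": 30},
    {"state": "closed", "b": 10},
    {"state": "active", "b": 20},
]

def build_inverted_index(docs: List[dict], field: str) -> Dict[str, List[int]]:
    """Term -> ascending list of doc ids (the structure a term query reads)."""
    postings: Dict[str, List[int]] = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        postings[doc[field]].append(doc_id)
    return dict(postings)

def build_doc_values(docs: List[dict], field: str) -> List[int]:
    """Doc id -> value, stored as one column (the structure sorting reads)."""
    return [doc[field] for doc in docs]

inverted = build_inverted_index(docs, "state")
b_column = build_doc_values(docs, "b")

# Term query: a single lookup instead of scanning every document.
hits = inverted["active"]
# Sort the hits by b using only the column, without loading whole documents.
hits_sorted = sorted(hits, key=lambda doc_id: b_column[doc_id])
print(hits, hits_sorted)
```

Here hits is [0, 2] and hits_sorted is [2, 0], because document 2 has the smaller value of b.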
3.3 ES Index and Search Sharding
Each document is assigned to a shard by shard = hash(_routing) % number_of_primary_shards. By default _routing is the document ID and the hash is MurmurHash3. Setting a custom _routing value places related documents on the same shard, so a query that passes the same routing value only has to search that one shard.
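The routing rule can be sketched in a few lines. This is a simplified model under stated assumptions: it uses CRC32 from the standard library as a stand-in for MurmurHash3, so the shard numbers will not match a real cluster, and the names route, shards_to_search, and the order/customer IDs are hypothetical. What it does show correctly is the mechanism: the same routing value always yields the same shard, so a routed query can skip every other shard.

```python
import zlib
from typing import List, Optional

def shard_for(routing: str, num_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards.

    Stand-in hash: Elasticsearch uses MurmurHash3; CRC32 keeps this sketch
    dependency-free, so the resulting shard numbers differ from a real cluster.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

def route(doc_id: str, num_primary_shards: int, routing: Optional[str] = None) -> int:
    # Without an explicit routing value, the document ID is hashed.
    return shard_for(routing if routing is not None else doc_id, num_primary_shards)

def shards_to_search(num_primary_shards: int, routing: Optional[str] = None) -> List[int]:
    # A routed query needs only the shard its routing value maps to;
    # an unrouted query must fan out to every primary shard (or its replica).
    if routing is not None:
        return [shard_for(routing, num_primary_shards)]
    return list(range(num_primary_shards))

# All orders of one hypothetical customer land on the same shard...
customer_shards = {route(f"order-{i}", 5, routing="customer-42") for i in range(100)}
# ...so a query for that customer touches 1 shard instead of 5.
print(customer_shards, shards_to_search(5, routing="customer-42"), shards_to_search(5))
```

Because the modulus is the number of primary shards, that number is fixed when the index is created; changing it means reindexing (or using the shrink/split APIs), since otherwise existing documents would no longer be found on the shard the formula points to.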
4. Optimization Cases
In our case, queries filter on specific fields and need no full-text search, which is what makes it possible for queries over billions of rows to return within seconds.
4.1 Indexing Performance
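1) Write in batches through the bulk API (hundreds to thousands of documents per request).
2) Write from multiple threads, with the thread count matched to the number of machines.
3) During bulk ingestion, raise refresh_interval or disable refresh entirely by setting it to "-1"; once the load finishes, restore the interval and trigger a manual refresh.
4) Leave about 50% of system memory to the OS file system cache that Lucene reads its files through, i.e. keep the JVM heap at no more than half of RAM; nodes often need more than 64 GB of RAM.
5) Use SSDs, which handle random I/O far better than HDDs.
6) Use custom document IDs aligned with the HBase rowkeys, so each ES document maps directly to one HBase row.
7) Tune segment merging for the disk type: adjust indices.store.throttle.max_bytes_per_sec (older ES versions only; later versions throttle merges automatically) and index.merge.scheduler.max_thread_count.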
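Items 1, 3, and 6 can be sketched together. The function below is a hypothetical helper, not part of any client library: it turns rows into newline-delimited JSON bodies for POST /_bulk, batch_size documents at a time, and uses an assumed rowkey column (copied from HBase) as the document _id. The two settings dictionaries are the bodies one would send to PUT /<index>/_settings before and after the load. Action lines are written in 7.x style; on ES 6.x they would also need "_type": "data".

```python
import json
from typing import Iterable, Iterator, List

def bulk_bodies(rows: Iterable[dict], index: str, batch_size: int = 1000,
                id_field: str = "rowkey") -> Iterator[str]:
    """Yield NDJSON bodies for POST /_bulk, batch_size documents each.

    id_field is a hypothetical column holding the HBase rowkey; it becomes the
    document _id so each ES document maps to exactly one HBase row.
    """
    lines: List[str] = []
    for row in rows:
        # Action line (7.x style; on 6.x also add "_type": "data").
        lines.append(json.dumps({"index": {"_index": index, "_id": row[id_field]}}))
        lines.append(json.dumps(row))  # source line
        if len(lines) == 2 * batch_size:
            yield "\n".join(lines) + "\n"  # a bulk body must end with a newline
            lines = []
    if lines:
        yield "\n".join(lines) + "\n"  # final, possibly smaller batch

# Bodies for PUT /<index>/_settings around a bulk load.
BULK_LOAD_SETTINGS = {"index": {"refresh_interval": "-1"}}  # no periodic refresh
NORMAL_SETTINGS = {"index": {"refresh_interval": "1s"}}     # the ES default

rows = ({"rowkey": f"r{i}", "v": i} for i in range(5))
for body in bulk_bodies(rows, "idx", batch_size=2):
    print(body, end="")
```

The intended order is: apply BULK_LOAD_SETTINGS, send the bodies to POST /_bulk from several writer threads, apply NORMAL_SETTINGS (or whatever interval the index normally uses), then call POST /<index>/_refresh so the new documents become searchable.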
4.2 Search Performance
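1) Disable DocValues on fields that are never sorted or aggregated.
2) Map fields that are only matched exactly as keyword rather than a numeric type: keyword fields are optimized for term queries, numeric fields for range queries.
3) Limit _source to the fields that must be returned (for example with _source includes) instead of storing every field.
4) Run exact-match conditions in filter context (a bool filter clause or a constant_score query) to skip relevance scoring; filter results can also be cached.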
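A query body that follows items 2 to 4 might look like the one built below. This is a sketch: filter_search_body is a hypothetical helper, and the fields state and b and their values are illustrative. Every condition sits in a bool filter clause, so no scores are computed; term conditions target keyword fields and range conditions target numeric ones; _source trims the response to the requested fields; and size defaults to 100 rows.

```python
from typing import Dict, List, Optional

def filter_search_body(terms: Dict[str, str],
                       ranges: Optional[Dict[str, Dict[str, int]]] = None,
                       source_fields: Optional[List[str]] = None,
                       size: int = 100) -> dict:
    """Build a search body whose conditions all run in filter context."""
    # Exact matches on keyword fields.
    clauses: List[dict] = [{"term": {field: value}} for field, value in terms.items()]
    # Range conditions, which numeric fields are optimized for.
    for field, bounds in (ranges or {}).items():
        clauses.append({"range": {field: bounds}})
    body: dict = {
        "size": size,
        # Only a filter clause: no relevance scores are computed and the filter
        # results are cacheable. Wrapping the conditions in constant_score would
        # also skip scoring.
        "query": {"bool": {"filter": clauses}},
    }
    if source_fields is not None:
        body["_source"] = source_fields  # return only the fields the caller needs
    return body

body = filter_search_body({"state": "active"}, ranges={"b": {"gte": 10}},
                          source_fields=["state", "b"])
print(body)
```

The resulting dictionary is what would be sent, serialized as JSON, to POST /<index>/_search.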
4.3 Pagination
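Deep pagination with from + size is expensive: every shard must collect its top from + size hits and send them to the coordinating node, which sorts them all and discards everything before from. ES therefore caps from + size at 10,000 by default (index.max_result_window). search_after avoids this by passing the sort values of the last hit on the previous page, so each shard returns only size hits, although it cannot jump to an arbitrary page. The scroll API suits bulk export of large result sets, but it keeps a search context open on every shard and is not intended for real-time user queries.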
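The search_after loop can be simulated in memory. In this sketch the fields ts and id are hypothetical: ts is the primary sort key and id is a unique keyword field used as a tiebreaker, since search_after needs a total order to avoid skipping or repeating hits. search_after_page mimics what a request with "sort": [{"ts": "asc"}, {"id": "asc"}] and "search_after": [ts, id] returns, and next_request builds that request body.

```python
from typing import Iterator, List, Optional, Tuple

Doc = Tuple[int, str]  # (ts, id): the sort values of one hit

def search_after_page(docs: List[Doc], size: int, after: Optional[Doc] = None) -> List[Doc]:
    """Next page in (ts, id) order, strictly after the given sort values."""
    ordered = sorted(docs)
    if after is not None:
        ordered = [d for d in ordered if d > after]  # tuples compare field by field
    return ordered[:size]

def iterate_all(docs: List[Doc], size: int) -> Iterator[List[Doc]]:
    after: Optional[Doc] = None
    while True:
        page = search_after_page(docs, size, after)
        if not page:
            return
        yield page
        after = page[-1]  # sort values of the last hit drive the next request

def next_request(size: int, after: Optional[Doc] = None) -> dict:
    """Request body for the next page; from is never used."""
    body: dict = {"size": size, "query": {"match_all": {}},
                  "sort": [{"ts": "asc"}, {"id": "asc"}]}
    if after is not None:
        body["search_after"] = list(after)
    return body

docs = [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e")]
for page in iterate_all(docs, 2):
    print(page)
```

The loop prints three pages: [(1, 'a'), (1, 'b')], [(2, 'c'), (3, 'd')], and [(3, 'e')]. Each step only needs size hits per shard, which is why the cost stays flat however deep the paging goes.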
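The following mapping (ES 6.x syntax with the single type data) applies the settings from 4.2: dynamic mapping is off, _source keeps only the listed fields, and DocValues are disabled on the state keyword field.
{
  "mappings": {
    "data": {
      "dynamic": "false",
      "_source": {"includes": ["XXX"]},
      "properties": {
        "state": {"type": "keyword", "doc_values": false},
        "b": {"type": "long"}
      }
    }
  },
  "settings": {......}
}
5. Performance Testing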
Benchmarks include single‑node tests with 50 M–100 M records, cluster tests up to 3 B records, and comparisons of SSD vs. HDD I/O.
6. Production Impact
The platform now handles billions of records with 100‑row queries returning in under 3 seconds, and scaling can be achieved by adding nodes.
Author: mikevictor Source: http://suo.im/5Aytfg
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Architect's Tech Stack
Java backend, microservices, distributed systems, containerized programming, and more.
