Elasticsearch Optimization Practices for Large‑Scale Data Platforms
This article presents a comprehensive guide to optimizing Elasticsearch for massive data volumes, covering Lucene fundamentals, index and shard design, practical performance‑tuning techniques, and real‑world testing results that enable cross‑month queries and sub‑second response times.
1. Introduction
The data platform has gone through three versions and run into many common problems along the way. This article shares the cleaned-up notes on Elasticsearch (ES) optimization; the design improvements for HBase and Hadoop are left to other sources.
2. Requirements
Project background:
In one business system, some tables grow by more than a hundred million rows per day. The data is partitioned by day, but a query can only cover a single day at a time, and hardware constraints mean the database retains only three months of data. Sharding the database further would be costly.
Improvement goals:
Enable cross‑month queries and support export of more than one year of historical data.
Return condition‑based query results within seconds.
3. Elasticsearch Retrieval Principles
3.1 Basic structure of ES and Lucene
Understanding component fundamentals is essential for locating bottlenecks. An ES index consists of one or more Lucene shards, each shard being a Lucene instance. Key concepts include:
Cluster: a group of nodes.
Node: a single service unit in the cluster.
Index: logical namespace for one or more physical shards.
Type: classification within an index (only one type per index is allowed from ES 6.x onward, and types are deprecated in 7.x).
Document: the smallest indexable JSON unit.
Shard: the basic work unit storing a subset of data; each shard is a Lucene instance.
Replica: a copy of a shard for safety and load balancing.
Lucene separates indexing (analyzers, tokenizers, filters) from searching (query parsers). A Lucene index contains multiple segments, each holding many documents and fields.
3.2 Lucene index implementation
Lucene index files are organized into dictionaries, inverted lists, forward files, and DocValues. The following diagram (originally from the Lucene documentation) illustrates the structure.
Note: Information compiled from the official Lucene documentation: http://lucene.apache.org/core/7_2_1/core/org/apache/lucene/codecs/lucene70/package-summary.html#package.description
Random-access reads on .fdt files (stored fields) are costly; the .tim (term dictionary) and .doc (postings) files benefit from SSD storage. Scoring is also expensive and can be disabled when it is not needed.
About DocValues
Inverted indexes quickly locate document IDs for a term, but sorting, grouping, or aggregating requires retrieving field values for those IDs. DocValues provide a column‑store structure that enables fast lookup by doc ID. ES enables DocValues by default for all fields except analyzed strings; they can be turned off to save resources.
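To make the contrast concrete, here is a toy sketch in Python. It is not Lucene's on-disk format: the documents, field names, and helper functions are invented for illustration. It only shows the two access directions, term to doc IDs (inverted index) and doc ID to value (a DocValues-style column).

```python
from collections import defaultdict

# Hypothetical documents; the list position doubles as the doc ID.
docs = [
    {"state": "paid", "amount": 30},
    {"state": "open", "amount": 10},
    {"state": "paid", "amount": 20},
]

def build_inverted_index(docs, field):
    """term -> list of doc IDs (what a term query consults)."""
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        index[doc[field]].append(doc_id)
    return dict(index)

def build_doc_values(docs, field):
    """doc ID -> value, kept as one column (what sorting and aggregation consult)."""
    return [doc[field] for doc in docs]

state_index = build_inverted_index(docs, "state")
amount_column = build_doc_values(docs, "amount")

def search_sorted(term):
    """Find matching doc IDs via the inverted index, then order them by the column."""
    matches = state_index.get(term, [])
    return sorted(matches, key=lambda doc_id: amount_column[doc_id])

print(search_sorted("paid"))  # doc IDs of "paid" documents, ordered by amount
```

Without the column, the sort step would have to load each matching document to read its amount, which is the cost DocValues avoid.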
3.3 ES index and shard routing
An ES index is composed of one or more Lucene indices, each consisting of multiple segments. Document routing determines the target primary shard using the formula:
shard = hash(routing) % number_of_primary_shards
By default, routing uses the document ID (MurmurHash3). Supplying a custom _routing parameter groups related documents on the same shard, reducing search work and improving performance.
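The routing formula can be sketched in a few lines of Python. This is an illustration only: zlib.crc32 stands in for MurmurHash3 because it ships with the standard library, so the shard numbers it prints will not match what a real cluster computes. The function name and the customer IDs are hypothetical.

```python
import zlib

def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards.

    zlib.crc32 is a stand-in hash; Elasticsearch itself uses MurmurHash3,
    so real shard assignments will differ from these values.
    """
    if number_of_primary_shards < 1:
        raise ValueError("need at least one primary shard")
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

# With a custom routing value (a hypothetical customer ID), every document
# for the same customer maps to the same shard.
orders = [
    ("order-1", "customer-42"),
    ("order-2", "customer-42"),
    ("order-3", "customer-7"),
]
for doc_id, customer in orders:
    print(doc_id, "->", pick_shard(customer, 5))
```

The modulo also shows why the number of primary shards cannot be changed on an existing index without reindexing: a different divisor would send existing routing values to different shards.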
4. Optimization Cases
In our case, queries are limited to fixed fields (no full‑text search), which is a prerequisite for achieving sub‑second responses on billions of rows.
ES stores only the HBase rowkey, not the actual payload.
Actual data resides in HBase and is retrieved via the rowkey.
Performance‑tuning recommendations follow the official ES guide (e.g., https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html).
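The split between ES and HBase amounts to a two-step lookup. The sketch below uses in-memory stand-ins invented for illustration (es_index, hbase_table, and both function names are hypothetical); a real deployment would go through the Elasticsearch and HBase clients instead.

```python
# In-memory stand-ins: ES holds only rowkeys plus the filterable fields,
# HBase holds the full payload keyed by rowkey.
es_index = [
    {"rowkey": "20181101-0001", "state": "paid"},
    {"rowkey": "20181101-0002", "state": "open"},
    {"rowkey": "20181215-0003", "state": "paid"},
]
hbase_table = {
    "20181101-0001": {"amount": 30, "buyer": "alice"},
    "20181101-0002": {"amount": 10, "buyer": "bob"},
    "20181215-0003": {"amount": 20, "buyer": "carol"},
}

def search_rowkeys(state):
    """Step 1: filter on the indexed fields and return only rowkeys."""
    return [hit["rowkey"] for hit in es_index if hit["state"] == state]

def fetch_rows(rowkeys):
    """Step 2: load the payload for each rowkey (a batched get in practice)."""
    return [hbase_table[key] for key in rowkeys if key in hbase_table]

rows = fetch_rows(search_rowkeys("paid"))
print(rows)
```

Keeping the payload out of ES keeps the index small, at the price of a second round trip per page of results.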
4.1 Index performance tuning
Bulk writes (hundreds to thousands of records per request).
Multi‑threaded ingestion, with thread count roughly matching the number of nodes.
Lengthen the refresh interval, or disable refresh entirely during a bulk load by setting "refresh_interval": "-1", then restore the interval and refresh manually once the load finishes.
Allocate ~50% of system memory to Lucene file cache; each node should have ample RAM (64 GB+ recommended).
Use SSDs instead of HDD RAID for lower random I/O latency.
Prefer auto‑generated IDs; in our case we use a custom key aligned with HBase rowkey to enable delete/update without significant performance loss.
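Control segment merging: limit merge throughput (e.g., indices.store.throttle.max_bytes_per_sec: "200mb"; this setting applies to older ES releases, and newer ones throttle merges automatically) and adjust the merge thread count (index.merge.scheduler.max_thread_count) based on storage type.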
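As a rough sketch of the bulk-write and refresh recommendations above, the following Python builds request bodies with the standard library only and sends nothing. The index name, the rowkey field, the batch size of 1000, and the "30s" steady-state interval are all assumptions. On versions that still use mapping types, the type also has to appear in the URL or in each action line.

```python
import json

def chunked(items, size):
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_bulk_body(index_name, docs, id_field="rowkey"):
    """Render one _bulk body: an action line plus a source line per document."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index_name, "_id": doc[id_field]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk API expects a trailing newline

# Bodies for PUT /<index>/_settings before and after the load.
DISABLE_REFRESH = {"index": {"refresh_interval": "-1"}}
RESTORE_REFRESH = {"index": {"refresh_interval": "30s"}}  # assumed steady-state value

# Hypothetical records whose ID is the HBase rowkey.
docs = [{"rowkey": f"20181101-{i:04d}", "state": "paid"} for i in range(2500)]
bodies = [build_bulk_body("orders", batch) for batch in chunked(docs, 1000)]
print(len(bodies), "bulk requests")
```

Each body would be POSTed to the _bulk endpoint, ideally from several worker threads, with DISABLE_REFRESH applied before the first request and RESTORE_REFRESH after the last.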
4.2 Search performance tuning
Disable DocValues for fields that do not require sorting or aggregation.
Prefer the keyword type over numeric types for fields that are only matched exactly: keyword fields are optimized for term queries, while numeric fields are optimized for range queries.
Exclude fields that are never needed in the response from _source to save disk space.
Disable scoring when not required; use filter or constant_score queries.
Pagination strategies: from + size becomes expensive for deep pages and is capped by index.max_result_window (default 10,000). search_after continues from the sort values of the previous page's last hit, which keeps deep pagination efficient. scroll suits large result sets but requires keeping a scroll_id alive between requests.
Introduce a long field that encodes timestamp + ID for efficient sorting; forward and inverted index performance is comparable.
Allocate CPUs with ≥16 cores for sorting‑heavy workloads.
Set "merge.policy.expunge_deletes_allowed": "0" to force deletion of marked records during merges.
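The filter-only query, the composite sort field, and search_after fit together as in the sketch below. It only builds request bodies and makes several assumptions: the field name sort_key, the 20-bit sequence width, and the sample hit are hypothetical, and the bit layout is one possible encoding rather than the one used on the platform.

```python
ID_BITS = 20  # assumed: up to 2**20 records per millisecond

def make_sort_key(timestamp_ms: int, sequence: int) -> int:
    """Pack a millisecond timestamp and a per-millisecond sequence into one long."""
    if not 0 <= sequence < (1 << ID_BITS):
        raise ValueError("sequence does not fit in ID_BITS")
    return (timestamp_ms << ID_BITS) | sequence

def build_page_request(state, page_size, last_hit=None):
    """Filter-only query sorted by sort_key; pass the previous page's last hit to continue."""
    body = {
        "size": page_size,
        # constant_score + filter skips relevance scoring entirely.
        "query": {"constant_score": {"filter": {"term": {"state": state}}}},
        "sort": [{"sort_key": "asc"}],
    }
    if last_hit is not None:
        # Each hit of a sorted search carries its sort values under "sort".
        body["search_after"] = last_hit["sort"]
    return body

first_page = build_page_request("paid", 100)
# A made-up last hit standing in for a real search response.
fake_last_hit = {"_id": "20181101-0100", "sort": [make_sort_key(1541030400000, 100)]}
second_page = build_page_request("paid", 100, fake_last_hit)
print(second_page)
```

Because the packed key is unique per record, a single sort field is enough to give search_after a stable position. The sort field itself needs DocValues enabled, unlike the filter-only fields.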
{
  "mappings": {
    "data": {
      "dynamic": "false",
      "_source": {
        "includes": ["XXX"] // store only required fields in _source
      },
      "properties": {
        "state": {
          "type": "keyword", // use keyword instead of int for exact match
          "doc_values": false // disable unnecessary doc values
        },
        "b": {
          "type": "long" // use long/int for range queries
        }
      }
    }
  },
  "settings": {......}
}
5. Performance Testing
Single-node test with 50 M–100 M records to assess single-node capacity.
Cluster test with 100 M–3 B records to evaluate disk I/O, memory, CPU, and network consumption.
Random query combinations across data volumes to measure latency.
Compare SSD versus HDD performance.
Benchmarking is essential to verify that the optimizations deliver measurable gains; without it, performance regressions are hard to detect.
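A minimal latency harness for the random-query test might look like the following Python sketch. The run_query callable, the query dictionaries, and the reported percentiles are assumptions made for illustration; fake_run_query merely burns a little CPU in place of a real search call.

```python
import random
import statistics
import time

def measure_latency(run_query, queries, repetitions=100, seed=0):
    """Run randomly chosen queries and summarize wall-clock latency in milliseconds."""
    rng = random.Random(seed)  # fixed seed so runs are comparable
    samples = []
    for _ in range(repetitions):
        query = rng.choice(queries)
        start = time.perf_counter()
        run_query(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "count": len(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }

def fake_run_query(query):
    """Stand-in for a real search call."""
    sum(range(1000))

report = measure_latency(
    fake_run_query,
    [{"state": "paid"}, {"state": "open"}],
    repetitions=50,
)
print(report)
```

Running the same seeded mix against each data volume, and against SSD and HDD nodes, gives figures that can be compared from one run to the next.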
6. Production Results
The platform now handles billions of rows, returning 100 records within three seconds, with fast pagination. Future bottlenecks can be addressed by scaling out additional nodes.
Author: mikevictor Source: https://www.cnblogs.com/mikevictor07/p/10006553.html
