How 3 Simple Tweaks Doubled Elasticsearch Scan Performance on 40M Docs
The article details a real‑world case of scanning over 40 million Elasticsearch documents, identifies four performance bottlenecks, and presents three concrete optimizations—_source filtering, precise index targeting, and batch‑size tuning—that together cut processing time in half and raise CPU utilization from 25% to 85%.
Background
A data‑processing project required a full scan of more than 40 million Elasticsearch documents. The initial single‑threaded approach caused the task to run for many hours, and incremental writes plus soft‑deleted fields during the scan added further complexity.
Problem Analysis
Long query response time: Each query returned a large payload, and most latency was spent on data transfer rather than the search itself.
Low resource utilization: Single-threaded execution left most CPU cores idle, wasting server resources.
Inefficient index traversal: A date alias mapped to multiple underlying indices forced Elasticsearch to merge results across indices, adding unnecessary overhead.
Field redundancy: Business logic needed only a few core fields, yet queries returned all document fields, including large text fields, wasting bandwidth and memory.
Solution Design
Strategy 1 – Field Reduction via _source Filtering
Only the required fields are returned using _source filtering. The number of fields per document dropped from over 20 to just 4, reducing data transfer by more than 70%.
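As a rough illustration of why this helps, the sketch below adds a _source list to a search body and compares the serialized size of a full document with the same document reduced to the four fields used in the article's optimized query. The sample document, its large text field, and the helper names are invented for this example; real savings depend on the actual mapping.

```python
import json

# The four fields kept by the article's optimized query.
REQUIRED_FIELDS = ["id", "status", "create_time", "update_time"]


def with_source_filter(body, fields):
    """Return a copy of a search body that asks Elasticsearch for only `fields`."""
    filtered = dict(body)
    filtered["_source"] = list(fields)
    return filtered


def project(doc, fields):
    """Client-side stand-in for what _source filtering returns for one document."""
    return {name: doc[name] for name in fields if name in doc}


# Hypothetical document: the large text field is invented for this example.
sample_doc = {
    "id": "a1",
    "status": "active",
    "create_time": "2024-01-01T08:00:00",
    "update_time": "2024-01-01T09:00:00",
    "description": "lorem ipsum " * 200,
}

full_size = len(json.dumps(sample_doc))
slim_size = len(json.dumps(project(sample_doc, REQUIRED_FIELDS)))
body = with_source_filter({"query": {"match_all": {}}}, REQUIRED_FIELDS)
print(body["_source"], full_size, slim_size)
```

The filtering itself happens on the Elasticsearch side; project() only mimics the effect so the size difference can be seen without a cluster.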
Strategy 2 – Precise Index Targeting
Instead of querying a date alias that spans many indices, the query now specifies the concrete index name (e.g., data_20240101). This eliminates the merge overhead and noticeably speeds up queries.
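A minimal sketch of how concrete index names might be derived, assuming the daily indices follow the data_YYYYMMDD pattern implied by the example above. The helper names and the "data_" prefix default are assumptions, not part of the original project.

```python
from datetime import date, timedelta


def index_for_day(day, prefix="data_"):
    """Concrete index name for one day, e.g. data_20240101."""
    return f"{prefix}{day:%Y%m%d}"


def indices_between(start, end, prefix="data_"):
    """Index names for every day from start to end, inclusive."""
    days = (end - start).days
    return [index_for_day(start + timedelta(days=n), prefix) for n in range(days + 1)]


print(indices_between(date(2024, 1, 1), date(2024, 1, 3)))
```

Each name can then be queried on its own, so no single request has to fan out across every index behind the alias.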
Strategy 3 – Batch‑Size Tuning
The size parameter is increased stepwise from the Elasticsearch default of 10 (the value used in the original query) to values such as 5000, staying within thread‑pool queue limits. Fewer query rounds improve overall throughput.
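The effect on the number of query rounds is simple arithmetic, sketched below for a corpus of roughly 40 million documents. The function name is illustrative, and the intermediate batch size of 1000 is just an example step.

```python
import math


def round_trips(total_docs, batch_size):
    """Number of search requests needed to page through total_docs."""
    return math.ceil(total_docs / batch_size)


TOTAL_DOCS = 40_000_000  # approximate corpus size from the article

for size in (10, 1000, 5000):
    print(size, round_trips(TOTAL_DOCS, size))
```

Larger pages mean fewer requests but bigger responses, so the increase is best done in steps while watching latency and memory. Elasticsearch also caps from + size at index.max_result_window, which is 10,000 by default.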
Implementation Details
Original DSL (reference)
GET /data_alias/_search
{
  "query": {
    "range": {
      "create_time": {
        "gte": "2024-01-01",
        "lte": "2024-01-02"
      }
    }
  },
  "size": 10,
  "from": 0
}
Optimized DSL
GET /data_20240101/_search
{
  "_source": ["id", "status", "create_time", "update_time"],
  "query": {
    "range": {
      "create_time": {
        "gte": "2024-01-01T00:00:00",
        "lte": "2024-01-01T23:59:59"
      }
    }
  },
  "size": 5000,
  "sort": [{"_id": {"order": "asc"}}],
  "search_after": ["last_doc_id"]
}
Incremental Sync Query
GET /data_20240101/_search
{
  "_source": ["id", "status", "create_time", "update_time"],
  "query": {
    "bool": {
      "must": [{
        "range": {"update_time": {"gt": "2024-01-01T10:30:00"}}
      }],
      "must_not": [{
        "term": {"status": "deleted"}
      }]
    }
  },
  "size": 5000,
  "sort": [{"update_time": {"order": "asc"}}]
}
Performance Test Results
Processing Time
Full‑scan duration dropped from 8 hours to 4 hours, halving the run time and doubling throughput.
Resource Utilization
CPU usage rose from 25% to 85%, and memory consumption became more balanced.
Cluster Load
Average query latency fell from 800 ms to 200 ms, significantly easing cluster pressure.
Data Consistency
The incremental sync mechanism ensured that newly added or updated documents were processed correctly during the full scan, preserving data integrity.
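One way such a mechanism could be driven from client code is sketched below. It rebuilds the incremental sync query from the article and pages through it with search_after, reusing the sort values of the last hit as the cursor. The search callable is a stand-in for whatever client call the project uses, and the "id" tiebreaker in the sort is an assumption that this field is sortable; neither is confirmed by the article.

```python
def incremental_body(since, after=None, size=5000):
    """Build the article's incremental sync query, optionally resuming after a hit."""
    body = {
        "_source": ["id", "status", "create_time", "update_time"],
        "query": {
            "bool": {
                "must": [{"range": {"update_time": {"gt": since}}}],
                "must_not": [{"term": {"status": "deleted"}}],
            }
        },
        "size": size,
        # "id" as a tiebreaker is an assumption; it must be a sortable field.
        "sort": [{"update_time": {"order": "asc"}}, {"id": {"order": "asc"}}],
    }
    if after is not None:
        body["search_after"] = after
    return body


def iter_changes(search, index, since, size=5000):
    """Yield changed documents page by page using search_after.

    `search` is any callable (index, body) -> Elasticsearch-style response dict.
    """
    after = None
    while True:
        response = search(index, incremental_body(since, after, size))
        hits = response["hits"]["hits"]
        if not hits:
            return
        for hit in hits:
            yield hit["_source"]
        # Each hit carries its sort values; the last one is the next cursor.
        after = hits[-1]["sort"]
```

The tiebreaker matters because several documents can share the same update_time; without a second sort key, a page boundary falling inside such a group could skip or repeat documents.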
Key Takeaways
Configure an appropriate number of threads to fully exploit system resources while avoiding excessive concurrency that could overload the cluster.
_source filtering is a simple yet powerful optimization, especially when documents contain large text fields.
Query the concrete index directly when an alias fans out to several indices and only one of them holds the data; precise targeting eliminates the cross-index merge overhead.
Plan incremental‑data handling early in large‑scale processing scenarios to maintain consistency.
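The article does not show its threading code, so the following is only one possible shape for the first takeaway: a bounded thread pool that scans one daily index per task. Splitting the work by index, the function names, and the default of four workers are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_indices(scan_one, indices, max_workers=4):
    """Run scan_one(index) for each index on a bounded thread pool.

    Returns {index: result}. max_workers caps concurrent scans so the
    cluster is not flooded with parallel searches.
    """
    indices = list(indices)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(scan_one, indices)  # preserves input order
        return dict(zip(indices, results))


def fake_scan(index):
    """Stand-in for a real per-index scan; returns a pretend document count."""
    return len(index)


print(scan_indices(fake_scan, ["data_20240101", "data_20240102"], max_workers=2))
```

Raising max_workers gradually while watching query latency and search thread-pool rejections on the cluster is a reasonable way to find the point where more concurrency stops helping.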
Mingyi World Elasticsearch
The leading WeChat public account for Elasticsearch fundamentals, advanced topics, and hands‑on practice. Join us to dive deep into the ELK Stack (Elasticsearch, Logstash, Kibana, Beats).