
How 3 Simple Tweaks Doubled Elasticsearch Scan Performance on 40M Docs

The article details a real‑world case of scanning over 40 million Elasticsearch documents, identifies four performance bottlenecks, and presents three concrete optimizations—_source filtering, precise index targeting, and batch‑size tuning—that together cut processing time in half and raise CPU utilization from 25% to 85%.

Mingyi World Elasticsearch

Background

A data‑processing project required a full scan of more than 40 million Elasticsearch documents. The initial single‑threaded approach took many hours to finish. Two factors added complexity: new documents kept being written while the scan ran, and some documents were marked as soft‑deleted through a status field.

Problem Analysis

Long query response time: Each query returned a large payload, and most latency was spent on data transfer rather than the search itself.

Low resource utilization: Single‑threaded execution left most CPU cores idle, wasting server resources.

Inefficient index traversal: A date alias mapped to multiple underlying indices forced Elasticsearch to merge results across indices, adding unnecessary overhead.

Field redundancy: Business logic needed only a few core fields, yet queries returned all document fields, including large text fields, causing bandwidth and memory waste.

Solution Design

Strategy 1 – Field Reduction via _source Filtering

Only the required fields are returned using _source filtering. The number of fields per document dropped from over 20 to just 4, reducing data transfer by more than 70%.
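As a minimal sketch of this step, the helper below adds a `_source` list to an existing search body. The four field names come from the optimized DSL in this article; the helper name `with_source_filter` and the `match_all` base query are illustrative assumptions, not the author's code.

```python
import copy

# Field names taken from the article's optimized DSL.
REQUIRED_FIELDS = ["id", "status", "create_time", "update_time"]


def with_source_filter(body, fields=REQUIRED_FIELDS):
    """Return a copy of a search body that requests only `fields` from _source."""
    filtered = copy.deepcopy(body)  # leave the caller's dict untouched
    filtered["_source"] = list(fields)
    return filtered


# Hypothetical base query, used only to show the effect of the helper.
base_query = {"query": {"match_all": {}}, "size": 5000}
slim_query = with_source_filter(base_query)
print(slim_query["_source"])
```

Keeping the field list in one constant makes it harder for a later query to fall back to returning full documents by accident.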

Strategy 2 – Precise Index Targeting

Instead of querying a date alias that spans many indices, the query now specifies the concrete index name (e.g., data_20240101). This eliminates the merge overhead and noticeably speeds up queries.
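One way to target concrete indices is to derive their names from dates. The sketch below assumes one index per day named `data_YYYYMMDD`, a pattern inferred from the single example `data_20240101`; the real naming scheme may differ, and both function names are hypothetical.

```python
from datetime import date, timedelta


def daily_index_name(day, prefix="data_"):
    """Build the concrete index name for one day, e.g. data_20240101."""
    return f"{prefix}{day:%Y%m%d}"


def daily_index_names(start, end, prefix="data_"):
    """List concrete index names for every day from start to end, inclusive."""
    span = (end - start).days
    return [daily_index_name(start + timedelta(days=i), prefix) for i in range(span + 1)]


print(daily_index_name(date(2024, 1, 1)))
print(daily_index_names(date(2024, 1, 30), date(2024, 2, 1)))
```

Each name can then be queried on its own, so a request touches only the shards of that index.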

Strategy 3 – Batch‑Size Tuning

The size parameter is increased stepwise from Elasticsearch's default of 10 to values such as 5000, staying within thread‑pool queue limits and under the index.max_result_window cap (10,000 by default). Fewer query rounds improve overall throughput.
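The effect of batch size on the number of query rounds is simple arithmetic. The sketch below uses 40,000,000 as a round stand-in for the "more than 40 million" documents mentioned above; the function name is illustrative.

```python
def round_trips(total_docs, batch_size):
    """Number of search requests needed to page through total_docs hits."""
    # Integer ceiling division avoids floating-point rounding on large counts.
    return -(-total_docs // batch_size)


TOTAL_DOCS = 40_000_000  # round figure; the article says "more than 40 million"

for size in (10, 100, 1000, 5000):
    print(f"size={size:>5}: {round_trips(TOTAL_DOCS, size):>9,} requests")
```

Under these assumptions a batch size of 10 needs 4,000,000 requests, while 5000 needs 8,000. Each request also gets heavier as size grows, which is why the value is raised in steps and checked against cluster load.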

Implementation Details

Original DSL (reference)

GET /data_alias/_search
{
  "query": {
    "range": {
      "create_time": {
        "gte": "2024-01-01",
        "lte": "2024-01-02"
      }
    }
  },
  "size": 10,
  "from": 0
}

Optimized DSL

GET /data_20240101/_search
{
  "_source": ["id", "status", "create_time", "update_time"],
  "query": {
    "range": {
      "create_time": {
        "gte": "2024-01-01T00:00:00",
        "lte": "2024-01-01T23:59:59"
      }
    }
  },
  "size": 5000,
  "sort": [{"_id": {"order": "asc"}}],
  "search_after": ["last_doc_id"]
}
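The `search_after` value in the query above is a placeholder: on each round it is replaced with the sort values of the last hit from the previous page, and it is omitted on the first page. The loop below is a minimal sketch of that paging pattern. The `search` argument is a hypothetical callable that takes a request body as a dict and returns the parsed response as a dict; it stands in for whatever client call the project uses.

```python
def scan_with_search_after(search, body):
    """Yield every hit, using the last hit's sort values as the next cursor.

    `search` is a placeholder: any callable mapping a request body (dict)
    to a parsed Elasticsearch search response (dict).
    """
    page = dict(body)
    page.pop("search_after", None)  # the first page has no cursor
    while True:
        hits = search(page)["hits"]["hits"]
        if not hits:
            return  # an empty page means the scan is complete
        yield from hits
        # Every hit of a sorted search carries its sort values under "sort".
        page = {**page, "search_after": hits[-1]["sort"]}
```

Because the cursor is derived from the data itself, the loop does not rely on `from` offsets, which become expensive and are capped for deep pages.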

Incremental Sync Query

GET /data_20240101/_search
{
  "_source": ["id", "status", "create_time", "update_time"],
  "query": {
    "bool": {
      "must": [{
        "range": {"update_time": {"gt": "2024-01-01T10:30:00"}}
      }],
      "must_not": [{
        "term": {"status": "deleted"}
      }]
    }
  },
  "size": 5000,
  "sort": [{"update_time": {"order": "asc"}}]
}
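The incremental query can be rebuilt for each sync round from a stored checkpoint. The sketch below mirrors the DSL above; the function names and the idea of using the last hit's `update_time` as the next checkpoint are assumptions for illustration, since the article does not describe how the checkpoint is tracked.

```python
def incremental_query(checkpoint, size=5000):
    """Build the incremental sync body for documents updated after `checkpoint`."""
    return {
        "_source": ["id", "status", "create_time", "update_time"],
        "query": {
            "bool": {
                "must": [{"range": {"update_time": {"gt": checkpoint}}}],
                # Skip documents flagged as soft-deleted.
                "must_not": [{"term": {"status": "deleted"}}],
            }
        },
        "size": size,
        "sort": [{"update_time": {"order": "asc"}}],
    }


def next_checkpoint(hits, current):
    """Return the update_time of the last hit, or `current` if the page is empty."""
    if not hits:
        return current
    return hits[-1]["_source"]["update_time"]


body = incremental_query("2024-01-01T10:30:00")
```

One caveat with this simple scheme: if several documents share the same `update_time` across a page boundary, a strict `gt` on the checkpoint can skip some of them, so a tiebreaker sort field or `search_after` paging may be needed.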

Performance Test Results

Processing Time

Full‑scan duration dropped from 8 hours to 4 hours. Run time was cut by 50%, which is equivalent to doubling throughput (a 100% speed‑up).

Resource Utilization

CPU usage rose from 25% to 85%, and memory consumption became more balanced.

Cluster Load

Average query latency fell from 800 ms to 200 ms, significantly easing cluster pressure.

Data Consistency

The incremental sync mechanism ensured that newly added or updated documents were processed correctly during the full scan, preserving data integrity.

Key Takeaways

Configure an appropriate number of threads to fully exploit system resources while avoiding excessive concurrency that could overload the cluster.

_source filtering is a simple yet powerful optimization, especially when documents contain large text fields.

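Query the concrete index directly when the target data lives in a known index. An alias that resolves to many indices makes Elasticsearch fan the request out and merge the results, and that is the overhead to avoid.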

Plan incremental‑data handling early in large‑scale processing scenarios to maintain consistency.
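The first takeaway, bounding the number of threads, can be sketched with a fixed-size worker pool. The article does not say how work was split across threads or how many were used; partitioning by concrete index and the default of 4 workers are assumptions, and `scan_one` is a hypothetical callable that scans a single index.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_indices_in_parallel(index_names, scan_one, max_workers=4):
    """Run scan_one(index_name) for each index with at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with index_names.
        results = list(pool.map(scan_one, index_names))
    return dict(zip(index_names, results))
```

The `max_workers` cap is the tuning knob: raising it uses more client CPU, but it also sends more concurrent searches to the cluster, so it is worth increasing gradually while watching search latency.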

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

