
Mastering Elasticsearch Slow Query Automation: Profiling, DSL Extraction, and Optimization Rules

This article explains how to automate Elasticsearch slow‑query inspection by extracting DSL from slow‑log files, deduplicating queries, using the Profile API for detailed execution analysis, and applying rule‑based optimizations such as avoiding term‑long and range‑keyword queries to improve backend performance.

Weimob Technology Center

Elasticsearch, as a document-oriented search database, offers flexible querying and aggregation, but that flexibility can lead to resource waste and performance pitfalls. To address this, an automated slow-query governance workflow is introduced, covering log cleaning, DSL extraction, profiling, and rule-based optimization.

1. Slow‑Query Inspection Logic

1.1 Log Element Cleaning

Logs are synchronized from Tencent Cloud to a local ES cluster and split by type. An ingest pipeline with a Grok expression extracts key fields:

<code>"\[%{DATA:classname}\] \[%{DATA:node}\] \[%{DATA:indexname}\]\[%{NUMBER:shard_no}\] took\[%{DATA:took}\], took_millis\[%{NUMBER:took_millis}\], total_hits\[%{DATA:hits}\], types\[%{DATA:types}\], stats\[%{DATA:stats}\], search_type\[%{DATA:search_type}\], total_shards\[%{NUMBER:total_shards}\], source\[%{DATA:Message}\], id\[%{DATA:id}\], "</code>

The resulting structured log looks like:

<code>{
    "logType": "SearchSlow",
    "took": "271.7ms",
    "total_shards": "12",
    "types": "_doc",
    "took_millis": "271",
    "Message": "{...}",
    "Ip": "9.20.83.243",
    "shard_no": "4",
    "Cluster": "test-online",
    "Time": "2022-06-27T22:34:59.428+08:00",
    "search_type": "QUERY_THEN_FETCH",
    "hits": "-1",
    "node": "node1",
    "classname": "i.s.s.fetch",
    "indexname": "test_fetch",
    "stats": "",
    "Level": "WARN",
    "id": ""
}</code>
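The Grok extraction above can be approximated in plain Python. The following is a hedged sketch, not the production pipeline: the regex mirrors the Grok expression's field layout, and the sample line is illustrative.

```python
import re

# A Python approximation of the Grok expression above; field names follow the
# pipeline, but the sample line is illustrative, not a real production log.
SLOW_LOG_RE = re.compile(
    r"\[(?P<classname>[^\]]*)\] \[(?P<node>[^\]]*)\] "
    r"\[(?P<indexname>[^\]]*)\]\[(?P<shard_no>\d+)\] "
    r"took\[(?P<took>[^\]]*)\], took_millis\[(?P<took_millis>\d+)\], "
    r"total_hits\[(?P<hits>[^\]]*)\], types\[(?P<types>[^\]]*)\], "
    r"stats\[(?P<stats>[^\]]*)\], search_type\[(?P<search_type>[^\]]*)\], "
    r"total_shards\[(?P<total_shards>\d+)\], source\[(?P<Message>.*)\], "
    r"id\[(?P<id>[^\]]*)\], "
)

def parse_slow_log(line: str) -> dict:
    """Extract the structured fields from one slow-log line."""
    m = SLOW_LOG_RE.search(line)
    return m.groupdict() if m else {}

sample = ('[i.s.s.fetch] [node1] [test_fetch][4] took[271.7ms], '
          'took_millis[271], total_hits[-1], types[_doc], stats[], '
          'search_type[QUERY_THEN_FETCH], total_shards[12], '
          'source[{"query":{"match_all":{}}}], id[], ')
fields = parse_slow_log(sample)
```

Running an ingest pipeline inside the cluster is preferable at scale, but a local parser like this is handy for spot-checking the Grok pattern against raw log lines.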

Key fields used later include took/took_millis, Message (the DSL), classname, Cluster, and indexname.

1.2 DSL Aggregation

Because each DSL contains varying parameters, simple aggregation cannot deduplicate queries. A script removes parameters via regex and generates an MD5 hash for uniqueness:

<code>import hashlib
import re

# Strip passed-in parameters from the DSL via regex
pattern = r'("max_expansions":)"[0-9]+"'
replacement = r'\1""'
query_dsl = re.sub(pattern, replacement, query_dsl)
# Generate an MD5 hash as the DSL's unique id
str_md5 = hashlib.md5(cluster_index_query.encode(encoding='UTF-8')).hexdigest()
</code>

After processing, the de-parameterized DSL is stored for further analysis:

<code>{"size":"","query":{"bool":{"must":[{"term":{"id1":{"value":"",}}},{"terms":{"tag_id":[""],}}],"adjust_pure_negative":true,}},"aggregations":{"by_tag_id":{"terms":{"field":"tag_id","size":"",,"show_term_doc_count_error":false,"order":[{"order1":"desc"},{"_key":"asc"}]},"aggregations":{"count_id":{"value_count":{"field":"id"}}}}}}
</code>
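The deduplication step can be sketched end to end as follows. This is a hedged illustration, assuming a few representative substitution patterns and a fingerprint built from cluster, index, and stripped DSL; the real job strips many more parameter shapes.

```python
import hashlib
import re

# Illustrative parameter-stripping patterns; the production script has many more.
PARAM_PATTERNS = [
    (r'("value":)\s*"[^"]*"', r'\1""'),    # string parameter values
    (r'("value":)\s*-?[0-9.]+', r'\1""'),  # numeric parameter values
]

def deparameterize(dsl: str) -> str:
    for pattern, replacement in PARAM_PATTERNS:
        dsl = re.sub(pattern, replacement, dsl)
    return dsl

def dsl_fingerprint(cluster: str, index: str, dsl: str) -> str:
    """MD5 over cluster + index + de-parameterized DSL, used as the dedup key."""
    stripped = deparameterize(dsl)
    return hashlib.md5(f"{cluster}|{index}|{stripped}".encode("utf-8")).hexdigest()

a = dsl_fingerprint("test-online", "test_fetch",
                    '{"query":{"term":{"id1":{"value":"123"}}}}')
b = dsl_fingerprint("test-online", "test_fetch",
                    '{"query":{"term":{"id1":{"value":"999"}}}}')
# a == b: the two queries differ only in parameter values
```

Including the cluster and index in the hash keeps identical query shapes on different indices from collapsing into one group.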

For each deduplicated group, the entry whose took value is the median is kept as a representative, still-parameterized DSL, and the ES/Lucene Profile API is later used for detailed execution analysis.
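Picking the median-took representative per group can be sketched like this. A hedged sketch: the record field names are illustrative, and ties are broken by taking the entry closest to the median.

```python
import statistics

# Pick, per dedup group, the concrete (still parameterized) DSL whose
# took_millis is closest to the group's median.
def median_representative(entries: list[dict]) -> dict:
    """entries: slow-log records sharing one DSL fingerprint."""
    median = statistics.median(e["took_millis"] for e in entries)
    return min(entries, key=lambda e: abs(e["took_millis"] - median))

group = [
    {"took_millis": 120, "Message": '{"term":{"id1":"a"}}'},
    {"took_millis": 271, "Message": '{"term":{"id1":"b"}}'},
    {"took_millis": 950, "Message": '{"term":{"id1":"c"}}'},
]
rep = median_representative(group)   # the 271 ms entry
```

Using the median rather than the maximum avoids profiling a one-off outlier (for example, a query that happened to hit a GC pause).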

2. Profile Parsing Points

2.1 Profile Basics

The Profile API reveals how a search request is executed at a low level, helping identify slow stages. It does not measure network latency, queue time, or coordination overhead. A typical response structure is:

<code>{
    "profile": {
        "shards": [
            {
                "id": "[2aE02wS1R8q_QFnYu6vDVQ][my-index-000001][0]",
                "searches": [
                    {
                        "query": [...],
                        "rewrite_time": 51443,
                        "collector": [...]
                    }
                ],
                "aggregations": [...]
            }
        ]
    }
}
</code>

This output includes timing for query rewriting, collector phases, and aggregations.
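Summing the per-shard timings out of that response shape can be sketched as below. A hedged sketch: the traversal follows the structure shown above, and only the top level of each query tree is counted, since children are already included in their parent's time.

```python
# Walk a Profile API response: sum each shard's top-level query time plus
# rewrite and collector time, then find the slowest shard.
def shard_query_nanos(shard: dict) -> int:
    total = 0
    for search in shard.get("searches", []):
        total += search.get("rewrite_time", 0)
        total += sum(q["time_in_nanos"] for q in search.get("query", []))
        total += sum(c["time_in_nanos"] for c in search.get("collector", []))
    return total

profile = {
    "shards": [{
        "id": "[node][my-index-000001][0]",
        "searches": [{
            "query": [{"type": "TermQuery", "time_in_nanos": 1_200_000}],
            "rewrite_time": 51_443,
            "collector": [{"name": "SimpleTopScoreDocCollector",
                           "time_in_nanos": 300_000}],
        }],
    }],
}
slowest = max(profile["shards"], key=shard_query_nanos)
total_ns = shard_query_nanos(slowest)  # 1_551_443
```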

2.2 Combining Profile with Slow‑Log Data

Slow-log entries record DSL details and total took time, while the profile provides per-stage execution costs. Because the profile measures only Lucene query-phase execution on each shard, it may not fully align with the took metric, which also covers fetch phases and network delays.

Special handling includes:

Fetch‑phase slow queries are identified directly from slow logs.

Large profile sub‑stage times are used as optimization hints, not definitive conclusions.

The total took should exceed the sum of profile sub‑stage times; discrepancies guide further investigation.
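The consistency check described above can be sketched as a small classifier. This is a hedged illustration: the 50% gap threshold is an illustrative choice, not a documented constant.

```python
# If total took greatly exceeds the profiled query time, the gap likely lies
# in fetch, queueing, or network, so the slow log (not the profile) drives the
# diagnosis. The 0.5 threshold here is illustrative.
def classify_gap(took_millis: float, profile_millis: float,
                 gap_threshold: float = 0.5) -> str:
    if profile_millis > took_millis:
        return "inconsistent"          # profile time should not exceed took
    gap = (took_millis - profile_millis) / took_millis
    if gap > gap_threshold:
        return "investigate-non-query-phases"
    return "query-dominated"

label = classify_gap(took_millis=271, profile_millis=40)
```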

3. DSL Inspection Rule Results

Daily inspection results are written to an Excel file, with each de-parameterized DSL occupying a separate sheet. The file is then emailed to the cluster owners, who are resolved via LDAP.
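The per-DSL sheet layout can be sketched with plain data structures before handing rows to an Excel writer such as openpyxl. A hedged sketch: the column set and fingerprint-based sheet naming are illustrative (Excel does cap sheet names at 31 characters).

```python
# Lay out the daily report: one sheet per de-parameterized DSL, named by a
# fingerprint prefix, truncated to Excel's 31-character sheet-name limit.
def build_report(results: list[dict]) -> dict[str, list[list]]:
    sheets: dict[str, list[list]] = {}
    for r in results:
        name = f"dsl_{r['fingerprint']}"[:31]
        rows = sheets.setdefault(
            name, [["cluster", "index", "took_millis", "verdict"]])
        rows.append([r["cluster"], r["index"], r["took_millis"], r["verdict"]])
    return sheets

report = build_report([
    {"fingerprint": "9f2c", "cluster": "test-online", "index": "test_fetch",
     "took_millis": 271, "verdict": "needs optimization"},
])
```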

3.1 DSL Details

3.2 Rule Outcomes

Inspection rules fall into three categories:

Needs optimization – performance‑wasting query patterns are detected.

Needs analysis – certain stages exceed a time‑percentage threshold and require deeper review.

No optimization needed – query times are within acceptable limits.
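The three buckets above can be sketched as a single decision function. A hedged sketch: the thresholds (30% stage share, 100 ms) are illustrative stand-ins for the production values.

```python
# Map inspection signals onto the three verdict buckets described above.
# Thresholds here are illustrative, not the production configuration.
def verdict(bad_patterns_found: bool, max_stage_share: float,
            took_millis: float) -> str:
    if bad_patterns_found:
        return "needs optimization"      # a known performance-wasting pattern
    if max_stage_share > 0.3:
        return "needs analysis"          # one stage dominates the profile
    if took_millis <= 100:
        return "no optimization needed"  # within acceptable limits
    return "needs analysis"

v = verdict(bad_patterns_found=False, max_stage_share=0.1, took_millis=80)
```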

3.3 Profile Information Display

The profile view highlights the slowest shard, main phases (rewrite, query, collector, aggs, other), and sub‑phases with color‑coded duration percentages.

4. Optimization Rules and Recommendations

4.1 Rule: Presence of term long Query

Numeric fields stored as long are indexed in a BKD tree, which is fast for range lookups but inefficient for exact matches: a term query on a long field is rewritten into a PointRangeQuery, and large result sets cause heavy CPU usage.

Detection: a profiled sub-stage type is PointInSetQuery or PointRangeQuery.

Suggestions:

Change the field type to keyword for exact matches.

Use proper range queries for numeric ranges.
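This detection rule amounts to walking the profiled query tree. A hedged sketch: the tree below is an illustrative profile fragment, not real output.

```python
# Recursively walk a shard's profiled query tree looking for
# PointInSetQuery / PointRangeQuery nodes, which signal an exact-match
# (term/terms) query hitting a numeric (BKD-indexed) field.
SLOW_POINT_TYPES = {"PointInSetQuery", "PointRangeQuery"}

def find_point_queries(query_nodes: list[dict]) -> list[str]:
    hits = []
    for node in query_nodes:
        if node.get("type") in SLOW_POINT_TYPES:
            hits.append(f'{node["type"]}: {node.get("description", "")}')
        hits.extend(find_point_queries(node.get("children", [])))
    return hits

tree = [{
    "type": "BooleanQuery", "description": "+id1:[123 TO 123] +tag:x",
    "children": [
        {"type": "PointRangeQuery", "description": "id1:[123 TO 123]"},
        {"type": "TermQuery", "description": "tag:x"},
    ],
}]
flags = find_point_queries(tree)   # one PointRangeQuery hit
```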

4.2 Rule: Presence of range keyword / wildcard Query

Keyword fields use inverted indexes; while fast for exact matches, range or wildcard queries on them degrade to full scans.

Detection: a profiled sub-stage type is MultiTermQueryConstantScoreWrapper.

Suggestions:

Convert the field to a numeric type (e.g., long) for range queries.

For wildcard needs, consider an ngram analyzer or the dedicated wildcard field type.
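Both rules can be wired into the report as a lookup from profiled Lucene query type to recommendation. A hedged sketch: the recommendation wording is illustrative.

```python
# Map profiled Lucene query types to the recommendation emitted in the report.
RULES = {
    "PointInSetQuery": "terms on a numeric field: map it as keyword "
                       "for exact matches",
    "PointRangeQuery": "term on a numeric field: map it as keyword, or use "
                       "a real range query",
    "MultiTermQueryConstantScoreWrapper": "range/wildcard on a keyword field: "
                       "use a numeric type for ranges, or ngram/wildcard "
                       "fields for wildcards",
}

def suggestions(profiled_types: set[str]) -> list[str]:
    return sorted(RULES[t] for t in profiled_types if t in RULES)

advice = suggestions({"MultiTermQueryConstantScoreWrapper", "TermQuery"})
```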

5. Conclusion

The article presents a comprehensive solution for large-scale Elasticsearch slow-query governance. Automated inspection extracts actionable DSLs, combines them with Profile analysis, and applies rule-based recommendations, such as avoiding term-on-long and range-on-keyword patterns, to reduce investigation effort and improve overall backend performance.
