Mastering Elasticsearch Slow Query Automation: Profiling, DSL Extraction, and Optimization Rules
This article explains how to automate Elasticsearch slow‑query inspection by extracting DSL from slow‑log files, deduplicating queries, using the Profile API for detailed execution analysis, and applying rule‑based optimizations such as avoiding term‑long and range‑keyword queries to improve backend performance.
Elasticsearch, as a document‑oriented search database, offers flexible query statistics but can suffer from resource waste and performance pitfalls. To address this, an automated slow‑query governance workflow is introduced, covering log cleaning, DSL extraction, profiling, and rule‑based optimization.
1. Slow‑Query Inspection Logic
1.1 Log Element Cleaning
Logs are synchronized from Tencent Cloud to a local ES cluster and split by type. An ingest pipeline with a Grok expression extracts key fields:
<code>"\[%{DATA:classname}\] \[%{DATA:node}\] \[%{DATA:indexname}\]\[%{NUMBER:shard_no}\] took\[%{DATA:took}\], took_millis\[%{NUMBER:took_millis}\], total_hits\[%{DATA:hits}\], types\[%{DATA:types}\], stats\[%{DATA:stats}\], search_type\[%{DATA:search_type}\], total_shards\[%{NUMBER:total_shards}\], source\[%{DATA:Message}\], id\[%{DATA:id}\], "</code>
The resulting structured log looks like:
<code>{
"logType": "SearchSlow",
"took": "271.7ms",
"total_shards": "12",
"types": "_doc",
"took_millis": "271",
"Message": "{...}",
"Ip": "9.20.83.243",
"shard_no": "4",
"Cluster": "test-online",
"Time": "2022-06-27T22:34:59.428+08:00",
"search_type": "QUERY_THEN_FETCH",
"hits": "-1",
"node": "node1",
"classname": "i.s.s.fetch",
"indexname": "test_fetch",
"stats": "",
"Level": "WARN",
"id": ""
}</code>
Key fields used later include took/took_millis, Message (the DSL), classname, Cluster, and indexname.
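For environments without an ingest pipeline, the same extraction can be approximated in plain Python. The sketch below mirrors the Grok pattern with a regex whose group names match the fields above; it is an illustration, not the production pipeline:

```python
import re

# Named groups mirror the Grok fields: classname, node, indexname, etc.
SLOWLOG_RE = re.compile(
    r"\[(?P<classname>[^\]]*)\] \[(?P<node>[^\]]*)\] "
    r"\[(?P<indexname>[^\]]*)\]\[(?P<shard_no>\d+)\] "
    r"took\[(?P<took>[^\]]*)\], took_millis\[(?P<took_millis>\d+)\], "
    r"total_hits\[(?P<hits>[^\]]*)\], types\[(?P<types>[^\]]*)\], "
    r"stats\[(?P<stats>[^\]]*)\], search_type\[(?P<search_type>[^\]]*)\], "
    r"total_shards\[(?P<total_shards>\d+)\], "
    r"source\[(?P<Message>.*)\], id\[(?P<id>[^\]]*)\], "
)

def parse_slowlog_line(line):
    """Return a dict of slow-log fields, or None if the line does not match."""
    m = SLOWLOG_RE.search(line)
    return m.groupdict() if m else None
```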
1.2 DSL Aggregation
Because each DSL contains varying parameters, simple aggregation cannot deduplicate queries. A script strips the parameters with regexes and generates an MD5 hash as a uniqueness key:
<code>import hashlib
import re

# Strip incoming parameters from the DSL via regex
pattern = r'("max_expansions":)"[0-9]+"'
replacement = r'\1""'
query_dsl = re.sub(pattern, replacement, query_dsl)
# Generate an MD5 hash as the DSL's unique id
str_md5 = hashlib.md5(cluster_index_query.encode(encoding='UTF-8')).hexdigest()
</code>
After processing, the aggregated DSL is stored for further analysis:
<code>{"size":"","query":{"bool":{"must":[{"term":{"id1":{"value":"",}}},{"terms":{"tag_id":[""],}}],"adjust_pure_negative":true,}},"aggregations":{"by_tag_id":{"terms":{"field":"tag_id","size":"",,"show_term_doc_count_error":false,"order":[{"order1":"desc"},{"_key":"asc"}]},"aggregations":{"count_id":{"value_count":{"field":"id"}}}}}}
</code>
For each deduplicated DSL, the entry with the median took value is kept as a representative parameterized query, and the ES/Lucene Profile API is then used for detailed execution analysis.
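The de-parameterize-and-hash step can be sketched end to end as follows. The rule list and field names here are hypothetical stand-ins; the real rule set covers every parameter position in the DSL, not just these two:

```python
import hashlib
import re

# Hypothetical (pattern, replacement) rules: blank out parameter values
# so that queries differing only in parameters collapse to one key.
PARAM_RULES = [
    (r'("value":)"[^"]*"', r'\1""'),  # term/terms values
    (r'("size":)\d+', r'\1""'),       # size parameters
]

def dsl_fingerprint(cluster, index, query_dsl):
    """Strip parameters from a DSL and return (md5, stripped_dsl) for dedup."""
    for pattern, replacement in PARAM_RULES:
        query_dsl = re.sub(pattern, replacement, query_dsl)
    key = f"{cluster}|{index}|{query_dsl}"
    return hashlib.md5(key.encode("UTF-8")).hexdigest(), query_dsl
```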
2. Profile Parsing Points
2.1 Profile Basics
The Profile API reveals how a search request is executed at a low level, helping identify slow stages. It does not measure network latency, queue time, or coordination overhead. A typical response structure is:
<code>{
"profile": {
"shards": [
{
"id": "[2aE02wS1R8q_QFnYu6vDVQ][my-index-000001][0]",
"searches": [
{
"query": [...],
"rewrite_time": 51443,
"collector": [...]
}
],
"aggregations": [...]
}
]
}
}
</code>This output includes timing for query rewriting, collector phases, and aggregations.
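Profiling is enabled per request by adding "profile": true to the search body. A minimal sketch (the query and field names are hypothetical):

```python
import json

# Only the "profile" flag matters here; the query itself is illustrative.
body = {
    "profile": True,
    "query": {"term": {"tag_id": {"value": "42"}}},
}
# POST this body to <index>/_search; the response then carries the
# "profile" section with per-shard query, rewrite, and collector timings.
request_json = json.dumps(body)
```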
2.2 Combining Profile with Slow‑Log Data
Slow‑log entries record the DSL and the total took time, while the profile provides per‑stage execution costs. Because the profile measures only Lucene‑level query execution on each shard, it may not fully align with the took metric, especially when the fetch phase or network delays dominate.
Special handling includes:
Fetch‑phase slow queries are identified directly from slow logs.
Large profile sub‑stage times are used as optimization hints, not definitive conclusions.
The total took should exceed the sum of profile sub‑stage times; discrepancies guide further investigation.
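The last point can be sketched as a small helper that sums one shard's profile timings (all in nanoseconds) for comparison against took_millis. The field names follow the Profile API response structure shown above; aggregation timings are omitted for brevity:

```python
def profile_stage_summary(shard_profile):
    """Sum query time, rewrite_time, and collector time (nanoseconds)
    from one shard's profile, for comparison with the slow-log took."""
    total_nanos = 0
    for search in shard_profile.get("searches", []):
        # Each top-level query node reports its inclusive time_in_nanos.
        total_nanos += sum(q.get("time_in_nanos", 0) for q in search.get("query", []))
        total_nanos += search.get("rewrite_time", 0)
        total_nanos += sum(c.get("time_in_nanos", 0) for c in search.get("collector", []))
    return total_nanos
```

If this sum approaches or exceeds the slow-log took for the shard, the cost is inside Lucene execution; a large gap points at the fetch phase or coordination instead.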
3. DSL Inspection Rule Results
Daily inspection results are written to an Excel file, with each de‑parameterized DSL occupying a separate sheet. The file is then emailed to the cluster owners, who are looked up via LDAP.
3.1 DSL Details
3.2 Rule Outcomes
Inspection rules fall into three categories:
Needs optimization – performance‑wasting query patterns are detected.
Needs analysis – certain stages exceed a time‑percentage threshold and require deeper review.
No optimization needed – query times are within acceptable limits.
3.3 Profile Information Display
The profile view highlights the slowest shard, main phases (rewrite, query, collector, aggs, other), and sub‑phases with color‑coded duration percentages.
4. Optimization Rules and Recommendations
4.1 Rule: Presence of term long Query
Numeric fields stored as long use BKD‑tree lookups, which are fast for range queries but inefficient for exact matches, since a term query on a numeric field is rewritten to a PointRangeQuery. With large result sets this causes heavy CPU usage.
Detection: the profile sub‑stage type is PointInSetQuery or PointRangeQuery.
Suggestions:
Change the field type to keyword for exact matches.
Use proper range queries on numeric types for numeric ranges.
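As a sketch of the suggested fix, an exact-match field can be mapped as keyword while keeping a numeric sub-field for ranges via a multi-field. The index and field names here are hypothetical:

```python
# Hypothetical mapping: term lookups on "id1" hit the inverted index
# (keyword), while range queries can target the numeric copy "id1.num".
mapping = {
    "mappings": {
        "properties": {
            "id1": {
                "type": "keyword",
                "fields": {
                    "num": {"type": "long"},  # BKD tree for range queries
                },
            }
        }
    }
}
```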
4.2 Rule: Presence of range keyword / wildcard Query
Keyword fields use inverted indexes; while fast for exact matches, range or wildcard queries on them degrade to full term scans.
Detection: the profile sub‑stage type is MultiTermQueryConstantScoreWrapper.
Suggestions:
Convert the field to a numeric type (e.g., long) for range queries.
For wildcard needs, consider an ngram analyzer or the dedicated wildcard field type.
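The contrast can be illustrated with two query bodies (field names hypothetical): the first ranges over a keyword field and degrades to a term scan, the second ranges over a long field and uses the BKD tree:

```python
# Range on a keyword field: rewritten to MultiTermQueryConstantScoreWrapper,
# effectively scanning the term dictionary.
slow_query = {"query": {"range": {"price_str": {"gte": "100", "lte": "200"}}}}

# Same range on a long field: served by the BKD tree instead.
fast_query = {"query": {"range": {"price": {"gte": 100, "lte": 200}}}}
```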
5. Conclusion
The article presents a comprehensive solution for large‑scale Elasticsearch slow‑query governance. Automated inspection extracts actionable DSLs, combines them with Profile analysis, and applies rule‑based recommendations, such as avoiding term‑on‑long and range‑on‑keyword patterns, to reduce investigation effort and improve overall backend performance.
Weimob Technology Center
Official platform of the Weimob Technology Center