Text2DSL: Convert Natural Language to Precise Elasticsearch/Easysearch DSL
Text2DSL lets users describe search requirements in plain language, uses DeepSeek to generate Elasticsearch DSL, validates the DSL locally with Elasticsearch/Easysearch, iteratively refines it up to five times, and achieves over 95% first‑try accuracy while cutting query‑building time by at least threefold.
Why Build This Tool?
Writing Elasticsearch DSL manually is error‑prone. Common problems include mismatched brackets, misspelled field names, forgotten commas, slow reference to official documentation for bool queries and nested aggregations, and difficulty debugging complex queries.
Hand‑written DSL easily contains syntax errors.
Consulting the official docs for each query type is time‑consuming.
Deeply nested aggregations become unreadable and fragile.
Existing converters cannot guarantee that the generated DSL is 100 % correct.
The solution generates DSL and immediately validates it against a local Elasticsearch or Easysearch instance; if validation fails, the tool regenerates the DSL.
Overall Design
Core Workflow
User input → DeepSeek API generates DSL → local Elasticsearch validates → if validation passes, the DSL is returned; otherwise the error is fed back to DeepSeek for regeneration, repeating up to five times.
The key is local validation . Many tools only generate DSL without checking correctness.
This tool creates a temporary index, writes test data, executes the DSL, and deletes the index regardless of success or failure.
Technology Choices
Backend: Flask, chosen for its lightweight nature.
LLM: DeepSeek API, offering roughly one‑tenth the cost of OpenAI, which is economical for multiple calls.
Search engine: Elasticsearch 9.0+ (compatible with Easysearch 2.0+).
Four Core Features
Natural Language → DSL
The system’s core converts a description such as "find documents whose title contains Elasticsearch" into a standard Elasticsearch DSL. Prompt engineering is critical; the prompt explicitly instructs the model to output pure JSON, avoid Markdown code fences, specify the Elasticsearch version, and provide common query patterns. First‑generation accuracy exceeds 95 %.
Local Validation Mechanism
The validation process consists of four steps:
Create a temporary test index. The index name includes a timestamp (e.g., text2dsl_test_1705234567) to avoid conflicts.
Define a mapping with common field types: text, keyword, integer, double, boolean, date.
Insert 3‑5 synthetic documents covering all field types so that the query has data to match.
Execute the DSL via the Elasticsearch/Easysearch search API, capture results or error messages, and finally delete the temporary index.
This mechanism catches syntax errors, type mismatches, and logical problems immediately.
Iterative Optimization
If validation fails, the error message (e.g., "field [count] is not sortable") is fed back to DeepSeek, which regenerates a corrected DSL. Typically 1‑2 iterations succeed; the most complex query required three attempts. The maximum iteration count is set to 5 to prevent endless loops and control API cost. Each iteration is logged, allowing users to see how the DSL evolves.
Frontend Interface
The UI is divided into four areas:
Input area: a text box for natural‑language description, a dropdown to select operation type (query or aggregation), and a generate button.
DSL display area: CodeMirror editor showing the generated JSON DSL with syntax highlighting.
Result display area: JSON response from Elasticsearch, showing matched documents or aggregation results.
Iteration history area: shows the number of iterations and the reason for each optimization.
The interface is responsive and works on both PC and mobile.
Real‑World Use Cases
Basic Match Query
Input: "Find documents with titles containing Elasticsearch". The generated DSL is a standard match query, validated on the first try with zero iterations.
Combined Query
Input: "Find titles containing 'test' and views greater than 100". The tool produces a bool query with must clauses for match and range.
OR Combination + Sorting
Input: "Find documents authored by 'Zhang San' or 'Li Si', sorted by publish time descending". The generated DSL includes a should clause and a sort definition.
Complex Nested Logic + Aggregation + Script Field + Pagination
Requirements: title contains "测试" or "验证", status is "已发布", views > 100, sorted by views descending, page 2 with 10 items, and a terms aggregation on category. The tool automatically repairs the DSL and validates it.
POST text2dsl_test_1769784751/_search
{
"query": {
"bool": {
"must": [
{ "match": { "title": "测试" } },
{ "term": { "status": "已发布" } },
{ "range": { "views": { "gt": 100 } } }
]
}
},
"sort": [{ "views": { "order": "desc" } }],
"from": 10,
"size": 10,
"aggs": {
"category_count": { "terms": { "field": "category" } }
}
}Pitfalls Encountered
Pitfall 1: DeepSeek Returns Markdown‑Wrapped JSON
Even though the prompt forbids code fences, occasional back‑ticks appear. The solution is to strip back‑ticks and the optional "json" marker before parsing the JSON.
Pitfall 2: Test Index Not Deleted, Causing Name Collisions
Earlier versions left orphaned indices, leading to creation errors. Using a timestamped index name and a finally‑block cleanup guarantees deletion.
Pitfall 3: Aggregation Queries Missing size: 0
Missing size: 0 caused large document payloads. Adding an explicit prompt instruction to set size: 0 for pure aggregation queries eliminates the issue.
Observed Benefits
Overall accuracy is around 95 %; the remaining 5 % typically require one or two optimization loops.
Common match, term, range, and bool queries succeed on the first generation. Complex nested aggregations may need 1‑2 iterations.
Productivity improves by at least threefold: a query that previously took 5‑10 minutes now takes 1‑2 minutes.
The tool is especially valuable for non‑technical users (product managers, project managers) who can describe requirements in plain language, obtain correct DSL, and simultaneously learn Elasticsearch query syntax.
Future Plans
DSL Template Library
Collect common query patterns so users can select and modify templates directly.
Query History Storage
Persist users' query history for easy reuse and sharing.
Code Export
Generate equivalent calls in Python, Java, curl, and other languages in addition to raw JSON.
Conclusion
The core value of the tool is "generate + validate". Generation relies on DeepSeek; validation runs on a local Elasticsearch/Easysearch instance; iterative refinement guarantees near‑100 % correctness.
For frequent Elasticsearch users, the tool dramatically reduces the need to read documentation or debug JSON, allowing them to "just speak in natural language" to obtain a working DSL.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Mingyi World Elasticsearch
The leading WeChat public account for Elasticsearch fundamentals, advanced topics, and hands‑on practice. Join us to dive deep into the ELK Stack (Elasticsearch, Logstash, Kibana, Beats).
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
