Text2DSL: Convert Natural Language to Precise Elasticsearch/Easysearch DSL

Text2DSL lets users describe search requirements in plain language, uses DeepSeek to generate Elasticsearch DSL, validates the DSL locally with Elasticsearch/Easysearch, iteratively refines it up to five times, and achieves over 95% first‑try accuracy while cutting query‑building time by at least threefold.

Mingyi World Elasticsearch
Mingyi World Elasticsearch
Mingyi World Elasticsearch
Text2DSL: Convert Natural Language to Precise Elasticsearch/Easysearch DSL

Why Build This Tool?

Writing Elasticsearch DSL manually is error‑prone. Common problems include mismatched brackets, misspelled field names, forgotten commas, slow reference to official documentation for bool queries and nested aggregations, and difficulty debugging complex queries.

Hand‑written DSL easily contains syntax errors.

Consulting the official docs for each query type is time‑consuming.

Deeply nested aggregations become unreadable and fragile.

Existing converters cannot guarantee that the generated DSL is 100 % correct.

The solution generates DSL and immediately validates it against a local Elasticsearch or Easysearch instance; if validation fails, the tool regenerates the DSL.

Overall Design

Core Workflow

User input → DeepSeek API generates DSL → local Elasticsearch validates → if validation passes, the DSL is returned; otherwise the error is fed back to DeepSeek for regeneration, repeating up to five times.

The key is local validation . Many tools only generate DSL without checking correctness.

This tool creates a temporary index, writes test data, executes the DSL, and deletes the index regardless of success or failure.

Technology Choices

Backend: Flask, chosen for its lightweight nature.

LLM: DeepSeek API, offering roughly one‑tenth the cost of OpenAI, which is economical for multiple calls.

Search engine: Elasticsearch 9.0+ (compatible with Easysearch 2.0+).

Four Core Features

Natural Language → DSL

The system’s core converts a description such as "find documents whose title contains Elasticsearch" into a standard Elasticsearch DSL. Prompt engineering is critical; the prompt explicitly instructs the model to output pure JSON, avoid Markdown code fences, specify the Elasticsearch version, and provide common query patterns. First‑generation accuracy exceeds 95 %.

Local Validation Mechanism

The validation process consists of four steps:

Create a temporary test index. The index name includes a timestamp (e.g., text2dsl_test_1705234567) to avoid conflicts.

Define a mapping with common field types: text, keyword, integer, double, boolean, date.

Insert 3‑5 synthetic documents covering all field types so that the query has data to match.

Execute the DSL via the Elasticsearch/Easysearch search API, capture results or error messages, and finally delete the temporary index.

This mechanism catches syntax errors, type mismatches, and logical problems immediately.

Iterative Optimization

If validation fails, the error message (e.g., "field [count] is not sortable") is fed back to DeepSeek, which regenerates a corrected DSL. Typically 1‑2 iterations succeed; the most complex query required three attempts. The maximum iteration count is set to 5 to prevent endless loops and control API cost. Each iteration is logged, allowing users to see how the DSL evolves.

Frontend Interface

The UI is divided into four areas:

Input area: a text box for natural‑language description, a dropdown to select operation type (query or aggregation), and a generate button.

DSL display area: CodeMirror editor showing the generated JSON DSL with syntax highlighting.

Result display area: JSON response from Elasticsearch, showing matched documents or aggregation results.

Iteration history area: shows the number of iterations and the reason for each optimization.

The interface is responsive and works on both PC and mobile.

Real‑World Use Cases

Basic Match Query

Input: "Find documents with titles containing Elasticsearch". The generated DSL is a standard match query, validated on the first try with zero iterations.

Combined Query

Input: "Find titles containing 'test' and views greater than 100". The tool produces a bool query with must clauses for match and range.

OR Combination + Sorting

Input: "Find documents authored by 'Zhang San' or 'Li Si', sorted by publish time descending". The generated DSL includes a should clause and a sort definition.

Complex Nested Logic + Aggregation + Script Field + Pagination

Requirements: title contains "测试" or "验证", status is "已发布", views > 100, sorted by views descending, page 2 with 10 items, and a terms aggregation on category. The tool automatically repairs the DSL and validates it.

POST text2dsl_test_1769784751/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "测试" } },
        { "term": { "status": "已发布" } },
        { "range": { "views": { "gt": 100 } } }
      ]
    }
  },
  "sort": [{ "views": { "order": "desc" } }],
  "from": 10,
  "size": 10,
  "aggs": {
    "category_count": { "terms": { "field": "category" } }
  }
}

Pitfalls Encountered

Pitfall 1: DeepSeek Returns Markdown‑Wrapped JSON

Even though the prompt forbids code fences, occasional back‑ticks appear. The solution is to strip back‑ticks and the optional "json" marker before parsing the JSON.

Pitfall 2: Test Index Not Deleted, Causing Name Collisions

Earlier versions left orphaned indices, leading to creation errors. Using a timestamped index name and a finally‑block cleanup guarantees deletion.

Pitfall 3: Aggregation Queries Missing size: 0

Missing size: 0 caused large document payloads. Adding an explicit prompt instruction to set size: 0 for pure aggregation queries eliminates the issue.

Observed Benefits

Overall accuracy is around 95 %; the remaining 5 % typically require one or two optimization loops.

Common match, term, range, and bool queries succeed on the first generation. Complex nested aggregations may need 1‑2 iterations.

Productivity improves by at least threefold: a query that previously took 5‑10 minutes now takes 1‑2 minutes.

The tool is especially valuable for non‑technical users (product managers, project managers) who can describe requirements in plain language, obtain correct DSL, and simultaneously learn Elasticsearch query syntax.

Future Plans

DSL Template Library

Collect common query patterns so users can select and modify templates directly.

Query History Storage

Persist users' query history for easy reuse and sharing.

Code Export

Generate equivalent calls in Python, Java, curl, and other languages in addition to raw JSON.

Conclusion

The core value of the tool is "generate + validate". Generation relies on DeepSeek; validation runs on a local Elasticsearch/Easysearch instance; iterative refinement guarantees near‑100 % correctness.

For frequent Elasticsearch users, the tool dramatically reduces the need to read documentation or debug JSON, allowing them to "just speak in natural language" to obtain a working DSL.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Elasticsearchnatural language processingDeepSeekFlaskEasysearchDSL generationQuery validation
Mingyi World Elasticsearch
Written by

Mingyi World Elasticsearch

The leading WeChat public account for Elasticsearch fundamentals, advanced topics, and hands‑on practice. Join us to dive deep into the ELK Stack (Elasticsearch, Logstash, Kibana, Beats).

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.