Text2SQL Showdown: Which Technical Path Delivers Higher Accuracy and Lower Cost?

This article compares two contrasting Text2SQL architectures, LLM + RAG + DSL and rule‑driven NLQ, on accuracy under controlled conditions, implementation cost, complex‑query support and real‑world suitability for enterprise BI, and concludes which approach is more reliable and cost‑effective.

Past Memory Big Data

Dec 4, 2025

Text‑to‑SQL accuracy

The reported >95% accuracy is not a universal metric. It is achieved only when several pre‑conditions are satisfied:

Fully built semantic layer – clear metric/dimension definitions, synonym mapping and business rules.

Domain‑specific language (DSL) as intermediate representation – more robust than direct SQL generation and can embed business logic.

Multi‑stage validation – syntax, semantics and result checks filter out many errors.

RAG supplies contextual knowledge – prevents the model from hallucinating metric definitions or calculation logic.

Test set reflects standardized internal queries – e.g., “last month GMV in East China”.

In a controlled environment with good data governance and semantic modeling, 95% is reachable. In open‑domain or zero‑shot scenarios without a semantic layer, public benchmarks such as Spider and BIRD show execution accuracy (ExAcc) around 60‑70%.

Commercial products (ThoughtSpot, Tableau Ask Data, Alibaba Quick BI, DataFocus) close the gap to these benchmark figures by combining a semantic layer, rule fall‑backs and user‑feedback loops. They reach >90% in customized deployments, but only after extensive upfront configuration.
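The multi‑stage validation precondition listed above can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the DSL shape (a dict with "metric" and "filters"), the SEMANTIC_LAYER catalogue and the execute callback are all assumptions made for the example.

```python
# Hypothetical semantic-layer catalogue: the only names a query may reference.
SEMANTIC_LAYER = {
    "metrics": {"gmv"},
    "dimensions": {"region", "order_month"},
}


def check_syntax(dsl):
    """Stage 1: the DSL must have the expected keys and types."""
    errors = []
    if not isinstance(dsl.get("metric"), str):
        errors.append("missing or non-string 'metric'")
    if not isinstance(dsl.get("filters", {}), dict):
        errors.append("'filters' must be an object")
    return errors


def check_semantics(dsl):
    """Stage 2: every referenced name must exist in the semantic layer."""
    errors = []
    if dsl["metric"] not in SEMANTIC_LAYER["metrics"]:
        errors.append(f"unknown metric: {dsl['metric']}")
    for dim in dsl.get("filters", {}):
        if dim not in SEMANTIC_LAYER["dimensions"]:
            errors.append(f"unknown dimension: {dim}")
    return errors


def check_result(rows):
    """Stage 3: a sanity check on the executed result."""
    if not rows or rows[0][0] is None:
        return ["query returned no data"]
    if rows[0][0] < 0:
        return ["negative value for an additive metric"]
    return []


def validate(dsl, execute):
    """Run the three stages in order and stop at the first one that fails.

    `execute` is a caller-supplied function that runs the DSL and returns rows.
    """
    errors = check_syntax(dsl) or check_semantics(dsl)
    if errors:
        return None, errors
    rows = execute(dsl)
    errors = check_result(rows)
    return (None, errors) if errors else (rows[0][0], [])
```

Because each stage returns a list of messages, a failed query can be rejected or sent back to the model with a concrete reason rather than a generic error.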

RAG cost and mitigation

Cost drivers include:

Knowledge material preparation (metric dictionaries, business‑term tables, calculation specifications).

Document chunking and vectorisation quality.

Query‑time relevance vs. noise trade‑off.

Key cost‑reduction tactics observed in practice:

Highly structured knowledge – instead of chunking PDFs, store business entities as JSON metadata (e.g., a GMV definition) and inject it directly into the RAG index.

Deep integration with the semantic layer – RAG matches against pre‑built candidate explanations rather than performing raw full‑text retrieval.

User‑feedback‑driven knowledge accumulation – each correction (e.g., “active user means logged‑in + ordered”) is automatically added to the knowledge base, creating a low‑cost “label‑as‑you‑use” loop.

Lightweight RAG – trigger retrieval only for key terms or fuzzy expressions, avoiding full‑text search for every query and thereby reducing latency and cost.
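Three of the tactics above (structured entries, retrieval triggered only by key terms, and feedback‑driven accumulation) can be combined in a small sketch. The KNOWLEDGE layout and the function names are hypothetical, and plain substring matching stands in for whatever term matcher a real system would use.

```python
# Hypothetical structured knowledge base keyed by canonical business term.
KNOWLEDGE = {
    "GMV": {
        "definition": "Gross Merchandise Value",
        "synonyms": ["gross merchandise value", "total transaction amount"],
    },
}


def build_index(knowledge):
    """Map each lower-cased term or synonym to its canonical term."""
    index = {}
    for term, entry in knowledge.items():
        index[term.lower()] = term
        for synonym in entry["synonyms"]:
            index[synonym.lower()] = term
    return index


def retrieve_context(query, knowledge):
    """Return entries only for terms that occur in the query.

    An empty result means no retrieval is needed for this query.
    Substring matching is a simplification; it can match inside longer words.
    """
    index = build_index(knowledge)
    lowered = query.lower()
    hits = {canonical for alias, canonical in index.items() if alias in lowered}
    return {term: knowledge[term] for term in sorted(hits)}


def record_correction(knowledge, term, definition, synonyms=()):
    """Feedback loop: store a user's correction as a structured entry."""
    entry = knowledge.setdefault(term, {"definition": definition, "synonyms": []})
    entry["definition"] = definition
    for synonym in synonyms:
        if synonym not in entry["synonyms"]:
            entry["synonyms"].append(synonym)
```

A query with no known business term returns an empty context, so the retrieval step is skipped entirely; a correction recorded once is available to every later query.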

Technical approaches comparison

Two divergent architectures are contrasted:

Approach A (RAG + LLM → DSL) – LLM interprets intent, RAG provides context, and the model generates a machine‑oriented DSL (JSON/MQL) that is later executed.

Approach B (Standard Text + Rule Engine) – a rule‑driven pipeline first normalises natural language into a human‑readable “standard text”, then a deterministic engine converts it into executable queries.
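To make the "machine‑oriented DSL" of Approach A concrete, here is a minimal sketch of a DSL object and the deterministic step that turns it into parameterised SQL. The QueryDSL fields, the orders table, the column names and the metric expression are assumptions for illustration; the article does not specify the DSL format beyond JSON/MQL.

```python
from dataclasses import dataclass, field


@dataclass
class QueryDSL:
    """Hypothetical intermediate representation emitted by the LLM."""
    metric: str
    group_by: list[str] = field(default_factory=list)
    filters: dict[str, str] = field(default_factory=dict)


# Assumed semantic-layer mappings (not from the article).
METRIC_SQL = {"gmv": "SUM(order_amount)"}
ALLOWED_COLUMNS = {"region", "order_month"}
TABLE = "orders"


def compile_dsl(dsl):
    """Compile the DSL into (sql, params); unknown names are rejected."""
    if dsl.metric not in METRIC_SQL:
        raise ValueError(f"unknown metric: {dsl.metric}")
    for column in [*dsl.group_by, *dsl.filters]:
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown column: {column}")

    select = ", ".join([*dsl.group_by, f"{METRIC_SQL[dsl.metric]} AS {dsl.metric}"])
    sql = f"SELECT {select} FROM {TABLE}"
    if dsl.filters:
        # Values travel as bound parameters, never as string fragments.
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in dsl.filters)
    if dsl.group_by:
        sql += " GROUP BY " + ", ".join(dsl.group_by)
    return sql, list(dsl.filters.values())
```

The model only has to fill in a small structure; the metric formula and table names come from configuration, which is where the business logic is embedded.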

Dimension comparison

LLM role – A: core generator of DSL/SQL; B: only performs natural‑language normalisation.

Intermediate representation – A: machine‑friendly JSON/MQL (hard for humans to read); B: human‑friendly “standard text” readable by both people and machines.

Accuracy assurance – A: relies on RAG context + post‑execution validation (syntax only); B: human‑verifiable standard text + deterministic rule‑based conversion, so the conversion step itself cannot hallucinate.

Knowledge injection – A: RAG retrieves external docs / fine‑tuned models; B: NLQ dictionary (structured business‑data mapping) similar to a semantic layer.

Complex query support – A: theoretically supported but error rate rises sharply for joins, sub‑queries, window functions; B: explicit MQL/DQL/SPL layers handle multi‑table joins, aggregations and advanced metrics such as retention.

Accuracy comparison

Approach A can misinterpret terms, confuse dimensions or link the wrong tables. The generated DSL is not human‑readable, so semantic errors may propagate silently; post‑execution checks catch only syntax errors. Approach B produces standard text that is immediately understandable (e.g., “2023 orders shipped from Beijing to Qingdao”). Users can confirm the intent before the deterministic rule conversion runs, and because that conversion involves no model, it cannot introduce hallucinations.
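The deterministic conversion in Approach B can be pictured as pattern rules applied to the confirmed standard text. The sketch below handles only the example sentence from the paragraph above; the rule format, the orders table and its column names are hypothetical, and a production rule engine would use a proper grammar rather than a single regular expression.

```python
import re

# Each rule: (pattern over standard text, SQL template, captured group order).
RULES = [
    (
        re.compile(
            r"^(?P<year>\d{4}) orders shipped from "
            r"(?P<origin>[A-Za-z ]+?) to (?P<dest>[A-Za-z ]+)$"
        ),
        "SELECT * FROM orders "
        "WHERE ship_year = ? AND origin_city = ? AND dest_city = ?",
        ("year", "origin", "dest"),
    ),
]


def standard_text_to_sql(text):
    """Convert confirmed standard text to (sql, params) using fixed rules.

    Text that matches no rule is rejected instead of being guessed at.
    """
    cleaned = text.strip()
    for pattern, sql, group_names in RULES:
        match = pattern.match(cleaned)
        if match:
            return sql, [match.group(name) for name in group_names]
    raise ValueError(f"no rule matches standard text: {cleaned!r}")
```

The same input always yields the same query, and unsupported phrasing fails loudly, which is the property the article credits for Approach B's reliability.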

Implementation & operations cost

Approach A requires building a high‑quality RAG knowledge base, repeated prompt engineering, possible model fine‑tuning and debugging a black‑box pipeline. Approach B needs no LLM fine‑tuning; a generic LLM handles the simple normalisation step, while the core asset is an NLQ dictionary configured via visual tools, making the pipeline traceable and low‑cost.

Complex query capability

Approach A can generate nested SQL but error rates explode for multi‑table joins, sub‑queries or window functions, often leading to a restriction to single‑table or simple aggregates. Approach B’s MQL/DQL/SPL design explicitly supports joins, aggregations and advanced metrics, preserving accuracy while handling enterprise‑grade complexity.

Deployability & user experience

Approach A delivers instant charts suitable for exploratory analysis, but users lack visibility into errors, making it unsuitable for decision‑making. Approach B adds a “confirm standard text” step; users see exactly what the system understood, building trust for production use. The step can be optional for low‑risk queries or omitted entirely in a low‑cost mode.

Practical engineering tips

Highlight differences between the original query and the back‑translated text, i.e. the system's natural‑language restatement of what it understood (e.g., red‑mark mismatched terms).

Provide an edit interface for users to correct the back‑translated text, triggering a re‑parse.

Log mis‑predictions to refine the RAG knowledge base or LLM prompts.

Enforce confirmation for high‑risk queries (financial metrics, YoY/MoM comparisons).
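Two of these tips, highlighting mismatches and forcing confirmation for high‑risk queries, can be sketched with Python's standard difflib module. The word‑level comparison and the HIGH_RISK_TERMS list are illustrative assumptions; a real deployment would define risk through its own metric catalogue.

```python
import difflib

# Hypothetical markers of high-risk queries (financial metrics, YoY/MoM).
HIGH_RISK_TERMS = {"revenue", "profit", "yoy", "mom"}


def mismatched_terms(original, back_translated):
    """Return (original_words, back_translated_words) pairs that differ."""
    a = original.lower().split()
    b = back_translated.lower().split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


def needs_confirmation(query, diffs):
    """Require confirmation when the texts differ or the query is high-risk."""
    words = set(query.lower().split())
    return bool(diffs) or bool(words & HIGH_RISK_TERMS)
```

The pairs returned by mismatched_terms are what a UI would mark in red, and the same list can be written to a log to feed later refinement of the knowledge base or prompts.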

Key insight

Approach A tries to make the AI as smart as possible, while Approach B partitions the work: the LLM only normalises the language, humans confirm the ambiguous parts, and a rule engine performs the deterministic transformation. In high‑risk, high‑accuracy enterprise settings, the latter yields a more sustainable engineering path.

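Example of a structured knowledge entry, as described under "RAG cost and mitigation" (a GMV definition stored as JSON metadata):

{
  "term": "GMV",
  "definition": "Gross Merchandise Value, total transaction amount excluding refunds",
  "calculation": "SUM(order_amount) WHERE status IN ('paid', 'shipped')",
  "synonyms": ["成交额", "总交易额"]
}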

Adding a confirmation step (DSL → natural language back‑translation) is technically feasible and improves trust, but it only mitigates errors that occur during DSL generation; it does not eliminate structural limitations such as poor complex‑query support or the inherent cost of maintaining a RAG knowledge base.
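A back‑translation step of this kind can be as simple as a template over the DSL fields. The sketch below assumes a dict‑shaped DSL with "metric", "group_by" and "filters" keys and a hypothetical METRIC_LABELS table; it only illustrates the idea and does not address the structural limitations noted above.

```python
# Hypothetical display labels for metrics defined in the semantic layer.
METRIC_LABELS = {"gmv": "GMV"}


def back_translate(dsl):
    """Render a DSL dict as a short sentence the user can confirm."""
    parts = [METRIC_LABELS.get(dsl["metric"], dsl["metric"])]
    if dsl.get("group_by"):
        parts.append("by " + ", ".join(dsl["group_by"]))
    conditions = [f"{dim} is {value}" for dim, value in dsl.get("filters", {}).items()]
    if conditions:
        parts.append("where " + " and ".join(conditions))
    return " ".join(parts)
```

Because the sentence is generated from the DSL rather than from the user's words, any misread term or filter shows up in the text the user is asked to confirm.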
