Beyond 100% Accuracy: Key Metrics to Evaluate in Text2SQL Systems

The article argues that a 100% accuracy claim for Text2SQL is misleading without considering stability, coverage, and pass‑rate metrics, and it details a deterministic NLQ pipeline that converts natural language to a verifiable intermediate format before rule‑based SQL compilation.

Past Memory Big Data
Past Memory Big Data
Past Memory Big Data
Beyond 100% Accuracy: Key Metrics to Evaluate in Text2SQL Systems

Why Accuracy Alone Is Misleading

In Text2SQL, reported accuracies of 90%–95% often hide the denominator; accuracy drops sharply when queries involve multi‑table JOINs or sub‑queries, falling to 50%‑60% or worse. Moreover, the metric is unstable: the same model may answer correctly today and fail tomorrow, making a single‑number accuracy unreliable for product teams.

润乾 NLQ: A Different Pipeline

Instead of letting a large language model (LLM) directly generate SQL, 润乾 NLQ first translates the user's spoken query into a standardized intermediate text (规范文本) that both humans and machines can understand. A rule engine then deterministically compiles this text into SQL, guaranteeing 100% correctness for the compilation step.

Example conversion:

User utterance: 帮我查查 去年上半年签单的客户有哪些

Standardized text: 去年 上半年 签单 客户

The standardized text acts as a “human‑readable, machine‑readable” middle language, eliminating probabilistic guessing.

TPCH Benchmark Demonstration

A TPCH dataset with 8 tables and up to tens of thousands of rows was used to test the system. Sample queries and the generated SQL are shown below.

Simple single‑table filter:

SELECT
	T_1."C_CUSTKEY" "客户编号",
	T_1."C_PHONE" "电话",
	T_1."C_MKTSEGMENT" "市场细分",
	T_1."C_NAME" "名称",
	T_1."C_ACCTBAL" "账户余额",
	T_1."C_NATIONKEY" "所在国家"
FROM CUSTOMER T_1
WHERE (T_1."C_NATIONKEY" = 18)

TOP‑3 order total query:

SELECT
	T_1."O_TOTALPRICE" "订单总金额最大的",
	T_1."O_ORDERKEY" "订单编码",
	T_1."O_ORDERSTATUS" "订单状态",
	T_1."O_ORDERDATE" "订单日期",
	T_1."O_CLERK" "业务员"
FROM ORDERS T_1
ORDER BY T_1."O_TOTALPRICE" DESC
LIMIT 3

Complex multi‑table join with aggregation:

SELECT
    T_1."C_CUSTKEY" "客户编号",
    T_1."C_NATIONKEY" "所在国家",
    T_1."C_NAME" "名称",
    T_1."C_PHONE" "电话",
    T_1."C_ACCTBAL" "账户余额",
    T_1."C_MKTSEGMENT" "市场细分",
    T_2.F_2 "订单总金额总和"
FROM CUSTOMER T_1
LEFT JOIN (
    SELECT T_2."O_CUSTKEY" F_1,
           sum(T_2."O_TOTALPRICE") F_2
    FROM ORDERS T_2
    WHERE YEAR(T_2."O_ORDERDATE") = YEAR(DATEADD('yy', -1, NOW()))
    GROUP BY T_2."O_CUSTKEY"
) T_2 ON T_1."C_CUSTKEY" = T_2.F_1
WHERE T_2.F_2 > 100000 AND T_1."C_NATIONKEY" = 18

All these queries execute correctly, confirming the 100% compilation accuracy claim.

Coverage, Pass‑Rate, and Limitations

润乾 NLQ defines four query patterns—single‑table detail, single‑table aggregation, master‑detail, and multi‑dimensional alignment—that cover most BI use cases. However, users often speak in pure natural language, which the system cannot handle directly; the pass‑rate for raw utterances is low without an LLM.

By inserting an LLM to translate spoken queries into standardized text, the pass‑rate improves dramatically. The LLM’s instability is mitigated because its output is always reviewed by a human (or a verification step) before rule‑based compilation.

Some query types remain unsupported even after standardization, such as explicit sorting commands, cross‑row calculations (e.g., month‑over‑month growth), multi‑entity intersections, predictive forecasts, and data‑modifying statements.

Extending Capability with NLR

For operations that the NLQ layer cannot express (e.g., computing growth rates), a second module called NLR (Natural Language Reporting) consumes the result set from NLQ and performs further calculations using natural language commands.

Example to compute month‑over‑month growth:

计算 订单金额求和 比例环比 命名为 增长率
位置 增长率 显示为 百分数格式

This combination enables end‑to‑end analytics without writing SQL or clicking UI elements.

Why This Approach Stands Out

Deterministic Accuracy: The standardized‑text‑to‑SQL step is rule‑based, guaranteeing 100% correctness once the intermediate text is verified.

Improved Pass‑Rate: Adding an LLM for the translation step raises the proportion of executable queries while keeping the unstable part confined before verification.

Low Implementation Cost: Ordinary database engineers can set up the system in a few days; no AI experts or GPU servers are required.

Clear, Explainable Architecture: DQL metadata removes the pain of multi‑table joins, the NLQ dictionary handles natural‑language mapping, and NLR covers complex calculations, making the whole pipeline transparent.

For teams evaluating Text2SQL solutions, the article suggests reviewing 润乾 NLQ’s technical documentation as a concrete, controllable alternative to black‑box LLM‑only approaches.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

sqlAIDatabaseBenchmarkAccuracyText2SQLNLQ
Past Memory Big Data
Written by

Past Memory Big Data

A popular big-data architecture channel with over 100,000 developers. Publishes articles on Spark, Hadoop, Flink, Kafka and more. Visit the Past Memory Big Data blog at https://www.iteblog.com. Search "Past Memory" on Google or Baidu.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.