How Ant Group’s DeepInsight Boosted Text‑to‑SQL Accuracy by 46% with an AI‑Driven Evaluation Framework

This article details Ant Group’s DeepInsight intelligent evaluation system for Chinese Text‑to‑SQL, covering the AI+BI background, the shortcomings of existing benchmarks, a feature‑annotated evaluation design, automated dataset generation, experimental results showing a 46% accuracy gain and a 71% reduction in failure rate, and future research directions.


Background and Evaluation Challenges in AI+BI

Natural‑language data retrieval lowers the barrier for non‑technical users, but its practical adoption is limited by low execution accuracy. Ant Group built an intelligent evaluation system that jointly models SQL computation features (e.g., aggregation, joins, sub‑queries) and semantic expression features (e.g., synonyms, abbreviations, domain knowledge). The system addresses two critical gaps in the Chinese Text‑to‑SQL ecosystem: the scarcity of high‑quality Chinese benchmarks and the mismatch between public datasets and real‑world enterprise schemas. Internal experiments show a 46% increase in execution accuracy and a 71% reduction in non‑executable cases.

Specific Difficulties of Chinese Text‑to‑SQL

Chinese lacks explicit word boundaries, requiring a reliable tokenization step; tokenization errors directly degrade semantic understanding.

Expressions are often implicit, context‑dependent, and prone to omissions, making schema linking (e.g., mapping "产品名", meaning product name, to product_name) more error‑prone.

Most pretrained models and resources are English‑centric, so transfer performance to Chinese is limited.

Schema linking must handle “Chinese‑style” column names and noisy data, which are rare in English benchmarks; a minimal linking sketch follows this list.
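To make the schema‑linking difficulty concrete, here is a minimal sketch of a glossary‑plus‑fuzzy‑match linker. The glossary entries, column names, and the 0.5 threshold are illustrative assumptions, not DeepInsight’s implementation; production systems typically layer embeddings and business rules on top of such a baseline.

```python
# Hedged sketch of a simple schema-linking step for Chinese column mentions.
# Glossary, column names, and threshold are hypothetical.
from difflib import SequenceMatcher

# Hypothetical glossary mapping Chinese surface forms to schema columns.
GLOSSARY = {
    "产品名": "product_name",    # product name
    "产品名称": "product_name",  # product name (longer variant)
    "订单金额": "order_amount",  # order amount
    "实付金额": "paid_amount",   # actual payment amount
}

def link_column(mention: str) -> str | None:
    """Map a Chinese column mention to a schema column: glossary first,
    then fuzzy string match as a fallback for omissions and variants."""
    if mention in GLOSSARY:
        return GLOSSARY[mention]
    best, score = None, 0.0
    for surface, column in GLOSSARY.items():
        s = SequenceMatcher(None, mention, surface).ratio()
        if s > score:
            best, score = column, s
    return best if score >= 0.5 else None

print(link_column("产品名"))  # product_name (exact glossary hit)
print(link_column("产品"))    # product_name (fuzzy fallback)
```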

Enterprise‑Specific Requirements

Real‑world enterprise databases typically contain thousands of columns (compared with the Spider benchmark average of 27.1 columns) and exhibit:

Non‑standard column naming conventions and dirty data.

Deep business logic linking related fields (e.g., "订单金额", order amount, vs. "实付金额", actual payment amount).

Domain‑specific terminology in e‑commerce, finance, and healthcare that is absent from generic models.

A benchmark that reflects these nuances is essential for reliable model assessment.

Limitations of Existing Public Benchmarks

Predominantly English, with few high‑quality Chinese resources.

Scale and business logic do not match enterprise schemas (Spider’s average schema size is far smaller than that of production databases).

Absence of a fine‑grained feature‑annotation system, preventing difficulty analysis and targeted optimization.

Feature‑Annotated Evaluation Framework

The proposed framework introduces a two‑dimensional annotation scheme:

SQL Computation Features: tags for query structure (SELECT, GROUP BY, HAVING), aggregation functions, join types, sub‑queries, window functions, etc.

SQL Semantic Expression Features: a hierarchy ranging from exact token matches to advanced knowledge‑dependent transformations, including:

Synonyms / near‑synonyms

Spelling errors, homophones, and abbreviations

Unit conversions (e.g., "万元", ten thousand yuan, maps to a factor of 10 000)

Domain defaults (e.g., "区域", region, defaults to the OU region in finance)

Ambiguous fields requiring disambiguation (e.g., "金额", amount, could refer to the order amount or the actual payment)

Enumeration mapping (e.g., code "1" → "Male" based on data dictionary)

This systematic labeling captures both low‑level syntactic nuances and high‑level semantic complexities of Chinese queries.
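As an illustration of the two‑dimensional scheme, a benchmark item might carry both feature lists as plain tags. The field names and tag vocabulary below are assumptions for illustration, not DeepInsight’s actual annotation schema.

```python
# Minimal sketch of a feature-annotated benchmark item; all names assumed.
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    question: str                       # natural-language question (Chinese)
    gold_sql: str                       # reference SQL
    computation_features: list[str] = field(default_factory=list)
    semantic_features: list[str] = field(default_factory=list)
    difficulty: str = "medium"          # easy / medium / hard

item = BenchmarkItem(
    question="各区域去年销售额多少万元？",  # sales per region last year, in 万元
    gold_sql=(
        "SELECT region, SUM(amount) / 10000 AS amount_wan "
        "FROM sales WHERE year = 2024 GROUP BY region"
    ),
    computation_features=["SELECT", "GROUP BY", "aggregation:SUM"],
    semantic_features=["unit_conversion:万元->10000", "domain_default:区域->OU"],
)
```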

Automated Benchmark Generation Pipeline

Seed Question Generation:

Sample schema metadata and model‑specific feature sets.

Generate initial natural‑language questions.

Human reviewers filter for relevance and quality, producing a high‑quality seed set.

Synthetic Question Expansion:

Feed the seed set, schema definitions, and the full SQL‑feature catalogue into a Reflexion module.

The Reflexion module iteratively rewrites questions, introduces paraphrases, and injects controlled noise (e.g., omissions, synonyms) to create a large pool of synthetic questions while preserving the original feature distribution; a sketch of this loop follows below.
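A rough sketch of such an expansion loop is given below. The `llm` callable, the prompts, and the noise menu are assumptions standing in for the actual Reflexion configuration, which the talk does not specify.

```python
# Hedged sketch of a Reflexion-style question-expansion loop; `llm` stands
# in for any chat-completion client that maps a prompt string to a reply.
import random

NOISE_TYPES = ["synonym swap", "abbreviation", "omission", "colloquial rewrite"]

def expand_seed(llm, seed_question: str, schema_ddl: str,
                feature_tags: list[str], rounds: int = 3) -> list[str]:
    """Iteratively rewrite one seed question into several variants while
    keeping its annotated features (and thus the gold SQL) unchanged."""
    variants, current = [], seed_question
    for _ in range(rounds):
        noise = random.choice(NOISE_TYPES)
        draft = llm(f"Schema:\n{schema_ddl}\n"
                    f"Rewrite this question, applying {noise}, without "
                    f"changing which features it exercises {feature_tags}:\n"
                    f"{current}")
        critique = llm(f"Does this rewrite still require features "
                       f"{feature_tags} and map to the same SQL? "
                       f"Answer KEEP or REVISE with a reason:\n{draft}")
        if critique.startswith("KEEP"):
            variants.append(draft)
            current = draft  # build the next round on the accepted variant
    return variants
```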

Answer Annotation:

Execute sample‑n inference with multiple LLMs to obtain candidate SQL statements.

Human annotators verify correctness, assign a difficulty level (easy, medium, hard) based on the annotated features, and perform cross‑annotation to ensure consistency; see the sketch after this list.
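The sketch below shows one plausible shape for the sample‑n step: sample candidates from several models, drop the non‑executable ones, and rank the rest by execution‑result consistency before handing them to annotators. The `generate` client and the grouping heuristic are assumptions, not the documented pipeline.

```python
# Hedged sketch of sample-n candidate generation and consistency ranking.
from collections import Counter

def candidate_sqls(generate, question: str, models: list[str], n: int = 5):
    """generate(model, question) -> one sampled SQL string."""
    return [generate(m, question) for m in models for _ in range(n)]

def rank_by_execution(conn, sqls: list[str]):
    """Group candidates by their execution result; the largest group is the
    most self-consistent answer and goes to human annotators first."""
    groups = Counter()
    for sql in sqls:
        try:
            rows = tuple(sorted(map(tuple, conn.execute(sql).fetchall())))
            groups[rows] += 1
        except Exception:
            continue  # non-executable candidates are simply dropped here
    return groups.most_common()
```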

For production‑grade evaluation, the first stage draws questions directly from live user logs, guaranteeing that the final benchmark mirrors the true distribution of query types, difficulty, and feature coverage.

Evaluation Results and Practical Applications

Major LLMs were evaluated on the open‑source Falcon benchmark:

| Model | Correctness (%) |
| --- | --- |
| DeepSeek‑R1 | 45.2 |
| o1 | 43.0 |
| o3‑mini | 42.2 |
| Claude 3.7 Sonnet (Thinking) | 41.0 |
| GPT‑4.1 | 40.2 |
| Claude 3.7 Sonnet | 40.0 |

By contrast, the best score on Spider 2.0 is ~20%, demonstrating that Falcon’s difficulty design is more realistic for enterprise scenarios.

Continuous internal evaluation, combined with prompt engineering, parameter tuning, Reflexion‑based self‑correction, and semantic validation, yielded a 46% boost in execution accuracy and a 71% drop in non‑executable cases for the Copilot‑style self‑service analytics product.
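For reference, the two headline metrics can be computed as below, under the standard definitions used in Text‑to‑SQL evaluation: execution accuracy counts predictions whose execution result matches the gold SQL’s, and the non‑executable rate counts predictions that raise an error. The toy schema and queries are hypothetical.

```python
# Minimal sketch of "execution accuracy" and "non-executable rate".
import sqlite3

def run(conn, sql):
    """Execute SQL; return (ok, result) where result is a sorted row list."""
    try:
        rows = conn.execute(sql).fetchall()
        return True, sorted(map(tuple, rows))
    except sqlite3.Error:
        return False, None

def evaluate(conn, pairs):
    """pairs: list of (predicted_sql, gold_sql). Returns the two rates."""
    correct = non_executable = 0
    for pred, gold in pairs:
        ok_pred, res_pred = run(conn, pred)
        _, res_gold = run(conn, gold)
        if not ok_pred:
            non_executable += 1
        elif res_pred == res_gold:
            correct += 1
    n = len(pairs)
    return correct / n, non_executable / n

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.9), (2, 100.0)])
acc, bad = evaluate(conn, [
    ("SELECT SUM(amount) FROM orders", "SELECT SUM(amount) FROM orders"),
    ("SELECT amnt FROM orders", "SELECT amount FROM orders"),  # typo -> error
])
print(f"execution accuracy={acc:.0%}, non-executable rate={bad:.0%}")
```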

The evaluation capability has been productized: business units can upload custom benchmark sets, run self‑service tests, and visualize knowledge recall, metric recall, and model inference paths for targeted debugging.

Future Directions

Support multi‑turn dialogue evaluation and fuzzy semantic scenarios.

Incorporate knowledge‑dependent cases (e.g., default domain values, external dictionaries).

Extend from pure data retrieval to full‑pipeline intelligent analysis and interpretation.

Open‑source additional Chinese Text‑to‑SQL benchmarks and establish a community‑driven leaderboard.

Conclusion

The DeepInsight intelligent evaluation framework provides a reproducible methodology for AI+BI assessment in the Chinese Text‑to‑SQL domain. By combining fine‑grained feature annotation, automated benchmark generation, and product‑level integration, it delivers measurable performance gains and a scalable practice that can be adopted across the industry.

Tags: AI, large language models, data analytics, benchmark, Text‑to‑SQL
Written by DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
