
Semantic Parsing for Text-to-SQL: Datasets, Models, Evaluation, and Applications

This article reviews the Text-to-SQL semantic parsing task, covering its motivation, dataset landscape, major model architectures such as pointer networks, sequence‑to‑set, and grammar‑based approaches, evaluation metrics, the newly built DuSQL dataset and DuParser system, real‑world deployments, and remaining research challenges.

DataFunTalk

Semantic parsing is a core natural-language-processing task that maps natural language to executable representations such as database queries. Its Text-to-SQL sub-task automatically converts a user's natural-language question into a SQL statement that can be run on a relational database.

A Text-to-SQL system consists of two parts: a parser, which receives a database schema and a question and outputs the corresponding SQL query, and an executor, which runs that query against the database. This article focuses on the parser, which has to map spans of the question to database elements (tables, columns, values) and generate syntactically correct SQL.
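The parser/executor split can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any system described in the article: `toy_parser` is a hypothetical keyword-matching stand-in for a learned model, and the `city` table, its columns, and its rows are invented for the example.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Schema:
    table: str
    columns: list[str]


def toy_parser(question: str, schema: Schema) -> str:
    """Hypothetical stand-in for a learned parser.

    It selects every column whose name literally appears in the question.
    """
    text = question.lower()
    mentioned = [c for c in schema.columns if c in text]
    cols = ", ".join(mentioned) if mentioned else "*"
    return f"SELECT {cols} FROM {schema.table}"


def execute(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """Executor: runs the SQL produced by the parser."""
    return conn.execute(sql).fetchall()


# Invented example database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, province TEXT, green_rate REAL)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [("Hangzhou", "Zhejiang", 40.0), ("Lanzhou", "Gansu", 25.0)],
)

schema = Schema("city", ["name", "province", "green_rate"])
sql = toy_parser("What is the name and province of each city?", schema)
rows = execute(conn, sql)
```

The parser never touches the data, only the schema and the question; the executor never sees the question. That separation is what lets the two parts be evaluated independently.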

We summarize the evolution of Text-to-SQL datasets, from early single-domain corpora to multi-domain collections such as Spider, CoSQL, NL2SQL, and CSpider, and on to the newly constructed DuSQL dataset. Datasets are classified by number of domains, single-table versus multi-table schemas, question complexity, number of dialogue turns, and conversational extensions.

Model progress is organized into four categories: (1) pointer-network-based models (Seq2SQL, STAMP, Coarse2Fine, IRNet) that copy tokens from the input; (2) sequence-to-set approaches (SQLNet, TypeSQL, SQLova, X-SQL) that predict, for each SQL keyword, the set of columns that belong to it; (3) grammar-based methods (TRANX) that generate a sequence of abstract-syntax-tree actions; and (4) other enhancements such as graph-neural-network encoders (Global GNN, RAT-SQL) and execution-guided decoding. Example generated SQL: SELECT 名称, 所属省 FROM 中国城市 WHERE 绿化率 > 30% (the name and province of Chinese cities whose green-coverage rate exceeds 30%).
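To make the sequence-to-set idea concrete, the sketch below fills a fixed SQL template from predicted sets of columns and conditions. It is an illustration only: the `SketchSlots` and `Condition` types and the English table and column names (`city`, `name`, `province`, `green_rate`) are hypothetical stand-ins, and in a real model each set would come from a classifier rather than being written by hand.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Condition:
    column: str
    op: str
    value: str


@dataclass
class SketchSlots:
    """Hypothetical container for what a sequence-to-set model predicts."""
    select: set[str]
    table: str
    where: set[Condition] = field(default_factory=set)


def fill_sketch(slots: SketchSlots) -> str:
    """Assemble SQL from unordered slot predictions.

    Sorting gives a canonical output, so the order in which the model
    produced the columns or conditions does not affect the result.
    """
    sql = f"SELECT {', '.join(sorted(slots.select))} FROM {slots.table}"
    if slots.where:
        conds = sorted(f"{c.column} {c.op} {c.value}" for c in slots.where)
        sql += " WHERE " + " AND ".join(conds)
    return sql


slots = SketchSlots(
    select={"name", "province"},
    table="city",
    where={Condition("green_rate", ">", "30")},
)
filled = fill_sketch(slots)
```

Because the predictions are sets, two outputs that differ only in the order of WHERE conditions are the same prediction, which avoids penalizing a decoder for an arbitrary ordering.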

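Evaluation uses two metrics: Exact Match (Acc_qm), which compares the predicted and reference SQL component by component, treating each component as a set so that ordering differences are not penalized, and Execution Accuracy (Acc_ex), which checks whether executing the predicted query returns the correct answer.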
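The difference between the two metrics is easiest to see on a small case. The following is a simplified sketch, assuming each query has already been decomposed into a dictionary of components; the benchmark scripts used by real datasets handle many more details (nesting, aliases, value normalization). The table and rows are invented.

```python
import sqlite3


def exact_match(pred: dict, gold: dict) -> bool:
    """Set-wise comparison of SQL components (order inside a clause is ignored)."""
    keys = set(pred) | set(gold)
    return all(set(pred.get(k, ())) == set(gold.get(k, ())) for k in keys)


def execution_match(conn: sqlite3.Connection, pred_sql: str, gold_sql: str) -> bool:
    """Compare query results as unordered row collections."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, province TEXT, green_rate REAL)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [("Hangzhou", "Zhejiang", 40.0), ("Lanzhou", "Gansu", 25.0)],
)

# Same conditions in a different order: still an exact match.
pred = {"select": ["name"],
        "where": [("green_rate", ">", "30"), ("province", "=", "'Zhejiang'")]}
gold = {"select": ["name"],
        "where": [("province", "=", "'Zhejiang'"), ("green_rate", ">", "30")]}
same_components = exact_match(pred, gold)

# Different conditions that happen to return the same rows on this data.
same_result = execution_match(
    conn,
    "SELECT name FROM city WHERE green_rate > 30",
    "SELECT name FROM city WHERE green_rate >= 31",
)
```

The second pair shows why the metrics can disagree: the two queries are not component-wise equal, yet they return identical rows on this particular database, so execution accuracy would count the prediction as correct.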

DuSQL was built by harvesting 813 tables from encyclopedic sources, clustering them into 200 databases, and semi-automatically generating about 24,000 question-SQL pairs: a grammar-driven generator first produces the SQL queries, and crowdsourced workers then paraphrase them into natural-language questions.
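A grammar-driven SQL generator of this kind can be sketched as random expansion of production rules over a schema. The grammar, the schema, and all names below are illustrative assumptions, not the rules used to build DuSQL; the crowdsourced paraphrasing step is done by people and is not modeled here.

```python
import random

# Toy grammar: each non-terminal maps to a list of alternative productions.
GRAMMAR = {
    "query": [["SELECT", "select_expr", "FROM", "table", "where_clause"]],
    "select_expr": [["column"], ["agg", "(", "column", ")"]],
    "where_clause": [[], ["WHERE", "num_column", "op", "number"]],
    "agg": [["MAX"], ["MIN"], ["COUNT"]],
    "op": [[">"], ["<"], ["="]],
}

# Hypothetical schema used to fill schema-dependent symbols.
SCHEMA = {
    "table": "city",
    "columns": ["name", "province", "green_rate"],
    "numeric_columns": ["green_rate"],
}


def generate(symbol: str, schema: dict, rng: random.Random) -> list[str]:
    """Recursively expand a grammar symbol into a list of SQL tokens."""
    if symbol in GRAMMAR:
        production = rng.choice(GRAMMAR[symbol])
        tokens: list[str] = []
        for part in production:
            tokens.extend(generate(part, schema, rng))
        return tokens
    if symbol == "table":
        return [schema["table"]]
    if symbol == "column":
        return [rng.choice(schema["columns"])]
    if symbol == "num_column":
        return [rng.choice(schema["numeric_columns"])]
    if symbol == "number":
        return [str(rng.randint(0, 100))]
    return [symbol]  # literal keyword or punctuation


rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [" ".join(generate("query", SCHEMA, rng)) for _ in range(5)]
```

Each sampled query is guaranteed to follow the grammar, which is what makes the approach attractive for building a dataset: the SQL side is correct by construction, and human effort goes only into writing the matching questions.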

DuParser combines three modules: component mapping (identifying table/column mentions), label identification (SQL‑keyword tagging via a sequence‑to‑set model), and grammar combination (a CYK‑style parser that assembles a syntax tree from a set of grammar rules, selecting the best rule by similarity between SQL fragments and question fragments). The framework offers interpretability, controllability, and domain‑agnostic generalization.
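The first of these modules, component mapping, can be illustrated with a simple string-similarity matcher. This is a sketch under stated assumptions and not DuParser's actual method: it scores question n-grams against column names with `difflib.SequenceMatcher` from the Python standard library, and the function name, threshold, and schema are hypothetical.

```python
from difflib import SequenceMatcher


def map_components(question: str, columns: list[str],
                   threshold: float = 0.8) -> dict[str, tuple[str, float]]:
    """Map each column to its best-matching question span, if any.

    Spans of one to three tokens are compared with the column name
    (underscores read as spaces); matches below the threshold are dropped.
    """
    tokens = question.lower().replace("?", "").split()
    matches: dict[str, tuple[str, float]] = {}
    for n in range(1, 4):
        for i in range(len(tokens) - n + 1):
            span = " ".join(tokens[i:i + n])
            for col in columns:
                score = SequenceMatcher(None, span, col.replace("_", " ")).ratio()
                best = matches.get(col, ("", 0.0))[1]
                if score >= threshold and score > best:
                    matches[col] = (span, score)
    return matches


mentions = map_components(
    "Which cities have a green rate above 30?",
    ["name", "province", "green_rate"],
)
```

Here only `green_rate` is linked, to the span "green rate". The later modules would then decide which SQL keyword that column belongs to and how the resulting fragments combine into a full query.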

Real‑world deployments at Baidu include B2B customer‑service platforms (e.g., UNIT) and search‑engine scenarios, where Text‑to‑SQL powers structured Q&A, conversational query refinement, and enterprise table‑based search.

Remaining challenges are table‑recognition and normalization, incorporation of external knowledge for operations such as sorting or unit conversion, and extending the system to multi‑turn conversational settings that require context‑aware understanding.
