Automating High‑Quality NL2SQL Data Synthesis with Intermediate Representations

This work addresses the difficulty of incorporating extensive domain knowledge into in‑domain NL2SQL tasks. It proposes a data‑synthesis method built on an intermediate representation that decouples knowledge compliance from SQL generation, enabling automated creation of high‑quality training data about 60 times faster than manual annotation, with synthesis accuracy above 97%.

DataFunSummit

Jun 6, 2025

In‑domain NL2SQL tasks depend on a large body of domain knowledge, and that knowledge often becomes the bottleneck: it is hard to retrieve the right pieces precisely from a diverse knowledge base, and there is no guarantee that a large language model (LLM) will actually follow the retrieved knowledge when it responds.

To address these issues, we fine‑tune LLMs on data produced by a novel synthesis approach based on an intermediate representation. This representation decouples the model's ability to comply with domain knowledge from its ability to generate SQL, which allows large amounts of high‑quality data to be generated automatically.
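The summary does not describe the intermediate representation itself, so the following is only a minimal sketch of the decoupling idea under stated assumptions. The names `QueryIR`, `Filter`, `DOMAIN_METRICS`, `resolve_metric`, and `ir_to_sql`, as well as the example table and columns, are all hypothetical. Domain knowledge is applied once, when business terms are resolved into the representation, and SQL is then rendered from the representation by a deterministic function that needs no domain knowledge.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Filter:
    column: str
    op: str
    value: object


@dataclass
class QueryIR:
    """Hypothetical intermediate representation of one question."""
    table: str
    metric: str  # SQL expression already resolved from a business term
    group_by: list[str] = field(default_factory=list)
    filters: list[Filter] = field(default_factory=list)


# Hypothetical domain knowledge: business terms mapped to agreed SQL definitions.
DOMAIN_METRICS = {
    "active users": "COUNT(DISTINCT user_id)",
    "gmv": "SUM(order_amount)",
}


def resolve_metric(term: str) -> str:
    """Knowledge-compliance step: look up the agreed definition of a term."""
    return DOMAIN_METRICS[term.lower()]


def quote(value: object) -> str:
    """Render a Python value as a SQL literal (strings are single-quoted)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def ir_to_sql(ir: QueryIR) -> str:
    """SQL-generation step: deterministic rendering of the IR."""
    select_cols = ir.group_by + [f"{ir.metric} AS value"]
    sql = f"SELECT {', '.join(select_cols)} FROM {ir.table}"
    if ir.filters:
        conds = [f"{f.column} {f.op} {quote(f.value)}" for f in ir.filters]
        sql += " WHERE " + " AND ".join(conds)
    if ir.group_by:
        sql += " GROUP BY " + ", ".join(ir.group_by)
    return sql


def synthesize_example(question: str, ir: QueryIR) -> dict:
    """Pair a question with its IR and the SQL rendered from that IR."""
    return {"question": question, "ir": ir, "sql": ir_to_sql(ir)}


example = synthesize_example(
    "How many active users did each region have in 2024?",
    QueryIR(
        table="events",
        metric=resolve_metric("active users"),
        group_by=["region"],
        filters=[Filter("year", "=", 2024)],
    ),
)
print(example["sql"])
# SELECT region, COUNT(DISTINCT user_id) AS value FROM events WHERE year = 2024 GROUP BY region
```

In a sketch like this, a synthesized training pair can be checked for knowledge compliance by inspecting the representation alone, while the correctness of the SQL depends only on the renderer.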

Experiments show that the proposed method produces data 60 times faster than manual annotation, achieves a synthesis accuracy exceeding 97%, and outperforms human experts by 7 percentage points on all evaluation metrics.

Large Language Models Data Synthesis SQL Generation NL2SQL domain knowledge
