Automating High‑Quality NL2SQL Data Synthesis with Intermediate Representations
In-domain NL2SQL tasks require extensive domain knowledge that is hard to incorporate. This work proposes a data synthesis method based on an intermediate representation that decouples knowledge compliance from SQL generation. The method automates the creation of high-quality training data, running 60 times faster than manual annotation with over 97% accuracy.
In in‑domain NL2SQL tasks, abundant domain knowledge often becomes a bottleneck: it is hard to retrieve diverse knowledge precisely, and it is uncertain whether large language models (LLMs) can follow that knowledge when responding.
To address these issues, we fine‑tune LLMs using a novel data‑synthesis approach based on an intermediate representation. This representation decouples the model's ability to obey domain knowledge from its SQL generation capability, allowing automatic generation of large amounts of high‑quality data.
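To make the decoupling idea concrete, here is a minimal sketch, not the authors' actual representation or pipeline. It assumes a toy intermediate representation (`QueryIR`) and a hypothetical dictionary of domain terms (`DOMAIN_TERMS`). The table, columns, and term definitions are invented for illustration. Domain knowledge is resolved once, at the IR level, and both the SQL and the natural-language question are rendered deterministically from the same IR, so each synthesized pair agrees by construction.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Filter:
    column: str
    op: str
    value: Union[int, float, str]


@dataclass(frozen=True)
class QueryIR:
    """Toy intermediate representation (hypothetical, for illustration only)."""
    table: str
    aggregate: str                 # SQL aggregate expression, e.g. "COUNT(*)"
    aggregate_phrase: str          # matching NL phrase, e.g. "How many"
    terms: Tuple[str, ...] = ()    # domain terms, resolved via DOMAIN_TERMS
    group_by: Optional[str] = None


# Hypothetical domain knowledge: business term -> concrete filter condition.
DOMAIN_TERMS = {
    "active": Filter("last_login_days", "<=", 30),
    "premium": Filter("plan", "=", "premium"),
}


def sql_literal(value):
    """Render a Python value as a SQL literal, escaping single quotes."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def render_sql(ir):
    """Deterministically render SQL; domain terms expand to WHERE conditions."""
    select_cols = [ir.group_by] if ir.group_by else []
    select_cols.append(ir.aggregate)
    sql = f"SELECT {', '.join(select_cols)} FROM {ir.table}"
    conditions = []
    for term in ir.terms:
        f = DOMAIN_TERMS[term]
        conditions.append(f"{f.column} {f.op} {sql_literal(f.value)}")
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    if ir.group_by:
        sql += f" GROUP BY {ir.group_by}"
    return sql


def render_question(ir):
    """Render a template question that uses the domain terms verbatim."""
    words = [ir.aggregate_phrase, *ir.terms, ir.table, "are there"]
    question = " ".join(words)
    if ir.group_by:
        question += f" in each {ir.group_by}"
    return question + "?"


def synthesize_pair(ir):
    """Produce one (question, SQL) training example from a single IR."""
    return {"question": render_question(ir), "sql": render_sql(ir)}


example_ir = QueryIR(
    table="users",
    aggregate="COUNT(*)",
    aggregate_phrase="How many",
    terms=("active",),
    group_by="region",
)
example_pair = synthesize_pair(example_ir)
```

In this sketch, changing the definition of "active" in `DOMAIN_TERMS` changes every generated SQL statement consistently, without touching the rendering logic. A real pipeline would presumably also paraphrase the template questions and validate the SQL against a database, but those steps are outside what this example shows.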
Experiments show that the proposed method produces data 60 times faster than manual annotation, achieves a synthesis accuracy exceeding 97%, and outperforms human experts by 7 percentage points on all evaluation metrics.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.