Boosting Text-to-SQL Accuracy with J‑Schema, Iterative DPO, and Self‑Consistency
This article examines the evolution of Text-to-SQL, introduces the J‑Schema representation and chain-of-thought prompting, applies iterative DPO training and self-consistency voting, and demonstrates how these techniques raise execution accuracy on the BIRD benchmark from 56.6% to 69.2%.
Introduction
Text-to-SQL (NL2SQL) converts natural-language questions into executable SQL queries over a relational database, enabling non-technical users to retrieve information from large datasets.
Evolution of Text-to-SQL
The field has progressed through four stages:
Rule-based stage: Early systems relied on handcrafted rules and statistical language models, handling only single-table queries.
Neural-network stage: Sequence and graph neural models improved synonym handling and multi-table reasoning but were limited by model size and data.
Pre-trained language model stage: Models such as BERT and T5 brought substantial gains in language understanding.
Large-language-model stage: LLMs now dominate, with research focusing on prompt engineering and fine-tuning.
Current Challenges
Three major difficulties remain:
Prompt optimization: How to guide LLMs to produce clear reasoning steps and design database schemas that are easy for the model to understand.
Model training: How to improve the base capabilities of the model through training.
Inference enhancement: How to make the model’s output more stable and reliable.
J-Schema and Prompt Design
We propose J-Schema, a fully structured representation of database schemas that uses special markers such as #DB_ID, #Table, and #Foreign keys. Each table includes its name, and each column is described with basic_info containing the column name, description, primary-key flag, and example values.
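A minimal sketch of how such a serializer might look. The marker names (#DB_ID, #Table, #Foreign keys) and the basic_info fields follow the description above; the exact line layout and separators are assumptions, since the article does not spell them out.

```python
def render_j_schema(db_id, tables, foreign_keys):
    """Render a database schema into a J-Schema-style text block (sketch).

    tables: {table_name: [(col_name, description, is_pk, examples), ...]}
    foreign_keys: [("src_table.src_col", "dst_table.dst_col"), ...]
    """
    lines = [f"#DB_ID: {db_id}"]
    for table, columns in tables.items():
        lines.append(f"#Table: {table}")
        for name, desc, is_pk, examples in columns:
            # basic_info: column name, description, primary-key flag, examples
            lines.append(f"  ({name}, {desc}, primary_key={is_pk}, examples={examples})")
    lines.append("#Foreign keys")
    for src, dst in foreign_keys:
        lines.append(f"  {src} = {dst}")
    return "\n".join(lines)
```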
To balance the number of example values with context length, we apply the following rules:
For date/time types (DATE, TIME, DATETIME, TIMESTAMP), keep a single example.
For the floating-point type (REAL), keep a single example rounded to two decimal places.
For the integer type (INTEGER), keep a single example.
For the text type (TEXT), keep multiple examples when each value is shorter than 10 tokens.
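These rules can be captured in a small selection function. This is a sketch under assumptions: character length stands in for token length, and the cap on the number of TEXT examples (three here) is not specified in the article.

```python
def pick_examples(col_type, values):
    """Select example values for a column per the type rules above (sketch)."""
    if col_type in {"DATE", "TIME", "DATETIME", "TIMESTAMP"}:
        return values[:1]                      # one example for date/time types
    if col_type == "REAL":
        # one example, rounded to two decimal places
        return [round(float(values[0]), 2)] if values else []
    if col_type == "INTEGER":
        return values[:1]                      # one example for integers
    if col_type == "TEXT":
        # multiple short examples; length threshold and cap are assumptions
        return [v for v in values if len(v) < 10][:3]
    return values[:1]
```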
Chain-of-Thought Prompting
We embed the complete database information, the user query, and external knowledge in the prompt, then ask the model to generate intermediate reasoning steps inside <think>…</think> tags and the final SQL answer inside <answer>…</answer> tags.
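A sketch of the prompt assembly and answer extraction. The template wording is an assumption; only the tag names `<think>…</think>` and `<answer>…</answer>` come from the description above.

```python
import re

PROMPT_TEMPLATE = """{schema}

External knowledge: {knowledge}
Question: {question}

Reason step by step inside <think>...</think>, then give the final SQL
inside <answer>...</answer>."""

def extract_sql(model_output):
    """Pull the final SQL out of the <answer> tags; None if the tags are missing."""
    m = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    return m.group(1).strip() if m else None
```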
Iterative DPO Training
We adopt an iterative Direct Preference Optimization (DPO) procedure. In each iteration we sample multiple chain-of-thought reasoning traces and final answers, construct positive and negative preference pairs, and fine-tune the model. Repeating this process yields progressively higher execution accuracy, saturating after the third iteration.
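The pair-construction step might be sketched as follows, assuming correctness is judged by comparing execution results against the gold result; the one-to-one pairing of correct and incorrect samples is an assumption, as the article does not describe the pairing strategy.

```python
def build_preference_pairs(samples, gold_result, execute):
    """Split sampled completions into DPO (chosen, rejected) pairs (sketch).

    samples: model completions (reasoning trace + SQL) for one question
    execute: callable that runs a completion's SQL and returns its result
    """
    correct = [s for s in samples if execute(s) == gold_result]
    wrong = [s for s in samples if execute(s) != gold_result]
    # Pair correct completions with incorrect ones; extras are dropped.
    return list(zip(correct, wrong))
</```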
Hyper-parameter scanning shows that a DPO loss weight (beta) of 0.5 gives the best execution accuracy.
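For reference, the standard per-pair DPO loss with the reported beta looks like this; the log-probability inputs are placeholders for sums over the chosen and rejected completions under the policy and reference models.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.5):
    """Per-pair DPO loss: -log sigmoid(beta * implicit-reward margin).

    Inputs are log-probabilities of the chosen/rejected completions under
    the policy (pi_*) and frozen reference model (ref_*); beta=0.5 matches
    the best setting reported above.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```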
Self-Consistency Voting
We generate multiple candidate SQL answers for each question and select the best one using voting. Two strategies are explored:
Hard voting: Binary correctness decides the vote.
Soft voting: Votes are weighted by similarity between candidates.
Soft voting consistently outperforms hard voting, providing over 1% absolute improvement in execution accuracy across checkpoints.
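Both strategies can be sketched in a few lines. The details are assumptions: hard voting is taken to mean majority voting over execution results, and soft voting scores each candidate by its total pairwise similarity to the others under some `similarity` function the article does not specify.

```python
from collections import Counter

def hard_vote(candidates, execute):
    """Hard voting (sketch): candidates whose SQL yields the same execution
    result pool their votes; return one candidate from the majority group."""
    results = [execute(sql) for sql in candidates]
    winner, _ = Counter(results).most_common(1)[0]
    return candidates[results.index(winner)]

def soft_vote(candidates, similarity):
    """Soft voting (sketch): weight each candidate by its total similarity
    to the other candidates and return the highest-scoring one."""
    def score(i):
        return sum(similarity(candidates[i], candidates[j])
                   for j in range(len(candidates)) if j != i)
    return candidates[max(range(len(candidates)), key=score)]
```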
Future Directions
We plan to:
Construct higher-quality data by filtering the 2.5M-sample SynSQL-2.5M dataset for relevance to BIRD.
Explore alternative training methods such as GRPO, which uses group-wise relative rewards without explicit preference pairs.
Validate the proposed optimizations on additional benchmarks like Spider, ScienceBenchmark, and EHRSQL, and eventually apply them to real-world DataAgent scenarios.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.
