Boosting Text-to-SQL Accuracy: J‑Schema, Iterative DPO, and Self‑Consistency
This paper presents a comprehensive approach to improving Text‑to‑SQL performance: J‑Schema for structured database representation, chain‑of‑thought prompting, iterative DPO training, and self‑consistency voting. Together, these techniques raise execution accuracy on the BIRD benchmark from 56.6% to 69.2%.
Abstract
Technical background: Text2SQL converts natural language queries to SQL, evolving through rule‑based, neural, pretrained language model, and large language model stages. Current challenges are prompt optimization, model training, and inference enhancement, investigated on the BIRD dataset.
Method: Propose J‑Schema to present database structure with example values, combined with chain‑of‑thought prompting. Use Iterative DPO for multi‑round training, and a self‑consistency voting mechanism (hard/soft) to select the best answer, with soft voting performing better.
Result: Address three major challenges, raising execution accuracy on BIRD from 56.6% to 69.2%.
1. Text2SQL Challenges
Text‑to‑SQL (NL2SQL) aims to generate SQL queries from natural language that correctly reflect user intent and execute on relational databases. It enables non‑expert users to access complex datasets.
The evolution includes four stages: 1) Rule‑based, focusing on single‑table queries; 2) Neural networks (seq2seq, GNN) improving synonym handling and multi‑table support; 3) Pretrained language models (BERT, T5) enhancing natural language understanding; 4) Large language models (LLMs) with strong emergent abilities, shifting research toward prompt engineering and fine‑tuning.
Current difficulties:
Prompt optimization: How to guide LLMs to produce clear reasoning and design schemas that are easy for the model to understand?
Model training: How to improve the model’s fundamental capabilities through training?
Inference enhancement: How to make LLM outputs more stable and reliable?
The BIRD benchmark contains 12,751 question‑SQL pairs across 95 large databases spanning 37 domains, providing a comprehensive testbed.
2. Prompt & J‑Schema
We introduce J‑Schema, a fully structured representation of database schemas using special markers such as #DB_ID, #Table, and #Foreign keys. Each table lists its name, and columns are described with basic_info including name, description, primary‑key flag, and example values.
Example value rules:
For date/time types (DATE, TIME, DATETIME, TIMESTAMP) keep a single example.
For floating‑point type (REAL) keep a two‑decimal example.
For integer type keep one example.
For text type keep examples based on token length; if fewer than 10 tokens, provide multiple examples.
These rules are derived from analysis of erroneous samples, showing that richer examples for date, float, and integer have limited impact, while diverse examples for text improve retrieval.
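To make the format concrete, here is a minimal sketch of rendering a schema into J‑Schema text. The marker names (#DB_ID, #Table, #Foreign keys) come from the paper; the exact line layout, the `render_j_schema` helper, and the example schema are assumptions for illustration.

```python
# Hypothetical renderer for the J-Schema text format described above.
# Only the marker names are from the paper; the layout is an assumption.

def render_j_schema(db_id, tables, foreign_keys):
    """tables: {table_name: [(col_name, description, is_pk, examples), ...]}"""
    lines = [f"#DB_ID: {db_id}"]
    for table, columns in tables.items():
        lines.append(f"#Table: {table}")
        for name, desc, is_pk, examples in columns:
            pk = ", primary key" if is_pk else ""
            ex = "; ".join(str(e) for e in examples)
            lines.append(f"  ({name}: {desc}{pk}, examples: {ex})")
    lines.append("#Foreign keys")
    for src, dst in foreign_keys:
        lines.append(f"  {src} -> {dst}")
    return "\n".join(lines)

schema_text = render_j_schema(
    "school",
    {"students": [("id", "student id", True, [1]),
                  ("name", "student name", False, ["Alice", "Bob"])]},
    [("students.school_id", "schools.id")],
)
```

Note how the text column carries multiple example values while the integer column keeps one, matching the example-value rules above.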
Chain‑of‑thought prompting is used: the full database information is provided, followed by the user query and external knowledge, then the model is instructed to generate reasoning steps inside <think> … </think> and the final SQL inside <answer> … </answer>.
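The tagged output format above makes the final SQL easy to parse out of a generation. A minimal sketch, assuming the `<think>`/`<answer>` tags from the paper and a simple regex-based extractor (the parsing logic itself is an assumption):

```python
import re

# Parse the model output format described above: reasoning inside
# <think>...</think> and the final SQL inside <answer>...</answer>.

def extract_sql(model_output: str) -> str:
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    return match.group(1).strip() if match else ""

out = ("<think>Join students to schools on school_id.</think>"
       "<answer>SELECT name FROM students;</answer>")
print(extract_sql(out))  # SELECT name FROM students;
```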
3. Training Method
Iterative DPO
Direct Preference Optimization (DPO) aligns pretrained models with human preferences more effectively than supervised fine‑tuning alone. We apply an iterative DPO process: after each iteration, the updated model generates new preference pairs from sampled chain‑of‑thought steps and answers, which are used for the next round of training.
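The per-pair objective in each round can be sketched as follows. The default β = 0.5 matches the best value from the hyper-parameter scan below; the log-probability arguments here are illustrative numbers, not real model outputs.

```python
import math

# Standard DPO loss for one preference pair:
#   -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
# where pi_* / ref_* are sequence log-probs under the policy and the
# frozen reference model for the chosen (c) and rejected (r) answers.

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.5):
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy assigns no extra margin to the chosen answer, the loss sits at -log 0.5 ≈ 0.693; it falls as the policy separates chosen from rejected relative to the reference model.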
Performance improves across iterations, peaking at the third stage.
Model               Avg CoT tokens   Exec Acc
Qwen2.5-Coder-32B   334              63.69%
iterative stage1    377              65.78%
iterative stage2    377              67.08%
iterative stage3    380              67.60%
iterative stage4    384              67.40%

Hyper‑parameter Scan
We vary the DPO loss weight β from 0.1 to 0.6 (other settings fixed). The highest execution accuracy is achieved at β = 0.5.
4. Self‑Consistency
Self‑consistency generates multiple candidate answers for the same question and selects the best via voting, rather than relying on a single generation.
Two voting strategies:
Hard voting: based on binary correctness of each answer.
Soft voting: considers similarity between answers, allowing semantically equivalent or near‑correct results to contribute.
Soft voting consistently outperforms hard voting, yielding over 1% accuracy gain.
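The two strategies can be sketched on candidate execution results. The paper does not specify its similarity measure; the row-overlap (Jaccard) scoring below, and the example candidates, are assumptions for illustration.

```python
from collections import Counter

# Candidates are execution results (lists of rows) from multiple samples.

def hard_vote(results):
    """Pick the most frequent execution result (exact match only)."""
    counts = Counter(tuple(r) for r in results)
    winner, _ = counts.most_common(1)[0]
    return winner

def soft_vote(results):
    """Score each candidate by total row-overlap with all candidates,
    so near-identical results reinforce each other."""
    def overlap(a, b):
        sa, sb = set(a), set(b)
        return len(sa & sb) / max(len(sa | sb), 1)
    scores = [sum(overlap(r, other) for other in results) for r in results]
    return results[scores.index(max(scores))]

cands = [[("Alice",), ("Bob",)],
         [("Alice",), ("Bob",)],
         [("Alice",)]]
```

Here hard voting only counts exact duplicates, while soft voting also credits the third candidate's partial agreement; both select the two-row result in this example.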
Model              No SC    Hard Vote   Soft Vote
iterative stage3   67.60%   67.93%      68.97%
iterative stage4   67.40%   67.40%      68.45%

5. Future Directions
Potential work includes constructing higher‑quality data from the SynSQL‑2.5M dataset (2.5 million samples), exploring alternative training methods such as GRPO, refining label quality with LLM‑as‑Judge, and evaluating on additional benchmarks like Spider, ScienceBenchmark, and EHRSQL.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.