Boosting Text-to-SQL Accuracy: J‑Schema, Iterative DPO, and Self‑Consistency
This paper presents a comprehensive approach to improving Text‑to‑SQL performance: J‑Schema for structured database representation, chain‑of‑thought prompting, iterative DPO training, and self‑consistency voting. Together, these techniques raise execution accuracy on the BIRD benchmark from 56.6% to 69.2%.
Abstract
Technical background: Text2SQL converts natural language queries to SQL, evolving through rule‑based, neural, pretrained language model, and large language model stages. Current challenges are prompt optimization, model training, and inference enhancement, investigated on the BIRD dataset.
Method: Propose J‑Schema to present database structure with example values, combined with chain‑of‑thought prompting. Use Iterative DPO for multi‑round training, and a self‑consistency voting mechanism (hard/soft) to select the best answer, with soft voting performing better.
Result: Address three major challenges, raising execution accuracy on BIRD from 56.6% to 69.2%.
1. Text2SQL Challenges
Text‑to‑SQL (NL2SQL) aims to generate SQL queries from natural language that correctly reflect user intent and execute on relational databases. It enables non‑expert users to access complex datasets.
The evolution includes four stages: 1) Rule‑based, focusing on single‑table queries; 2) Neural networks (seq2seq, GNN) improving synonym handling and multi‑table support; 3) Pretrained language models (BERT, T5) enhancing natural language understanding; 4) Large language models (LLMs) with strong emergent abilities, shifting research toward prompt engineering and fine‑tuning.
Current difficulties:
Prompt optimization: How to guide LLMs to produce clear reasoning and design schemas that are easy for the model to understand?
Model training: How to improve the model’s fundamental capabilities through training?
Inference enhancement: How to make LLM outputs more stable and reliable?
The BIRD benchmark contains 12,751 question‑SQL pairs across 95 large databases spanning 37 domains, providing a comprehensive testbed.
2. Prompt & J‑Schema
We introduce J‑Schema, a fully structured representation of database schemas using special markers such as #DB_ID, #Table, and #Foreign keys. Each table lists its name, and columns are described with basic_info including name, description, primary‑key flag, and example values.
Example value rules:
For date/time types (DATE, TIME, DATETIME, TIMESTAMP) keep a single example.
For floating‑point type (REAL) keep a two‑decimal example.
For integer type keep one example.
For text type keep examples based on token length; if fewer than 10 tokens, provide multiple examples.
These rules are derived from analysis of erroneous samples, showing that richer examples for date, float, and integer have limited impact, while diverse examples for text improve retrieval.
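To make the format concrete, here is a minimal sketch of rendering a schema into J‑Schema text. The marker names (#DB_ID, #Table, #Foreign keys) come from the paper; the exact line layout, the `render_j_schema` helper, and the example schema are assumptions for illustration.

```python
# Hypothetical renderer for the J-Schema text format described above.
# Only the marker names are from the paper; the layout is an assumption.

def render_j_schema(db_id, tables, foreign_keys):
    """tables: {table_name: [(col_name, description, is_pk, examples), ...]}"""
    lines = [f"#DB_ID: {db_id}"]
    for table, columns in tables.items():
        lines.append(f"#Table: {table}")
        for name, desc, is_pk, examples in columns:
            pk = ", primary key" if is_pk else ""
            ex = "; ".join(str(e) for e in examples)
            lines.append(f"  ({name}: {desc}{pk}, examples: {ex})")
    lines.append("#Foreign keys")
    for src, dst in foreign_keys:
        lines.append(f"  {src} -> {dst}")
    return "\n".join(lines)

schema_text = render_j_schema(
    "school",
    {"students": [("id", "student id", True, [1]),
                  ("name", "student name", False, ["Alice", "Bob"])]},
    [("students.school_id", "schools.id")],
)
```

Note how the text column carries multiple example values while the integer column keeps one, matching the example-value rules above.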
Chain‑of‑thought prompting is used: the full database information is provided, followed by the user query and external knowledge, then the model is instructed to generate reasoning steps inside <think> … </think> and the final SQL inside <answer> … </answer>.
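The tagged output format above makes the final SQL easy to parse out of a generation. A minimal sketch, assuming the `<think>`/`<answer>` tags from the paper and a simple regex-based extractor (the parsing logic itself is an assumption):

```python
import re

# Parse the model output format described above: reasoning inside
# <think>...</think> and the final SQL inside <answer>...</answer>.

def extract_sql(model_output: str) -> str:
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    return match.group(1).strip() if match else ""

out = ("<think>Join students to schools on school_id.</think>"
       "<answer>SELECT name FROM students;</answer>")
print(extract_sql(out))  # SELECT name FROM students;
```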
3. Training Method
Iterative DPO
Direct Preference Optimization (DPO) aligns pretrained models with human preferences more effectively than supervised fine‑tuning alone. We apply an iterative DPO process: after each iteration, the updated model generates new preference pairs from sampled chain‑of‑thought steps and answers, which are used for the next round of training.
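The per-pair objective in each round can be sketched as follows. The default β = 0.5 matches the best value from the hyper-parameter scan below; the log-probability arguments here are illustrative numbers, not real model outputs.

```python
import math

# Standard DPO loss for one preference pair:
#   -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
# where pi_* / ref_* are sequence log-probs under the policy and the
# frozen reference model for the chosen (c) and rejected (r) answers.

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.5):
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy assigns no extra margin to the chosen answer, the loss sits at -log 0.5 ≈ 0.693; it falls as the policy separates chosen from rejected relative to the reference model.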
Performance improves across iterations, peaking at the third stage.
Model               Avg CoT tokens   Exec Acc
Qwen2.5-Coder-32B   334              63.69%
iterative stage1    377              65.78%
iterative stage2    377              67.08%
iterative stage3    380              67.60%
iterative stage4    384              67.40%

Hyper‑parameter Scan
We vary the DPO loss weight β from 0.1 to 0.6 (other settings fixed). The highest execution accuracy is achieved at β = 0.5.
4. Self‑Consistency
Self‑consistency generates multiple candidate answers for the same question and selects the best via voting, rather than relying on a single generation.
Two voting strategies:
Hard voting: based on binary correctness of each answer.
Soft voting: considers similarity between answers, allowing semantically equivalent or near‑correct results to contribute.
Soft voting consistently outperforms hard voting, yielding over 1% accuracy gain.
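The two strategies can be sketched on candidate execution results. The paper does not specify its similarity measure; the row-overlap (Jaccard) scoring below, and the example candidates, are assumptions for illustration.

```python
from collections import Counter

# Candidates are execution results (lists of rows) from multiple samples.

def hard_vote(results):
    """Pick the most frequent execution result (exact match only)."""
    counts = Counter(tuple(r) for r in results)
    winner, _ = counts.most_common(1)[0]
    return winner

def soft_vote(results):
    """Score each candidate by total row-overlap with all candidates,
    so near-identical results reinforce each other."""
    def overlap(a, b):
        sa, sb = set(a), set(b)
        return len(sa & sb) / max(len(sa | sb), 1)
    scores = [sum(overlap(r, other) for other in results) for r in results]
    return results[scores.index(max(scores))]

cands = [[("Alice",), ("Bob",)],
         [("Alice",), ("Bob",)],
         [("Alice",)]]
```

Here hard voting only counts exact duplicates, while soft voting also credits the third candidate's partial agreement; both select the two-row result in this example.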
Model              No SC    Hard Vote   Soft Vote
iterative stage3   67.60%   67.93%      68.97%
iterative stage4   67.40%   67.40%      68.45%

5. Future Directions
Potential work includes constructing higher‑quality data from the SynSQL‑2.5M dataset (2.5 million samples), exploring alternative training methods such as GRPO, refining label quality with LLM‑as‑Judge, and evaluating on additional benchmarks like Spider, ScienceBenchmark, and EHRSQL.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.