Boosting Text-to-SQL Accuracy: J‑Schema, Iterative DPO, and Self‑Consistency
This article surveys the evolution of Text-to-SQL, introduces the J‑Schema representation and chain-of-thought prompting, details an iterative DPO training pipeline with hyper‑parameter tuning, and demonstrates how self‑consistency voting boosts execution accuracy on the BIRD benchmark from 56.6% to 69.2%.
01 Text2SQL Challenges
Text‑to‑SQL (NL2SQL) converts natural‑language queries into executable SQL statements for relational databases, enabling non‑experts to retrieve information from complex data stores. The field has progressed through four stages: rule‑based methods, neural‑network models, pretrained language models (e.g., BERT, T5), and large language models (LLMs) that dominate current research.
Three major challenges remain: prompt optimization (guiding LLMs with clear reasoning and schema design), model training (enhancing base capabilities), and inference enhancement (stabilizing LLM outputs). Experiments are conducted on the BIRD benchmark, which contains 12,751 unique question‑SQL pairs across 95 large databases spanning 37 domains.
02 Prompt & J‑Schema
To help LLMs understand database structure, the authors propose J‑Schema, a fully structured representation that uses special markers: #DB_ID for the database, #Table for each table, #Foreign keys for foreign‑key relations, and a basic_info block per column covering its name, description, primary‑key flag, and example values.
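The serialization can be sketched as follows. The dataclass fields and helper names here are illustrative assumptions, not the authors' exact implementation; only the #DB_ID, #Table, and #Foreign keys markers come from the article.

```python
# Sketch of serializing a database schema into the J-Schema markers
# described above (assumed field names; not the authors' exact code).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Column:
    name: str
    description: str
    is_primary: bool = False
    examples: List[str] = field(default_factory=list)

@dataclass
class Table:
    name: str
    columns: List[Column]

def to_j_schema(db_id: str, tables: List[Table], foreign_keys: List[str]) -> str:
    lines = [f"#DB_ID: {db_id}"]
    for t in tables:
        lines.append(f"#Table: {t.name}")
        for c in t.columns:
            # basic_info: name, description, primary-key flag, example values
            pk = ", primary key" if c.is_primary else ""
            ex = f", examples: {c.examples}" if c.examples else ""
            lines.append(f"  ({c.name}, {c.description}{pk}{ex})")
    lines.append("#Foreign keys")
    lines.extend(f"  {fk}" for fk in foreign_keys)
    return "\n".join(lines)
```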
Example‑value rules balance richness and context length: one example for DATE/TIME types, two decimal places for REAL, a single example for INTEGER, and multiple examples for TEXT when token length is under 10. Analysis of error samples shows that richer examples for textual columns significantly improve retrieval accuracy.
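The per-type rules above might be implemented roughly as below. The exact token-counting method is not specified in the article; whitespace word count is used here as a stand-in, and the cap of three TEXT examples is an assumption.

```python
# Sketch of the example-value selection rules described above
# (assumed details: word-count tokenization, cap of 3 TEXT examples).
def pick_examples(col_type: str, values: list) -> list:
    t = col_type.upper()
    if t in ("DATE", "TIME", "DATETIME"):
        return values[:1]                        # one example for DATE/TIME
    if t == "REAL":
        return [round(float(values[0]), 2)] if values else []  # two decimals
    if t == "INTEGER":
        return values[:1]                        # a single example
    if t == "TEXT":
        # multiple short examples; "token length under 10" approximated
        # by whitespace word count
        return [v for v in values if len(str(v).split()) < 10][:3]
    return values[:1]
```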
Question: What is the city location of the high school level school with Lunch Provision 2 whose lowest grade is 9 and the highest grade is 12 in the county of Merced?

Correct answer:
SELECT T2.City FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode WHERE T1.`NSLP Provision Status` = 'Lunch Provision 2' AND T2.County = 'Merced' AND T1.`Low Grade` = 9 AND T1.`High Grade` = 12 AND T2.EILCode = 'HS';

Wrong answer:
SELECT T2.City FROM main.frpm AS T1 INNER JOIN main.schools AS T2 ON T1.CDSCode = T2.CDSCode WHERE T1."NSLP Provision Status" = 'Breakfast Provision 2' AND T1."Low Grade" = '9' AND T1."High Grade" = '12' AND T1."County Name" = 'Merced';

Analysis: The model retrieved "Breakfast Provision 2" instead of "Lunch Provision 2" because the example values did not contain the correct provision string; adding the correct example resolves the error.
Question: How many accounts who choose issuance after transaction are staying in East Bohemia region?

Correct answer:
SELECT COUNT(T2.account_id) FROM district AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id WHERE T1.A3 = 'East Bohemia' AND T2.frequency = 'POPLATEK PO OBRATU';

Wrong answer:
SELECT COUNT(T1.account_id) FROM main.account AS T1 INNER JOIN main.district AS T2 ON T1.district_id = T2.district_id WHERE T1.frequency = 'POPLATEK PO OBRATU' AND T2.A3 = 'east Bohemia';

Analysis: Inconsistent casing between the query ("East Bohemia") and the example value ("east Bohemia") caused a mismatch; providing a correctly cased example improves model performance.
03 Training Methods
Iterative DPO
Direct Preference Optimization (DPO) aligns pretrained LLMs with human preferences more effectively than supervised fine‑tuning. The authors apply DPO iteratively: each iteration samples multiple chain‑of‑thought reasoning steps and final answers, builds positive and negative preference pools, constructs preference pairs based on distance, and retrains the model. Performance improves across iterations until saturation.
Execution accuracy on BIRD after each iteration:
Qwen2.5‑Coder‑32B (baseline): 63.69% (average CoT length 334 tokens)
Iterative stage 1: 65.78% (CoT length 377)
Iterative stage 2: 67.08% (CoT length 377)
Iterative stage 3: 67.60% (CoT length 380) – highest accuracy
Iterative stage 4: 67.40% (CoT length 384)
Hyper‑parameter Scan
The DPO loss weight β was varied from 0.1 to 0.6 (other settings fixed, two epochs per run). The best execution accuracy (69.2%) was achieved at β = 0.5, which was used for all subsequent DPO iterations.
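For reference, β enters the standard DPO objective as a scale on the log-ratio margin between chosen and rejected answers; a minimal per-pair sketch, assuming the canonical DPO formulation rather than any JD-specific variant:

```python
# Per-pair DPO loss: -log sigmoid(beta * margin), where the margin is
# the policy-vs-reference log-prob gap between chosen and rejected.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.5):
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

Larger β penalizes deviations from the reference model more sharply; β = 0.5 was the sweet spot in the scan above.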
04 Self‑Consistency
Self‑consistency generates multiple candidate answers for the same question and selects the final answer via voting. Two voting schemes are explored:
Hard voting: each candidate casts one all‑or‑nothing vote, and the majority answer wins.
Soft voting: similarity scores between candidates are used, allowing near‑correct or semantically equivalent answers to contribute.
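The two schemes can be sketched as below. The similarity metric (difflib) is an illustrative stand-in for whatever the authors use, and in practice hard voting would group candidates by execution result rather than raw SQL text.

```python
# Sketch of hard vs. soft voting over sampled candidate SQL strings.
from collections import Counter
import difflib

def hard_vote(candidates):
    # majority vote: every candidate counts equally
    return Counter(candidates).most_common(1)[0][0]

def soft_vote(candidates):
    # each candidate scores the sum of its similarities to all others,
    # so near-equivalent answers reinforce one another
    def score(i):
        return sum(difflib.SequenceMatcher(None, candidates[i], candidates[j]).ratio()
                   for j in range(len(candidates)) if j != i)
    return candidates[max(range(len(candidates)), key=score)]
```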
Applying self‑consistency to the iterative models yields more than one percentage point of absolute improvement in execution accuracy under soft voting, which consistently outperforms hard voting.
Iterative stage 3: without self‑consistency 67.60%, hard voting 67.93%, soft voting 68.97%.
Iterative stage 4: without self‑consistency 67.40%, hard voting 67.40%, soft voting 68.45%.
05 Future Exploration
Planned directions include:
Data construction: Leveraging the large‑scale SynSQL‑2.5M dataset (2.5 million samples from more than 16,000 databases) to select high‑quality subsets that complement BIRD.
Alternative training methods: Exploring GRPO, which uses group‑wise relative rewards without pre‑built preference pairs, and LLM‑as‑Judge to identify and remove mislabeled samples.
Broader evaluation: Testing on additional benchmarks such as Spider, ScienceBenchmark, and EHRSQL, and gradually deploying the approach in real‑world DataAgent scenarios for enterprise databases.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.