Can LLMs Fix Real-World SQL Bugs? Inside the BIRD-CRITIC Benchmark
This article introduces BIRD-CRITIC, a comprehensive multi-dialect SQL diagnostic benchmark that evaluates large language models' ability to repair real-world database queries. It covers the benchmark's design, multi-dialect support, data-quality process, and experimental results.
Preface
When researching AI4SQL/AI4DB/DB4AI products, we discovered that improving SQL capabilities heavily depends on high‑quality datasets and synthetic data for training and evaluation. To help developers quickly access resources, we have compiled a list of publicly available Text2SQL/NL2SQL datasets.
In the first issue we recommended 23 datasets. This issue introduces the BIRD‑CRITIC dataset from the BIRD team.
Challenging SQL Diagnostic Benchmark
What is BIRD‑CRITIC?
BIRD‑CRITIC (also known as SWE‑SQL) is the first SQL diagnostic benchmark, created to answer whether large language models (LLMs) can fix user problems in real‑world database applications.
The benchmark contains 600 development tasks and 200 out‑of‑distribution (OOD) test tasks, covering four mainstream open‑source SQL dialects (MySQL, PostgreSQL, SQL Server, Oracle). It goes beyond simple SELECT queries, encompassing a wide range of SQL operations and includes an execution‑based evaluation environment for strict and efficient validation.
Dataset versions:
bird-critic-1.0-flash-exp: lightweight version with 200 PostgreSQL instances.
bird-critic-1.0-open: full version covering all four dialects with 600 instances.
bird-critic-1.0-postgresql: PostgreSQL‑only version with 600 instances.
bird-critic-1.0-bigquery: BigQuery version with 200 tasks.
Can LLMs Solve Real Database User Problems?
On July 9, 2025, the BIRD team released human performance scores for the dataset, which lead LLMs by a large margin.
Three leaderboards display scores from human evaluators (database experts) who could use standard tools but not AI assistants. When another team was allowed to use AI tools (ChatGPT, Claude, Gemini), performance improved dramatically: 83.33 on bird‑critic‑1.0‑open, 87.90 on bird‑critic‑1.0‑postgresql, and 90.00 on bird‑critic‑1.0‑flash‑exp, demonstrating the potential of human‑AI collaboration in SQL problem solving.
Dataset Characteristics
BIRD‑CRITIC’s core data consists of three elements: the problematic SQL statement, a natural‑language description of the issue, and the database schema, focusing on evaluating a model’s ability to repair erroneous SQL based on user descriptions.
Example A
In a financial database, I want to forward‑fill all nullable columns using a function that takes a table name, an ID column, and a row‑number column. For the trans table, I need to group by account_id and order by date. The original function had syntax errors; I need a corrected version that works for any table with nullable columns.
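The forward-fill idea in Example A can be sketched with the classic "running count of non-NULL values" window-function trick: rows are bucketed by how many non-NULL values precede them, and each bucket then contains exactly one non-NULL value to propagate. The sketch below adapts the idea to SQLite for a self-contained demo (the benchmark task targets PostgreSQL, where the same pattern works); the `amount` column and the sample data are hypothetical stand-ins for the nullable columns of the `trans` table.

```python
import sqlite3

# Hypothetical trans table: account_id groups, date ordering, nullable amount.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trans (id INTEGER, account_id TEXT, date TEXT, amount REAL);
    INSERT INTO trans VALUES
        (1, 'A', '2024-01-01', 100),
        (2, 'A', '2024-01-02', NULL),
        (3, 'A', '2024-01-03', NULL),
        (4, 'A', '2024-01-04', 200),
        (5, 'B', '2024-01-01', 50),
        (6, 'B', '2024-01-02', NULL);
""")

# grp = running count of non-NULL amounts per account; each grp bucket holds
# one non-NULL value, so MAX() over (account_id, grp) forward-fills it.
rows = conn.execute("""
    SELECT id, account_id, date,
           MAX(amount) OVER (PARTITION BY account_id, grp) AS amount_filled
    FROM (
        SELECT *, COUNT(amount) OVER (
                      PARTITION BY account_id ORDER BY date
                  ) AS grp
        FROM trans
    )
    ORDER BY account_id, date
""").fetchall()

for r in rows:
    print(r)
# Rows 2 and 3 now carry 100.0 forward; row 6 carries 50.0.
```

Wrapping this query in a function parameterized by table name, ID column, and ordering column (as the task asks) would be done with dynamic SQL, e.g. PL/pgSQL's `EXECUTE format(...)` in PostgreSQL.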
Example B
In the card_type table, hierarchical data links each row to its parent via parent_uuid. I need a recursive query that returns a tree‑structured output where each parent lists its direct children as an array of UUIDs, rather than a flat list of parent UUIDs.
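A minimal sketch of Example B's shape: a recursive CTE walks the hierarchy from the roots, and a correlated aggregate collects each node's direct children. PostgreSQL would return the children as a real array via `array_agg`; the demo below uses SQLite's `group_concat` instead so it runs self-contained. The schema and UUID values are hypothetical illustrations of the `card_type` / `parent_uuid` structure described above.

```python
import sqlite3

# Hypothetical card_type hierarchy: each row points to its parent via parent_uuid.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE card_type (uuid TEXT, parent_uuid TEXT, name TEXT);
    INSERT INTO card_type VALUES
        ('u1', NULL, 'root'),
        ('u2', 'u1', 'child-a'),
        ('u3', 'u1', 'child-b'),
        ('u4', 'u2', 'grandchild');
""")

# Recursive CTE walks the tree top-down; the correlated subquery aggregates
# each node's direct children (group_concat here; array_agg in PostgreSQL).
rows = conn.execute("""
    WITH RECURSIVE tree AS (
        SELECT uuid, 0 AS depth FROM card_type WHERE parent_uuid IS NULL
        UNION ALL
        SELECT c.uuid, t.depth + 1
        FROM card_type c JOIN tree t ON c.parent_uuid = t.uuid
    )
    SELECT t.uuid, t.depth,
           (SELECT group_concat(c.uuid) FROM card_type c
            WHERE c.parent_uuid = t.uuid) AS children
    FROM tree t
    ORDER BY t.depth, t.uuid
""").fetchall()

for r in rows:
    print(r)
# ('u1', 0, 'u2,u3'), ('u2', 1, 'u4'), ('u3', 1, None), ('u4', 2, None)
```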
Multi‑Dialect Compatibility
Based on the BIRD‑SQL development set, the team migrated the original SQLite schema to PostgreSQL, MySQL, SQL Server, and Oracle using Navicat, then manually verified structural correctness, data consistency across databases, and preservation of original data integrity.
Data Quality Assurance
Annotation is handled by two groups: a base group of 10 vetted SQL experts and an arbitration group of 3 senior database scientists. Quality is enforced through a three‑stage cross‑validation process: expanded test cases, red‑team error injection, and final expert adjudication.
Conclusion
State‑of‑the‑art models (e.g., o3‑mini) achieve only a 38.87% success rate on PostgreSQL tasks, highlighting the benchmark's difficulty.
Reasoning‑oriented LLMs outperform general‑purpose models by 6.13% on average on PostgreSQL tasks and by 8.03% across dialects; the BIRD‑FIXER fine‑tuning framework can lift smaller models above top‑tier LLMs.
Query‑type analysis shows DML/DQL tasks remain the hardest for all models.
The benchmark establishes a new realistic standard for evaluating SQL diagnostic capabilities of LLMs.
Future Updates
We will continue to introduce high‑quality datasets—stay tuned.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year on 10/24 (Programmers' Day), and continuously operates and maintains them.