What Makes BiomedSQL and LogicCat the Toughest Text‑to‑SQL Benchmarks for LLMs?
BiomedSQL and LogicCat are two newly released Text‑to‑SQL datasets that challenge large language models with complex biomedical reasoning, multi‑step logical inference, and domain‑specific knowledge. Both offer detailed analyses of query types, scientific reasoning categories, and performance gaps that expose current LLM limitations.
Preface
While researching AI4SQL/AI4DB/DB4AI products, we found that improving SQL capabilities depends largely on high‑quality datasets, and that building training and evaluation sets requires data synthesis. To help developers obtain these resources quickly, we have compiled a list of publicly available Text2SQL/NL2SQL datasets.
In the previous article we introduced the BIRD‑CRITIC dataset, which showcases the current DBA advantage over AI in SQL tasks. This article continues with two research‑oriented datasets: BiomedSQL and LogicCat.
BiomedSQL
BiomedSQL is a benchmark designed to evaluate large language models on scientific table reasoning tasks. It contains carefully selected question / SQL query / answer triples covering a variety of biomedical and SQL reasoning types, requiring models to apply implicit scientific standards rather than merely performing syntactic translation.
Dataset Analysis
The authors annotated all 68,000 question‑query pairs with SQL operation types and biomedical reasoning categories. Simple operations such as SELECT, ORDER BY, and arithmetic require shallow syntax parsing and are handled well by current LLMs. In contrast, multi‑condition filtering, threshold judgments, table joins, and similarity searches demand multi‑step logical composition, implicit pattern linking, or pattern‑based retrieval, posing greater challenges.
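To make this difficulty gap concrete, here is a minimal sketch using Python's built‑in `sqlite3` module and hypothetical `drugs`/`targets` tables (not BiomedSQL's real schema): the first query is a shallow single‑table SELECT with ORDER BY, while the second combines a table join, multi‑condition filtering, and a LIKE‑based pattern search.

```python
import sqlite3

# Hypothetical tables for illustration only; BiomedSQL's actual schema differs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE drugs (drug_id INTEGER, name TEXT, phase INTEGER);
    CREATE TABLE targets (drug_id INTEGER, gene TEXT);
    INSERT INTO drugs VALUES (1, 'levodopa', 4), (2, 'exenatide', 2);
    INSERT INTO targets VALUES (1, 'DDC'), (2, 'GLP1R');
""")

# Shallow syntax parsing: a single-table SELECT with ORDER BY.
simple = conn.execute(
    "SELECT name FROM drugs ORDER BY phase DESC"
).fetchall()

# Harder: join + multi-condition filter + pattern-based retrieval,
# which requires multi-step logical composition.
complex_q = conn.execute("""
    SELECT d.name, t.gene
    FROM drugs d JOIN targets t ON d.drug_id = t.drug_id
    WHERE d.phase >= 4 AND t.gene LIKE 'D%'
""").fetchall()
print(simple)    # → [('levodopa',), ('exenatide',)]
print(complex_q) # → [('levodopa', 'DDC')]
```

The second query is where current LLMs start to falter: it must compose join keys, numeric conditions, and string patterns in one statement rather than translate a single clause.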
Scientific Reasoning Classification
Implicit Scientific Norms: Queries often involve domain‑specific concepts (e.g., “significant SNPs”) that imply statistical thresholds (e.g., p < 5×10⁻⁸) not explicitly stated in the schema, requiring the model to infer them.
Integration of Missing Contextual Knowledge: Experts combine auxiliary data such as drug approval status or trial phase even when the question does not mention them directly, demanding deeper contextual understanding.
Execution of Complex Multi‑hop Reasoning Workflows: Many questions require traversing multiple tables (e.g., “Which genes associated with Parkinson’s disease are most significantly expressed in which tissues?”), a multi‑step process that current LLMs struggle to translate into executable SQL.
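The implicit‑norm problem can be sketched with `sqlite3` and a hypothetical `gwas_associations` table (an assumption for illustration, not BiomedSQL's real schema): the question “which SNPs are significant?” never states a threshold, yet a correct SQL translation must inject the genome‑wide significance cutoff p < 5×10⁻⁸.

```python
import sqlite3

# Hypothetical GWAS table; column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gwas_associations (snp_id TEXT, gene TEXT, p_value REAL)")
conn.executemany(
    "INSERT INTO gwas_associations VALUES (?, ?, ?)",
    [("rs429358", "APOE", 1e-50),
     ("rs7412",   "APOE", 3e-9),
     ("rs123456", "TP53", 0.04)],  # nominally, but not genome-wide, significant
)

# Question: "Which SNPs are significant?"
# The threshold p < 5e-8 is an implicit scientific norm absent from the schema;
# the model must supply it itself to produce a correct query.
GENOME_WIDE_THRESHOLD = 5e-8
rows = conn.execute(
    "SELECT snp_id FROM gwas_associations WHERE p_value < ?",
    (GENOME_WIDE_THRESHOLD,),
).fetchall()
print([r[0] for r in rows])  # → ['rs429358', 'rs7412']
```

A model that translates only the surface syntax has no WHERE clause to emit; the filter exists purely in domain convention.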
Summary
BiomedSQL is the first large‑scale benchmark specifically targeting scientific reasoning in the biomedical domain for Text‑to‑SQL generation. Experiments show that even state‑of‑the‑art LLMs achieve far lower execution accuracy and answer quality than domain experts, highlighting critical limitations and providing a rigorous testbed for future research.
LogicCat
LogicCat is a challenging Text‑to‑SQL dataset designed to test complex reasoning abilities, including physical, arithmetic, commonsense, and hypothetical inference. It comprises 4,038 English questions with corresponding SQL queries and 12,114 step‑by‑step reasoning annotations across 45 different database domains.
Experiments show that the current best models achieve only 14.96% execution accuracy on this dataset, but performance rises to 33.96% when chain‑of‑thought annotations are introduced, underscoring its potential to drive reasoning‑driven SQL generation research.
Dataset Analysis
LogicCat includes four reasoning types:
Physical Knowledge Reasoning: Solves problems requiring physics formulas and unit‑aware calculations, testing multi‑step application of physical principles.
Mathematical Logic Reasoning: Involves arithmetic, logical, and analytical thinking with dense computation steps.
Commonsense Reasoning: Requires models to infer missing real‑world details to generate logically consistent SQL, aiding zero‑shot generalization and interpretability.
Hypothetical Reasoning: Evaluates the ability to perform counterfactual and imaginative thinking in unseen scenarios, challenging models with complex conditional relationships.
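As a minimal sketch of what physical knowledge reasoning demands, assume a hypothetical `objects` table (not from LogicCat itself): answering “which object has the highest kinetic energy?” requires the model to supply the formula KE = ½mv² and keep units consistent (kg and m/s yield joules), none of which appears in the schema.

```python
import sqlite3

# Hypothetical table; columns encode units in their names (kg, m/s).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (name TEXT, mass_kg REAL, speed_m_s REAL)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?)",
    [("cart", 2.0, 3.0),    # KE = 0.5 * 2.0 * 3^2  = 9 J
     ("ball", 0.5, 10.0)],  # KE = 0.5 * 0.5 * 10^2 = 25 J
)

# The formula KE = 0.5 * m * v^2 is world knowledge the model must inject;
# the schema only stores raw mass and speed columns.
row = conn.execute("""
    SELECT name, 0.5 * mass_kg * speed_m_s * speed_m_s AS ke_joules
    FROM objects
    ORDER BY ke_joules DESC
    LIMIT 1
""").fetchone()
print(row)  # → ('ball', 25.0)
```

This is the pattern LogicCat probes: the SQL itself is simple once the physics is known, but deriving the expression is the hard step.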
Summary
Existing datasets such as Spider and BIRD provide challenges for SQL syntax parsing and execution but lack coverage of deep logical reasoning and domain‑specific knowledge. LogicCat addresses this gap by incorporating physical, mathematical, commonsense, and hypothetical reasoning tasks, thereby advancing Text‑to‑SQL models toward more sophisticated logical inference and knowledge integration.
Future Updates
We will continue to introduce high‑quality datasets; stay tuned.
References
[1] BIRD‑CRITIC Paper: https://arxiv.org/html/2506.18951v2
[2] BiomedSQL: https://github.com/NIH-CARD/biomedsql
[3] LogicCat: https://github.com/Ffunkytao/LogicCat
[4] BiomedSQL Paper: https://arxiv.org/html/2505.20321
[5] LogicCat Paper: https://arxiv.org/pdf/2505.18744
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade open‑source tools and services for MySQL, releasing a premium open‑source component each year on 1024 and continuously operating and maintaining it.
