How Xiaomi’s DataAgent Harness Secured Third Place in the Global Text‑to‑SQL BIRD Benchmark

This article discusses Xiaomi DataAgent's third-place ranking on the global BIRD Text-to-SQL benchmark, analyzes challenges such as model hallucination, missing business knowledge, and complex multi-table joins, and explains how a semantic harness addresses these problems to enable reliable enterprise data querying.

DataFunTalk

Background: BIRD Benchmark

The BIRD (Big Bench for Large-scale Database Grounded Text-to-SQL Evaluation) benchmark is one of the most demanding public evaluations for Text-to-SQL systems. It contains 95 real databases spanning 37 vertical domains, with noisy data, missing values, and domain-specific business knowledge. The primary metric is Execution Accuracy (EX), which counts a prediction as correct only when the generated SQL executes and returns the same result as the reference SQL.
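To make the metric concrete, here is a minimal sketch of an execution-accuracy check in Python with SQLite. It is not the official BIRD evaluation script: the function names, the toy `employee` table, and the choice to compare results as order-insensitive multisets are all assumptions made for illustration.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Run a query and return its rows, or None if the SQL fails to execute."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries run and return the same multiset of rows."""
    predicted = run_query(conn, predicted_sql)
    gold = run_query(conn, gold_sql)
    if predicted is None or gold is None:
        return False
    # Assumption: row order is ignored, duplicates are not.
    return Counter(predicted) == Counter(gold)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, p, g) for p, g in pairs)
    return hits / len(pairs)


# Toy database, invented for this example only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employee (id INTEGER, dept TEXT);
    INSERT INTO employee VALUES (1, 'ITP'), (2, 'ITP'), (3, 'Sales');
    """
)

gold = "SELECT COUNT(*) FROM employee WHERE dept = 'ITP'"
pairs = [
    # Different SQL text, same result: counted as correct.
    ("SELECT COUNT(id) FROM employee WHERE dept = 'ITP'", gold),
    # Executes without error but answers a different question: wrong.
    ("SELECT COUNT(*) FROM employee", gold),
    # References a column that does not exist, so execution fails: wrong.
    ("SELECT COUNT(*) FROM employee WHERE department = 'ITP'", gold),
]
ex = execution_accuracy(conn, pairs)  # one of three predictions matches
```

The second pair is the case the article keeps returning to: the SQL runs cleanly, yet the answer is wrong.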

Observed Bottlenecks in Direct LLM‑to‑SQL

Hallucination (missing background knowledge): Large language models (LLMs) have never seen company-specific entities, rules, or column naming conventions. For example, the query “How many employees does Xiaomi have?” is answered by guessing generic numbers because the model does not know that the relevant table stores current employee counts.

Missing business rules: Queries that depend on domain constraints (e.g., “ITP department budget this year?”) need a department filter such as WHERE department = 'ITP'. The model cannot infer that “ITP” maps to the “Basic Technology Platform” department, so a filter on the literal abbreviation matches nothing and the result is empty.

Complex root-cause analysis: Multi-step analytical questions such as “Which four-level teams caused the December profit-margin swing and by how much?” force the model to decompose the problem into many sequential SQL statements. A naive LLM setup would issue one API call per step, inflating latency by more than 3× and compounding error rates.

Table-level complexity: Real-world schemas contain hundreds of tables, inconsistent naming, no foreign-key constraints, and mixed-case column values (e.g., PHONE vs phone). This leads to silent join failures or duplicated rows, as illustrated by the “Samsung employee” example where the join returns zero rows without error.

Stability across model upgrades: Execution success does not guarantee answer correctness, and regression testing must be performed manually after each model update.

Permission and compliance: Enterprise data often requires row-level and table-level isolation (e.g., non-executives cannot query executive-only financial tables). Prompt engineering alone cannot enforce these security boundaries.
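The mixed-case problem from the table-level complexity item is easy to reproduce. The sketch below uses an invented `device` table whose `category` column mixes PHONE, phone, and Phone; it shows a filter that runs without error yet matches nothing, next to a normalized comparison. The table, column, and data are assumptions for illustration, not Xiaomi's schema.

```python
import sqlite3

# Toy table with inconsistent casing in a value column (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE device (sku TEXT, category TEXT);
    INSERT INTO device VALUES
        ('s1', 'PHONE'), ('s2', 'phone'), ('s3', 'Phone'), ('s4', 'tablet');
    """
)


def count_where(sql, params=()):
    """Run a COUNT(*) query and return the single integer it produces."""
    return conn.execute(sql, params).fetchone()[0]


# Exact comparison: valid SQL, no error, but 'phones' matches none of the
# stored spellings, so the count is silently zero.
naive = count_where(
    "SELECT COUNT(*) FROM device WHERE category = ?", ("phones",)
)

# Exact comparison with one plausible spelling finds only part of the data.
partial = count_where(
    "SELECT COUNT(*) FROM device WHERE category = ?", ("phone",)
)

# Normalizing both sides to lower case finds every spelling.
normalized = count_where(
    "SELECT COUNT(*) FROM device WHERE LOWER(category) = LOWER(?)", ("Phone",)
)
```

Neither the zero count nor the partial count raises an error, which is why this class of failure goes unnoticed without knowledge of how values are actually stored.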

DataAgent Harness: Architectural Remedy

The DataAgent harness adds a semantic layer that injects domain knowledge, normalizes schema names, and enforces correct join semantics. Its workflow is:

NL query → Semantic parser (adds business constraints) → Controlled SQL sequence → Trusted execution engine

Key capabilities:

Injects company‑specific entities and business rules (e.g., mapping ITP to the correct department ID).

Normalizes column names and resolves case‑sensitivity, preventing silent join failures.

Transforms a high‑level question into a series of validated SQL statements, handling missing values and permission checks in a single API call.

Centralizes execution in a trusted engine, which reduces latency and keeps errors from propagating between steps.
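The capabilities above can be pictured with a small sketch of a semantic layer that resolves a business alias and attaches predefined business predicates before any SQL is executed. This is an illustration under stated assumptions, not DataAgent's actual design: the dictionaries, the `orders` table, the `lost_order_volume` metric key, and the function names are hypothetical. Only the ITP mapping and the two predicates come from examples in this article.

```python
from dataclasses import dataclass, field

# Hypothetical knowledge base: user-facing alias -> stored department name.
DEPARTMENT_ALIASES = {"itp": "Basic Technology Platform"}

# Hypothetical metric definitions carrying their mandatory business filters.
METRIC_RULES = {
    "lost_order_volume": {
        "table": "orders",
        "select": "COUNT(*)",
        "predicates": [
            "loan_completion_date IS NOT NULL",
            "is_self_operated = 0",
        ],
    }
}


@dataclass
class ControlledQuery:
    """SQL text plus bound parameters, ready for an execution engine."""
    sql: str
    params: list = field(default_factory=list)


def resolve_department(term):
    """Map an alias to the stored name; unknown terms pass through unchanged."""
    return DEPARTMENT_ALIASES.get(term.strip().lower(), term)


def build_metric_query(metric, department=None):
    """Assemble SQL from a metric definition instead of free-form generation."""
    rule = METRIC_RULES[metric]  # unknown metrics raise KeyError on purpose
    predicates = list(rule["predicates"])
    params = []
    if department is not None:
        # User-supplied values are bound as parameters, never interpolated.
        predicates.append("department = ?")
        params.append(resolve_department(department))
    sql = (
        f"SELECT {rule['select']} FROM {rule['table']} WHERE "
        + " AND ".join(predicates)
    )
    return ControlledQuery(sql, params)


query = build_metric_query("lost_order_volume", department="ITP")
```

The point of the design is that the model only has to pick a metric and its arguments; the filters a human analyst would know to add are attached by the layer rather than left to generation.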

Empirically, the base LLM alone ranks roughly 30th to 40th on the BIRD leaderboard, whereas the combined Model + Harness (Xiaomi DataAgent) achieved third place globally.

Concrete Failure Cases Demonstrated

Entity hallucination: “How many employees does Xiaomi have?” returns a guessed number because the model does not recognize the employee table.

Incorrect enum handling: Querying “Samsung employees” returns zero rows because the model does not know how the table stores the value Samsung as a string, so its filter matches nothing.

Business rule omission: “Last week’s lost-order volume?” requires the filter WHERE loan_completion_date IS NOT NULL AND is_self_operated = 0. The model omits these predicates, producing a mismatched answer.

Division-type metrics: Calculating profit-margin changes (profit / revenue) cannot be expressed reliably without a dedicated analytical engine; the model produces plausible but mathematically incorrect results.

Large-scale drill-down: “Analyze which 100k SKUs caused the November gross-margin drop” cannot be answered by feeding all the underlying rows to the LLM. The model resorts to LIMIT 50, yielding only a partial view.

Join mismatches: Two tables store user_id as ID_123 and 123 respectively. The model generates A.user_id = B.user_id, which raises no error but matches nothing and returns zero rows.
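The join mismatch in the last case can be reproduced in a few lines. The sketch below builds two toy tables whose keys differ only by the ID_ prefix, runs the naive join, and then a join that normalizes the key first. The table contents and the use of REPLACE to strip the prefix are assumptions for illustration; a real harness would take the normalization rule from its schema metadata.

```python
import sqlite3

# Two toy tables keyed by the same user, stored in different formats.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (user_id TEXT, amount INTEGER);
    CREATE TABLE b (user_id TEXT, city TEXT);
    INSERT INTO a VALUES ('ID_123', 10), ('ID_456', 20);
    INSERT INTO b VALUES ('123', 'Beijing'), ('456', 'Wuhan');
    """
)

# Naive join: syntactically valid, raises nothing, and matches no rows.
naive_rows = conn.execute(
    "SELECT a.user_id, b.city FROM a JOIN b ON a.user_id = b.user_id"
).fetchall()

# Normalized join: strip the 'ID_' prefix before comparing the keys.
fixed_rows = conn.execute(
    """
    SELECT a.user_id, b.city
    FROM a
    JOIN b ON REPLACE(a.user_id, 'ID_', '') = b.user_id
    ORDER BY a.user_id
    """
).fetchall()
```

Because the naive query succeeds from the database's point of view, only a check on the result, or prior knowledge of the key formats, reveals the problem.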

Production‑Level Challenges

Even when SQL executes successfully, the semantic correctness depends on business knowledge that only humans can verify. This creates three production‑grade problems:

Online accuracy can only be measured by costly manual labeling.

Model upgrades may regress performance; there is no automated regression suite.

When a bad case is discovered, fixing it requires manual prompt engineering and developer intervention, leading to slow iteration cycles.
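One way to picture the missing regression suite is a set of golden cases, each pairing a question with its human-verified result, replayed against every new model or harness version. The sketch below is a hypothetical illustration: `run_regression`, the `budget` table, and the two stand-in "versions" are invented, and a real suite would call the actual Text-to-SQL system instead of hard-coded functions.

```python
import sqlite3
from collections import Counter


def run_regression(conn, generate_sql, golden_cases):
    """Return the questions whose generated SQL no longer gives the expected rows."""
    failures = []
    for question, expected_rows in golden_cases:
        try:
            rows = conn.execute(generate_sql(question)).fetchall()
        except sqlite3.Error:
            rows = None  # SQL that does not execute is a failure as well
        # Compare as multisets so row order does not matter.
        if rows is None or Counter(rows) != Counter(expected_rows):
            failures.append(question)
    return failures


# Toy data, invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE budget (department TEXT, amount INTEGER);
    INSERT INTO budget VALUES ('Basic Technology Platform', 100), ('Sales', 50);
    """
)

golden_cases = [("ITP department budget this year?", [(100,)])]


def version_a(question):
    """Stand-in for a version that resolves the ITP alias correctly."""
    return (
        "SELECT SUM(amount) FROM budget "
        "WHERE department = 'Basic Technology Platform'"
    )


def version_b(question):
    """Stand-in for a regressed version that filters on the raw abbreviation."""
    return "SELECT SUM(amount) FROM budget WHERE department = 'ITP'"


failures_a = run_regression(conn, version_a, golden_cases)
failures_b = run_regression(conn, version_b, golden_cases)
```

Here the regressed version still executes successfully; it is only the comparison against the verified result that flags it.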

Compliance and Security

Enterprise environments demand:

Table‑level isolation (e.g., regular staff cannot access executive‑only financial tables).

Row‑level isolation (different departments see different subsets of the same table).

Robust injection protection to prevent malicious queries from bypassing permission checks.

These requirements cannot be satisfied by prompt tuning alone; they must be enforced by the system layer of the harness.
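As a rough illustration of enforcement in the system layer rather than in the prompt, the sketch below checks table access against an allow-list, appends a row-level filter as a bound parameter, and only accepts identifiers it already knows. Every name in it (roles, tables, columns, `build_authorized_query`) is a hypothetical stand-in, and the rule that executives bypass the row filter is an assumption made to keep the example short.

```python
import sqlite3

# Hypothetical policy: which roles may read which tables.
TABLE_ACCESS = {
    "staff": {"sales"},
    "executive": {"sales", "exec_finance"},
}
# Known columns per table; identifiers outside this list are rejected.
TABLE_COLUMNS = {
    "sales": {"department", "amount"},
    "exec_finance": {"item", "amount"},
}
# Tables whose rows are isolated by department.
ROW_FILTER_COLUMN = {"sales": "department"}


class AccessDenied(Exception):
    """Raised when a request violates the table or column policy."""


def build_authorized_query(role, department, table, columns):
    """Return (sql, params) with permission checks applied by the system."""
    if table not in TABLE_ACCESS.get(role, set()):
        raise AccessDenied(f"role {role!r} may not read table {table!r}")
    unknown = set(columns) - TABLE_COLUMNS[table]
    if unknown:
        raise AccessDenied(f"unknown columns: {sorted(unknown)}")
    # Identifiers come only from the allow-lists above.
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    params = []
    filter_column = ROW_FILTER_COLUMN.get(table)
    if filter_column and role != "executive":
        # The row-level filter is added by the system and bound as a parameter.
        sql += f" WHERE {filter_column} = ?"
        params.append(department)
    return sql, params


# Toy data, invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (department TEXT, amount INTEGER);
    CREATE TABLE exec_finance (item TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('ITP', 10), ('Retail', 20);
    INSERT INTO exec_finance VALUES ('bonus_pool', 999);
    """
)

sql, params = build_authorized_query(
    "staff", "ITP", "sales", ["department", "amount"]
)
staff_rows = conn.execute(sql, params).fetchall()
```

A staff request for exec_finance raises AccessDenied before any SQL is built, so the outcome does not depend on how the question was phrased.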

Future Roadmap

Low-cost evaluation-iteration loop: Build a closed-loop pipeline that automatically generates test cases, runs them through the harness, and feeds back failures for continuous improvement.

Self-evolving semantic layer: Leverage user interactions to incrementally enrich the domain knowledge base, so the harness becomes more aware of business concepts over time.

Scalable governance: Integrate fine-grained permission checks and compliance validation into the execution engine, removing reliance on ad-hoc prompts.

The current third‑place result validates the approach; the next steps aim to turn the harness into a production‑grade, self‑learning component that bridges the gap between generic LLM capabilities and reliable enterprise data intelligence.
