How Tencent Games Boosted AI‑Generated SQL Accuracy to 89% with a Lakehouse Architecture

Tencent Games tackled the low accuracy of AI‑generated SQL in production by combining large language models with a StarRocks lakehouse. By introducing a semantic layer, asynchronous materialized views, and a multi‑agent framework, the team raised one‑shot SQL correctness from 51% to 89% and cut average delivery time from 2 hours to 0.33 hours.


Challenges of AI‑Generated SQL in Production

In a large‑scale game data platform processing more than 30,000 data‑extraction requests per year, manual SQL authoring is a major bottleneck. Large language models (LLMs) achieve over 90% accuracy on public benchmarks such as Spider, but one‑shot accuracy drops to roughly 65% on benchmarks that better reflect realistic workloads (e.g., BIRD). The degradation is caused by incomplete or ambiguous data‑asset information, which leads the model to misunderstand business intent.

AI + Lakehouse Data‑Asset Paradigm

The solution replaces the traditional data warehouse with a lakehouse architecture built on StarRocks. A semantic layer substitutes for the physical ADS/DWS layers, allowing rapid configuration of new datasets, metrics, and dimensions. Real‑time ingestion pipelines feed data directly into the lake, while transparent query acceleration satisfies low‑latency requirements.

Semantic‑Layer Asset Types

Existing dashboards (≈70% of assets) – the LLM can directly use predefined metrics.

Feature‑only assets (≈20%) – the LLM generates missing metrics from available feature columns.

Purely new assets (≈10%) – users provide business context; the LLM creates both features and metrics.

This taxonomy gives the LLM a structured view of the data catalog, reducing ambiguity during prompt construction.
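The taxonomy above can be encoded as a routing step that runs before prompt construction, deciding how much the LLM must synthesize for each asset type. A minimal sketch (the `AssetType` enum, `Asset` record, and asset names are illustrative, not from the original system):

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class AssetType(Enum):
    DASHBOARD = auto()     # ~70%: predefined metrics already exist
    FEATURE_ONLY = auto()  # ~20%: metrics must be derived from feature columns
    NEW = auto()           # ~10%: user supplies business context first

@dataclass
class Asset:
    name: str
    asset_type: AssetType
    metrics: list = field(default_factory=list)
    features: list = field(default_factory=list)

def prompt_strategy(asset: Asset) -> str:
    """Pick how much the LLM must synthesize for a given asset type."""
    if asset.asset_type is AssetType.DASHBOARD:
        return "reuse predefined metrics: " + ", ".join(asset.metrics)
    if asset.asset_type is AssetType.FEATURE_ONLY:
        return "derive metrics from features: " + ", ".join(asset.features)
    return "ask user for business context, then create features and metrics"

dashboard = Asset("daily_revenue", AssetType.DASHBOARD, metrics=["revenue", "arpu"])
print(prompt_strategy(dashboard))
```

The routing result becomes part of the prompt, so the model knows whether it is reusing, deriving, or creating metrics.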

Cost‑Efficiency Optimization Engine

To handle high‑frequency, slow‑running queries, an engine leverages StarRocks’ asynchronous materialized view (MV) rewriting and multi‑refresh mechanisms:

Identify hot SQL statements (high query count, long execution time).

Group statements with identical logical plans into a single ASYNC MATERIALIZED VIEW definition.

StarRocks automatically rewrites incoming queries to read from the MV when the view is up‑to‑date.

Multiple refresh strategies (incremental, full, on‑schedule) keep the MV synchronized with source tables while minimizing I/O.

Each MV becomes a “quality asset” that can be reused across requests, eliminating redundant computation and lowering CPU/memory consumption.
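The steps above can be sketched end to end: fingerprint queries so that structurally identical statements collide, keep groups that are both frequent and slow, and emit one async‑MV definition per group. This is a simplified sketch; the fingerprint here is a crude textual stand‑in for a real logical‑plan hash, and the thresholds and refresh schedule are illustrative:

```python
from collections import defaultdict
import hashlib
import re

def plan_fingerprint(sql: str) -> str:
    """Crude stand-in for a logical-plan hash: normalize whitespace, case,
    and literal values so structurally identical queries collide."""
    normalized = re.sub(r"\s+", " ", sql.strip().lower())
    normalized = re.sub(r"\b\d+\b", "?", normalized)   # fold numeric literals
    normalized = re.sub(r"'[^']*'", "?", normalized)   # fold string literals
    return hashlib.sha1(normalized.encode()).hexdigest()

def find_hot_groups(query_log, min_count=100, min_avg_ms=5000):
    """query_log: iterable of (sql, exec_ms) pairs. Returns hot groups keyed
    by plan fingerprint, each a candidate for one async MV."""
    groups = defaultdict(list)
    for sql, exec_ms in query_log:
        groups[plan_fingerprint(sql)].append((sql, exec_ms))
    hot = {}
    for fp, runs in groups.items():
        avg_ms = sum(ms for _, ms in runs) / len(runs)
        if len(runs) >= min_count and avg_ms >= min_avg_ms:
            hot[fp] = runs[0][0]  # representative statement for the MV body
    return hot

def mv_ddl(name: str, select_sql: str) -> str:
    """Wrap a hot SELECT in a StarRocks async-MV definition; the hourly
    refresh schedule is an illustrative choice."""
    return (f"CREATE MATERIALIZED VIEW {name} "
            f"REFRESH ASYNC EVERY (INTERVAL 1 HOUR) AS {select_sql};")
```

Once the MV exists and is fresh, StarRocks' transparent rewrite means none of the original queries need to change.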

Multi‑Agent Framework

A proprietary local model enriches the generic LLM with enterprise‑specific knowledge:

Knowledge‑graph integration supplies entity relationships (e.g., game‑level hierarchies, user‑segment definitions).

Semantic understanding modules map natural‑language intents to catalog objects.

Retrieval‑ranking pipelines select the most relevant assets before prompting the LLM.

The overall workflow is orchestrated by an agent architecture that decomposes a data‑request job into discrete tasks:

Intent Clarification Agent – interacts with the user to refine ambiguous requirements.

SQL Generation Agent – invokes the LLM (augmented with retrieved assets) to produce a candidate SQL statement.

Human‑In‑The‑Loop Review – optional step where data engineers validate or edit the generated SQL.

Execution Agent – submits the final SQL to StarRocks, automatically applying MV rewriting.

This hybrid human‑AI approach ensures high fidelity while preserving efficiency.
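The four agents above compose into a single pipeline. A minimal sketch, with the LLM, user dialog, reviewer, and StarRocks execution all injected as callables (the toy ambiguity check and all function names are illustrative, not the production interfaces):

```python
from typing import Callable, Optional

def clarify_intent(request: str, ask_user: Callable[[str], str]) -> str:
    """Intent Clarification Agent: resolve ambiguity before generation."""
    if "last month" in request:  # toy example of an ambiguity trigger
        request += " (" + ask_user("Calendar month or trailing 30 days?") + ")"
    return request

def generate_sql(intent: str, assets: list, llm: Callable[[str], str]) -> str:
    """SQL Generation Agent: prompt the LLM with intent plus retrieved assets."""
    prompt = f"Assets: {', '.join(assets)}\nRequest: {intent}\nSQL:"
    return llm(prompt)

def run_pipeline(request, ask_user, llm, execute, assets,
                 review: Optional[Callable[[str], str]] = None):
    """Orchestrate: clarify -> generate -> (optional human review) -> execute."""
    intent = clarify_intent(request, ask_user)
    sql = generate_sql(intent, assets, llm)
    if review is not None:   # human-in-the-loop gate
        sql = review(sql)
    return execute(sql)      # StarRocks applies MV rewriting at this step
```

Keeping the review step optional lets routine requests flow straight through while unusual ones still get an engineer's eyes.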

uDataChat: Conversational Data Assistant

The integrated system, named uDataChat, provides a chat interface where users describe their data needs in natural language. The backend performs intent clarification, asset retrieval, LLM‑based SQL synthesis, and execution through the cost‑efficiency optimization engine. Results are returned as a query result set or a downloadable file.

Results and Impact

Average data‑delivery latency reduced from 2 hours (manual) to 0.33 hours, about 20 minutes (self‑service).

User self‑service delivery rate increased by 70%.

Asset reuse rate grew from 70% to 77% thanks to MV‑based sharing.

One‑shot SQL accuracy improved from 51% at launch to 89% after engineering, meeting production thresholds.

Future Directions

Planned extensions include:

Integrating the AI + lakehouse stack with additional BI and analytics tools.

Further LLM engineering such as fine‑tuning on domain‑specific query logs.

Expanding the knowledge‑graph to cover more game‑specific entities and cross‑business metrics.

Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
