How RAG Turns Natural Language Queries into Accurate SQL for Data Platforms
This article explains how Retrieval‑Augmented Generation (RAG) combines vector databases with large language models to let non‑technical users ask natural‑language questions and receive precise SQL statements, detailing the workflow, architecture, chunking methods, performance gains, and remaining challenges.
Background
In the era of digital transformation, data has become a core asset for enterprises. Bilibili's big‑member center relies on a data intelligence platform to process massive member data, support service optimization, and meet diverse business insight needs.
Problem: Traditional SQL Barriers
Business users without SQL expertise struggle to retrieve data quickly. For example, an operations analyst may need to know the OGV consumption of newly subscribed members within a specific period but cannot write the required complex SQL.
LLM Limitations
Although large language models (LLMs) excel at natural language processing, direct SQL generation suffers from hallucinations: incorrect field names, irrelevant table joins, or entirely fabricated schema elements, leading to inaccurate or unexecutable queries.
RAG Solution
Retrieval‑Augmented Generation (RAG) mitigates these issues by integrating a vector database with the LLM. External knowledge—data models, business rules, historical query examples—is stored as high‑dimensional embeddings, enabling semantic retrieval of relevant context before the LLM generates SQL.
RAG Workflow
The workflow consists of two main stages:
Document Pre‑processing & Vector Store Construction
Unstructured Loader : Parses various file formats (.docx, .xlsx, .pdf) into plain text streams.
Data Chunking : Splits text into semantically coherent chunks (CHUNKS) based on paragraphs or token limits.
Embedding : Uses a pretrained text‑embedding model to convert each chunk into a high‑dimensional vector.
Vector Database : Persists embeddings and builds an index for fast similarity search.
Question‑Answer Inference
Question Embedding : Converts the user's natural‑language query into a vector.
Retrieval : Searches the vector store for the most similar chunks, returning relevant context.
Prompt Construction : Combines the user question with retrieved context using a template (Instruction + Context + Question).
LLM Inference : The LLM processes the prompt, performs in‑context learning, and generates an accurate SQL statement.
Technical Architecture
The platform is organized into layered services:
Foundation Layer : Configuration center, logging & monitoring, permission management.
Data Layer : User‑side data (profiles, query history), business data (tables, knowledge base), and metadata (schema, constraints).
Storage Layer : Business relational databases, cache for frequent query results, and vector store for embeddings.
Service Layer : Query parsing service, retrieval service, LLM inference service, and SQL optimization service.
Application Layer : End‑user query UI, SDK integration for analytics tools, and downstream insight generation.
Chunking Strategies
Three common chunking methods are compared:
Fixed‑length splitting : Simple but may break semantic units.
Sliding‑window splitting : Overlaps chunks to preserve context at the cost of redundancy.
Semantic splitting : Uses punctuation and headings to keep logical units intact.
The article adopts a novel parent‑child chunking mode: parent chunks retain paragraph‑level context, while child chunks provide fine‑grained sentence‑level retrieval. This dual‑layer approach improves both relevance and completeness.
Evaluation
Testing shows a 30% accuracy boost in QA tasks and an overall query‑to‑result latency reduction from hours (manual SQL) to minutes (natural‑language input). The platform achieves an 85%+ accuracy rate after filtering queries within the knowledge‑base scope.
Challenges & Future Work
Current bottlenecks include retrieval latency in large vector stores and LLM inference time for complex queries. Planned optimizations involve more efficient vector indexing, sharding, caching, and model compression techniques such as distillation and quantization. Testing workflows are being automated with AI agents to reduce manual review effort.
Metadata Example (JSON)
{
"table_name": "xxx_order_a_d",
"description": "xxx订单表",
"partition_strategy": {"field": "log_date"},
"business_time": {"field": "pay_time", "format": "substr(pay_time,1,10)", "date_format": "yyyy-MM-dd"},
"fields": [
{"name": "order_id", "type": "varchar", "description": "业务订单唯一标识"},
{"name": "pid", "type": "bigint", "description": "套餐产品标识"},
{"name": "order_amt", "type": "double", "description": "订单金额(单位:元)", "is_metric": true, "aggregation": "SUM()"},
{"name": "pay_type", "type": "bigint", "description": "支付方式", "enum_mapping": {"1": "支付方式A", "2": "支付方式B", "3": "支付方式C"}}
]
}Conclusion
By leveraging RAG, the platform enables non‑technical business users to obtain accurate SQL results through natural language, dramatically shortening data‑to‑insight cycles and supporting both simple and complex analytical queries.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
