How RAG Turns Natural Language Queries into Accurate SQL for Data Platforms

This article explains how Retrieval‑Augmented Generation (RAG) combines vector databases with large language models to let non‑technical users ask natural‑language questions and receive precise SQL statements, detailing the workflow, architecture, chunking methods, performance gains, and remaining challenges.

dbaplus Community
dbaplus Community
dbaplus Community
How RAG Turns Natural Language Queries into Accurate SQL for Data Platforms

Background

In the era of digital transformation, data has become a core asset for enterprises. Bilibili's big‑member center relies on a data intelligence platform to process massive member data, support service optimization, and meet diverse business insight needs.

Problem: Traditional SQL Barriers

Business users without SQL expertise struggle to retrieve data quickly. For example, an operations analyst may need to know the OGV consumption of newly subscribed members within a specific period but cannot write the required complex SQL.

LLM Limitations

Although large language models (LLMs) excel at natural language processing, direct SQL generation suffers from hallucinations: incorrect field names, irrelevant table joins, or entirely fabricated schema elements, leading to inaccurate or unexecutable queries.

RAG Solution

Retrieval‑Augmented Generation (RAG) mitigates these issues by integrating a vector database with the LLM. External knowledge—data models, business rules, historical query examples—is stored as high‑dimensional embeddings, enabling semantic retrieval of relevant context before the LLM generates SQL.

RAG Workflow

The workflow consists of two main stages:

Document Pre‑processing & Vector Store Construction

Unstructured Loader : Parses various file formats (.docx, .xlsx, .pdf) into plain text streams.

Data Chunking : Splits text into semantically coherent chunks (CHUNKS) based on paragraphs or token limits.

Embedding : Uses a pretrained text‑embedding model to convert each chunk into a high‑dimensional vector.

Vector Database : Persists embeddings and builds an index for fast similarity search.

Question‑Answer Inference

Question Embedding : Converts the user's natural‑language query into a vector.

Retrieval : Searches the vector store for the most similar chunks, returning relevant context.

Prompt Construction : Combines the user question with retrieved context using a template (Instruction + Context + Question).

LLM Inference : The LLM processes the prompt, performs in‑context learning, and generates an accurate SQL statement.

RAG workflow diagram
RAG workflow diagram

Technical Architecture

The platform is organized into layered services:

Foundation Layer : Configuration center, logging & monitoring, permission management.

Data Layer : User‑side data (profiles, query history), business data (tables, knowledge base), and metadata (schema, constraints).

Storage Layer : Business relational databases, cache for frequent query results, and vector store for embeddings.

Service Layer : Query parsing service, retrieval service, LLM inference service, and SQL optimization service.

Application Layer : End‑user query UI, SDK integration for analytics tools, and downstream insight generation.

Chunking Strategies

Three common chunking methods are compared:

Fixed‑length splitting : Simple but may break semantic units.

Sliding‑window splitting : Overlaps chunks to preserve context at the cost of redundancy.

Semantic splitting : Uses punctuation and headings to keep logical units intact.

The article adopts a novel parent‑child chunking mode: parent chunks retain paragraph‑level context, while child chunks provide fine‑grained sentence‑level retrieval. This dual‑layer approach improves both relevance and completeness.

Parent‑child chunking diagram
Parent‑child chunking diagram

Evaluation

Testing shows a 30% accuracy boost in QA tasks and an overall query‑to‑result latency reduction from hours (manual SQL) to minutes (natural‑language input). The platform achieves an 85%+ accuracy rate after filtering queries within the knowledge‑base scope.

Evaluation results
Evaluation results

Challenges & Future Work

Current bottlenecks include retrieval latency in large vector stores and LLM inference time for complex queries. Planned optimizations involve more efficient vector indexing, sharding, caching, and model compression techniques such as distillation and quantization. Testing workflows are being automated with AI agents to reduce manual review effort.

Metadata Example (JSON)

{
  "table_name": "xxx_order_a_d",
  "description": "xxx订单表",
  "partition_strategy": {"field": "log_date"},
  "business_time": {"field": "pay_time", "format": "substr(pay_time,1,10)", "date_format": "yyyy-MM-dd"},
  "fields": [
    {"name": "order_id", "type": "varchar", "description": "业务订单唯一标识"},
    {"name": "pid", "type": "bigint", "description": "套餐产品标识"},
    {"name": "order_amt", "type": "double", "description": "订单金额(单位:元)", "is_metric": true, "aggregation": "SUM()"},
    {"name": "pay_type", "type": "bigint", "description": "支付方式", "enum_mapping": {"1": "支付方式A", "2": "支付方式B", "3": "支付方式C"}}
  ]
}

Conclusion

By leveraging RAG, the platform enables non‑technical business users to obtain accurate SQL results through natural language, dramatically shortening data‑to‑insight cycles and supporting both simple and complex analytical queries.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

LLMRAGvector databaseData Platformnatural language processingSQL generation
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.