
Leveraging Large Language Models for Text-to-SQL: Prompt Design and End-to-End Pipeline

This article explains how large language models can be used to convert natural language queries into SQL statements, describes two main approaches—direct generation and fine‑tuned open‑source models—details prompt engineering techniques, and outlines an end‑to‑end pipeline that executes the generated SQL and summarizes results.

JD Retail Technology

Text2SQL is currently a popular application of large language models: it helps users quickly generate the SQL queries they need. Simple queries are usually generated correctly, while complex logic often still requires manual revision of the generated SQL; the main value of Text2SQL lies in reducing development time and cost.

Text‑to‑SQL (Text2SQL) converts natural language text into structured query language (SQL) and belongs to the semantic parsing sub‑task of natural language processing.

The goal is to break the barrier between humans and structured data, allowing ordinary users to describe complex database queries in natural language and obtain the results.

Background

There are two main approaches to Text2SQL:

Using existing large models (e.g., ChatGPT, GPT‑4) to generate SQL directly, which may expose confidential company data to the model.

Fine‑tuning open‑source large models such as chatglm2‑6b, which is the approach we are currently using; many open datasets are also available.

Text2SQL Usage

The application consists of two stages. In the first stage, an LLM understands the user request and generates a structured SQL statement. In the second stage, the generated SQL is executed against a database, the results are returned, and the LLM summarizes and analyzes them.
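The two stages above can be sketched as a minimal skeleton. The function and parameter names here (`answer_question`, `generate_sql`, `run_query`, `summarize`) are illustrative placeholders, not an actual API from the article:

```python
# Hypothetical two-stage Text2SQL skeleton.
# Stage 1: an LLM turns the user's question into SQL.
# Stage 2: the SQL is executed and the results are summarized by the LLM.
from typing import Callable

def answer_question(
    question: str,
    generate_sql: Callable[[str], str],    # Stage 1: text -> SQL
    run_query: Callable[[str], list],      # Stage 2a: execute against the DB
    summarize: Callable[[str, list], str], # Stage 2b: summarize results
) -> str:
    sql = generate_sql(question)
    rows = run_query(sql)
    return summarize(question, rows)
```

In practice the two LLM calls would go to the fine-tuned model and `run_query` to the database API; keeping them as injected callables makes each stage easy to swap or test in isolation.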

Stage 1: SQL Generation

We improve LLM understanding by (1) building a schema‑information table and generating embeddings for tables, and (2) carefully constructing prompts that embed the schema information.

1. Schema & Embedding: For each table we create a detailed schema description and pre‑compute embeddings using chatglm2‑6b; the top‑1 relevant table is retrieved via embedding similarity.
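The top-1 retrieval step can be illustrated with cosine similarity over pre-computed schema embeddings. In the real pipeline the vectors would come from chatglm2-6b; the toy vectors and the function name `top1_table` below are assumptions for illustration:

```python
import numpy as np

def top1_table(question_vec: np.ndarray, table_vecs: dict) -> str:
    """Return the name of the table whose schema embedding is most
    similar (cosine similarity) to the question embedding."""
    q = question_vec / np.linalg.norm(question_vec)
    names = list(table_vecs)
    mat = np.stack([table_vecs[n] for n in names])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)  # row-normalize
    return names[int(np.argmax(mat @ q))]
```

Normalizing both sides reduces the ranking to a single matrix-vector product, which scales well when the number of tables grows.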

2. Prompt Construction: The prompt begins by defining the LLM’s task (SQL generation) and incorporates the retrieved table schema. Additional in‑context examples can be added to reinforce the desired behavior.
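A minimal sketch of such a prompt builder, with the exact wording and the helper name `build_prompt` assumed rather than taken from the article: the prompt states the task, embeds the retrieved schema, then appends optional few-shot examples before the user's question.

```python
def build_prompt(schema: str, question: str, examples=()) -> str:
    """Assemble a Text2SQL prompt: task definition, retrieved table
    schema, optional in-context examples, then the user question."""
    parts = [
        "You are a SQL generator. Given a table schema and a question, "
        "output a single valid SQL query and nothing else.",
        f"Schema:\n{schema}",
    ]
    for ex_question, ex_sql in examples:       # few-shot demonstrations
        parts.append(f"Question: {ex_question}\nSQL: {ex_sql}")
    parts.append(f"Question: {question}\nSQL:")  # leave completion open
    return "\n\n".join(parts)
```

Ending the prompt with a bare `SQL:` nudges the model to emit only the query, which simplifies parsing in the execution stage.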

Effective prompt engineering is crucial; a well-crafted prompt accounts for a large share of the system's success.

Stage 2: SQL Execution and Result Summarization

After generating the SQL, we connect to the corporate CK database via an API, execute the query, and obtain structured results, which the LLM then summarizes and provides analytical guidance.
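This execution step can be sketched as follows. The article's pipeline talks to the corporate CK database through an API; here an in-memory sqlite3 database stands in for it, and `execute_and_format` is an assumed helper name. The formatted text would then be passed back to the LLM for summarization:

```python
import sqlite3

def execute_and_format(sql: str, conn: sqlite3.Connection) -> str:
    """Execute the generated SQL and format the rows as tab-separated
    text suitable for embedding in a summarization prompt."""
    cur = conn.execute(sql)
    cols = [d[0] for d in cur.description]
    rows = cur.fetchall()
    lines = ["\t".join(cols)] + ["\t".join(map(str, r)) for r in rows]
    return "\n".join(lines)

# Demo on an in-memory stand-in database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
print(execute_and_format(
    "SELECT COUNT(*) AS n, SUM(amount) AS total FROM orders", conn))
```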

Results

We have built a pipeline that uses LLMs to generate SQL, automatically runs the SQL, and produces results. Initially we used GPT models to validate the pipeline and generate training data for fine‑tuning chatglm2‑6b; future work includes deeper data analysis and providing user‑focused recommendations.

Overall Pipeline

Tags: LLM, Prompt Engineering, Text-to-SQL, ChatGLM, data privacy, SQL generation
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
