Unlock Data Power with DB‑GPT: An Open‑Source AI Framework for Data Development
DB‑GPT is an open‑source, AI‑native data application framework that unifies multi‑model management, RAG, agents, and workflow orchestration to simplify building large‑model‑driven data solutions. Its features include private Q&A, multi‑source analytics, automated fine‑tuning, and privacy protection.
1. DB‑GPT Basic Concept
DB‑GPT is an open‑source AI‑native data application development framework (an AI Native Data App Development framework with AWEL and Agents). Its purpose is to build infrastructure for the large‑model domain, providing multi‑model management (SMMF), Text2SQL optimization, a RAG framework, multi‑agent collaboration, and AWEL workflow orchestration, making database‑centric large‑model applications simpler and more convenient to build.
In the Data 3.0 era, enterprises and developers can build custom applications with far less code.
2. Overall Architecture and Core Features
2.1 Overall Architecture
The diagram shows DB‑GPT’s framework: knowledge (RAG) on the left, tools (Agents) on the right, multi‑model management (SMMF) in the middle, a vector store for model memory, adapters for various data sources, and a generic interaction layer on top.
2.2 Core Features
Private Q&A, data processing & RAG – supports custom knowledge‑base construction via multi‑file uploads, plugins, etc., with unified vector storage and retrieval for massive structured and unstructured data.
Multi‑data source & GBI (Generative Business Intelligence) – natural‑language interaction with Excel files, databases, and data warehouses, plus generation of analytical reports.
Multi‑model support – dozens of LLMs including LLaMA/LLaMA2, Baichuan, ChatGLM, Wenxin, Tongyi, Zhipu, Spark, etc.
Automated fine‑tuning – a lightweight Text2SQL fine‑tuning pipeline built around LoRA/QLoRA/P‑Tuning that streamlines the process end to end.
Data‑driven multi‑agent plugins – supports custom plugin execution, the native Auto‑GPT plugin model, and the Agent Protocol standard.
Privacy security – ensures data privacy through private LLM deployment and proxy desensitization techniques.
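The proxy desensitization idea mentioned above can be sketched in a few lines: sensitive values are swapped for placeholders before a prompt leaves the private network, and restored in the model's answer. This is an illustrative toy (the patterns and function names are assumptions, not DB‑GPT's actual implementation):

```python
import re

# Hypothetical desensitization patterns; a real deployment would cover
# many more categories (IDs, addresses, account numbers, ...).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}-\d{4}-\d{4}\b"),
}

def desensitize(text: str):
    """Replace sensitive spans with numbered placeholders; return the mapping."""
    mapping = {}
    for label, pattern in PATTERNS.items():
        for i, match in enumerate(pattern.findall(text)):
            key = f"<{label}_{i}>"
            mapping[key] = match
            text = text.replace(match, key, 1)
    return text, mapping

def restore(text: str, mapping: dict) -> str:
    """Put the original values back into the (proxied) model output."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text

masked, mapping = desensitize("Contact alice@example.com for the report.")
```

Only the masked text is sent to the external model; the mapping never leaves the private environment.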
3. System Architecture
DB‑GPT’s system architecture consists of several layers:
Visualization layer: dialogue, interaction, chart display, visual orchestration.
Application layer: builds applications such as GBI, ChatDB, ChatData, ChatExcel on top of core capabilities.
Service layer: external services like LLMServer, APIServer, RAGServer, dbgptserver.
Core module layer: includes SMMF, RAGs, Agents.
Protocol layer: AWEL (Agentic Workflow Expression Language) for agent workflow orchestration.
Training layer: focuses on fine‑tuning for Text2SQL, Text2DSL, Text2API.
Runtime environment: supports deployment on Kubernetes, Ray, AWS, Alibaba Cloud, and private clouds.
4. Core Modules Overview
4.1 Multi‑Model Management (SMMF)
To simplify model adaptation, DB‑GPT introduces a service‑oriented multi‑model management framework (SMMF), essentially making model services serverless. The hierarchy includes service & application layer, model deployment framework layer (APIServer, Model Handle, Model Controller, Model Worker), inference framework layer (vLLM, llama.cpp, FastChat), and the underlying deployment environment (Kubernetes, Ray, cloud platforms).
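The Model Controller / Model Worker split described above can be illustrated with a minimal registry sketch (class names and methods are illustrative assumptions, not DB‑GPT's real API): each worker registers the model it serves, and the controller routes requests to any worker for that model.

```python
import random

class ModelWorker:
    """One deployed model instance (would wrap vLLM/llama.cpp in practice)."""
    def __init__(self, model_name: str, host: str):
        self.model_name = model_name
        self.host = host

    def generate(self, prompt: str) -> str:
        return f"[{self.model_name}@{self.host}] echo: {prompt}"

class ModelController:
    """Keeps a registry of workers and routes requests by model name."""
    def __init__(self):
        self.registry = {}

    def register(self, worker: ModelWorker) -> None:
        self.registry.setdefault(worker.model_name, []).append(worker)

    def route(self, model_name: str) -> ModelWorker:
        # Pick any worker serving the requested model (toy load balancing).
        return random.choice(self.registry[model_name])

controller = ModelController()
controller.register(ModelWorker("chatglm3-6b", "10.0.0.1"))
controller.register(ModelWorker("chatglm3-6b", "10.0.0.2"))
worker = controller.route("chatglm3-6b")
```

The "serverless" feel comes from the caller addressing models by name only; placement and scaling stay behind the controller.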
4.2 Retrieval‑Augmented Generation (MS‑RAG)
MS‑RAG (Multi‑Source Enhanced Retrieval‑Augmented Generation) provides multi‑document, multi‑source retrieval capabilities, covering the full pipeline of knowledge construction, retrieval, and answer generation.
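The knowledge construction → retrieval → answer generation pipeline can be sketched end to end with a deliberately toy similarity function (a real deployment would use a vector store and learned embeddings; everything here is an illustrative assumption):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list, k: int = 1) -> list:
    # Rank knowledge chunks by similarity to the question.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

# "Knowledge construction": chunked documents in place of a vector store.
chunks = [
    "DB-GPT supports Text2SQL fine-tuning with LoRA.",
    "AWEL is a workflow expression language for agents.",
]
context = retrieve("how does fine-tuning with LoRA work", chunks)[0]
prompt = f"Answer using this context:\n{context}\nQuestion: ..."
```

The multi‑source aspect amounts to running retrieval over several such chunk collections and merging the ranked results before prompting.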
4.3 Data‑Driven Agents
Data‑Driven Multi‑Agents provide a production‑grade framework for building agents that make data‑driven decisions and can be orchestrated within controlled workflows. Core agent modules include Memory, Profile, Planning, and Action, supporting single agents, auto‑plan agents, and AWEL‑orchestrated collaborations.
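The four agent modules named above (Memory, Profile, Planning, Action) can be arranged as in this minimal sketch; the class layout is an assumption for illustration, not DB‑GPT's agent API:

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    """Who the agent is: identity and role shape its behavior."""
    name: str
    role: str

@dataclass
class Memory:
    """What the agent has done or observed so far."""
    events: list = field(default_factory=list)

    def remember(self, event: str) -> None:
        self.events.append(event)

class Agent:
    def __init__(self, profile: Profile):
        self.profile = profile
        self.memory = Memory()

    def plan(self, goal: str) -> list:
        # Planning: decompose the goal into steps (trivially here).
        return [f"analyze: {goal}", f"report: {goal}"]

    def act(self, step: str) -> str:
        # Action: execute one step and record the outcome in memory.
        result = f"{self.profile.name} did [{step}]"
        self.memory.remember(result)
        return result

agent = Agent(Profile(name="dba", role="data analyst"))
results = [agent.act(s) for s in agent.plan("sales by region")]
```

Auto‑plan agents would generate the step list with an LLM instead of a fixed template, and AWEL orchestration would wire several such agents into one controlled workflow.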
4.4 Fine‑Tuning
DB‑GPT supports Text2SQL and Text2API (DSL) fine‑tuning to improve model accuracy. The project provides a dedicated Python package for Text2SQL fine‑tuning, which can be installed via PyPI.
<code>from dbgpt_hub.data_process import preprocess_sft_data
from dbgpt_hub.train import start_sft
from dbgpt_hub.predict import start_predict
from dbgpt_hub.eval import start_evaluate

# Describe the Spider dataset layout for SFT preprocessing.
data_folder = "dbgpt_hub/data"
data_info = [
    {
        "data_source": "spider",
        "train_file": ["train_spider.json", "train_others.json"],
        "dev_file": ["dev.json"],
        "tables_file": "tables.json",
        "db_id_name": "db_id",
        "is_multiple_turn": False,
        "train_output": "spider_train.json",
        "dev_output": "spider_dev.json",
    }
]

# LoRA fine-tuning configuration for CodeLlama-13B-Instruct.
train_args = {
    "model_name_or_path": "codellama/CodeLlama-13b-Instruct-hf",
    "do_train": True,
    "dataset": "example_text2sql_train",
    "max_source_length": 2048,
    "max_target_length": 512,
    "finetuning_type": "lora",
    "lora_target": "q_proj,v_proj",
    "template": "llama2",
    "lora_rank": 64,
    "lora_alpha": 32,
    "output_dir": "dbgpt_hub/output/adapter/CodeLlama-13b-sql-lora",
    "overwrite_cache": True,
    "overwrite_output_dir": True,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "lr_scheduler_type": "cosine_with_restarts",
    "logging_steps": 50,
    "save_steps": 2000,
    "learning_rate": 2e-4,
    "num_train_epochs": 8,
    "plot_loss": True,
    "bf16": True,
}

# Generate SQL predictions on the dev split with the trained adapter.
predict_args = {
    "model_name_or_path": "codellama/CodeLlama-13b-Instruct-hf",
    "template": "llama2",
    "finetuning_type": "lora",
    "checkpoint_dir": "dbgpt_hub/output/adapter/CodeLlama-13b-sql-lora",
    "predict_file_path": "dbgpt_hub/data/eval_data/dev_sql.json",
    "predict_out_dir": "dbgpt_hub/output/",
    "predicted_out_filename": "pred_sql.sql",
}

# Score predictions against gold SQL by execution accuracy ("exec").
evaluate_args = {
    "input": "./dbgpt_hub/output/pred/pred_sql_dev_skeleton.sql",
    "gold": "./dbgpt_hub/data/eval_data/gold.txt",
    "gold_natsql": "./dbgpt_hub/data/eval_data/gold_natsql2sql.txt",
    "db": "./dbgpt_hub/data/spider/database",
    "table": "./dbgpt_hub/data/eval_data/tables.json",
    "table_natsql": "./dbgpt_hub/data/eval_data/tables_for_natsql2sql.json",
    "etype": "exec",
    "plug_value": True,
    "keep_distict": False,
    "progress_bar_for_each_datapoint": False,
    "natsql": False,
}

# Full pipeline: preprocess -> fine-tune -> predict -> evaluate.
preprocess_sft_data(data_folder=data_folder, data_info=data_info)
start_sft(train_args)
start_predict(predict_args)
start_evaluate(evaluate_args)
</code>

4.5 AWEL (Agentic Workflow Expression Language)
AWEL is a workflow expression language designed specifically for large‑model application development. It abstracts low‑level model and environment details, offering a layered API architecture consisting of the Operator layer (basic primitives like retrieval, vectorization, prompting), the AgentFrame layer (chaining operators, supporting distributed execution), and the DSL layer (structured language for deterministic agent programming).
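The Operator‑layer chaining idea can be sketched with a tiny pipeline class; the `>>` style is common in DAG frameworks, but this class and its methods are illustrative assumptions, not AWEL's real API:

```python
class Operator:
    """One step in a workflow; `>>` wires steps into a linear pipeline."""
    def __init__(self, fn):
        self.fn = fn
        self.downstream = None

    def __rshift__(self, other: "Operator") -> "Operator":
        # Attach `other` after the last operator in this chain,
        # then return the head so the chain can keep growing.
        tail = self
        while tail.downstream:
            tail = tail.downstream
        tail.downstream = other
        return self

    def run(self, value):
        value = self.fn(value)
        return self.downstream.run(value) if self.downstream else value

# Toy stand-ins for retrieval, prompting, and model invocation.
retrieve = Operator(lambda q: f"context for: {q}")
build_prompt = Operator(lambda c: f"Answer using {c}")
call_model = Operator(lambda p: f"LLM({p})")

pipeline = retrieve >> build_prompt >> call_model
answer = pipeline.run("top customers")
```

The AgentFrame and DSL layers build on the same idea: the chain is data, so it can be distributed, inspected, or generated from a structured language rather than written by hand.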
5. Conclusion
Applying large models to the data domain is not a simple matter of feeding data into a model; it requires integrating data management, model technology, data security, inference rules, and feedback loops into a complex system that must be continuously optimized and refined.
Data Thinking Notes
Sharing insights on data architecture, governance, and data middle platforms; exploring AI in the data domain; and connecting data with business scenarios.