
Unlock Data Power with DB‑GPT: An Open‑Source AI Framework for Data Development

DB‑GPT is an open‑source AI‑native data application framework that unifies multi‑model management, RAG, agents, and workflow orchestration to simplify building large‑model‑driven data solutions, offering features such as private‑domain Q&A, multi‑source analytics, automated fine‑tuning, and robust privacy protection.


1. DB‑GPT Basic Concept

DB‑GPT is an open‑source AI‑native data application development framework (AI Native Data App Development framework with AWEL and Agents). Its purpose is to build infrastructure for the large‑model domain, providing multi‑model management (SMMF), Text2SQL optimization, a RAG framework, multi‑agent collaboration, and AWEL workflow orchestration, making database‑centric large‑model applications simpler and more convenient to build.

In the Data 3.0 era, enterprises and developers can build custom applications with far less code.

2. Overall Architecture and Core Features

2.1 Overall Architecture

At a high level, DB‑GPT's framework combines a knowledge layer (RAG), a tool layer (Agents), multi‑model management (SMMF) at the core, a vector store that serves as model memory, adapters for connecting various data sources, and a generic interaction layer on top.

2.2 Core Features

Private Q&A, data processing & RAG – supports custom knowledge‑base construction via multi‑file uploads, plugins, etc., with unified vector storage and retrieval for massive structured and unstructured data.

Multi‑data source & GBI – natural‑language interaction with Excel, databases, data warehouses, and generation of analytical reports.

Multi‑model support – dozens of LLMs including LLaMA/LLaMA2, Baichuan, ChatGLM, Wenxin, Tongyi, Zhipu, Spark, etc.

Automated fine‑tuning – lightweight pipeline for Text2SQL fine‑tuning using LoRA/QLoRA/PTuning, making the process as simple as a production line.

Data‑driven multi‑Agents plugins – supports custom plugin execution, native Auto‑GPT plugin model, and Agent Protocol standard.

Privacy security – ensures data privacy through private LLM deployment and proxy desensitization techniques.
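To make the proxy‑desensitization idea above concrete, here is a minimal sketch of the general pattern: masking obvious sensitive fields before a prompt leaves the private network. The function name and regex patterns are illustrative examples, not DB‑GPT's actual implementation.

```python
import re

def desensitize(text: str) -> str:
    """Mask obvious PII before text is sent to an external model (toy example)."""
    # US-style social security numbers, e.g. 123-45-6789
    text = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", text)
    # Email addresses
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[EMAIL]", text)
    return text

print(desensitize("contact alice@example.com, SSN 123-45-6789"))
# → contact [EMAIL], SSN [SSN]
```

A production proxy would cover many more entity types (names, phone numbers, account IDs) and typically restore the original values in the model's response.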

3. System Architecture

DB‑GPT’s system architecture consists of several layers:

Visualization layer: dialogue, interaction, chart display, visual orchestration.

Application layer: builds applications such as GBI, ChatDB, ChatData, ChatExcel on top of core capabilities.

Service layer: external services like LLMServer, APIServer, RAGServer, dbgptserver.

Core module layer: includes SMMF, RAGs, Agents.

Protocol layer: AWEL (Agentic Workflow Expression Language) for agent workflow orchestration.

Training layer: focuses on fine‑tuning for Text2SQL, Text2DSL, Text2API.

Runtime environment: supports deployment on Kubernetes, Ray, AWS, Alibaba Cloud, and private clouds.

4. Core Modules Overview

4.1 Multi‑Model Management (SMMF)

To simplify model adaptation, DB‑GPT introduces a service‑oriented multi‑model management framework (SMMF), essentially making model services serverless. The hierarchy includes service & application layer, model deployment framework layer (APIServer, Model Handle, Model Controller, Model Worker), inference framework layer (vLLM, llama.cpp, FastChat), and the underlying deployment environment (Kubernetes, Ray, cloud platforms).
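The controller/worker split described above can be sketched as follows. This is a hypothetical toy illustration of the routing pattern, not DB‑GPT's actual classes: a controller keeps a registry of workers and forwards each request to whichever worker serves the requested model.

```python
class ModelWorker:
    """Hosts one model and serves inference requests for it (toy example)."""
    def __init__(self, model_name: str):
        self.model_name = model_name

    def generate(self, prompt: str) -> str:
        # A real worker would call an inference backend such as vLLM here
        return f"[{self.model_name}] reply to: {prompt}"

class ModelController:
    """Registry that routes requests to the worker serving the requested model."""
    def __init__(self):
        self.workers = {}

    def register(self, worker: ModelWorker) -> None:
        self.workers[worker.model_name] = worker

    def route(self, model_name: str, prompt: str) -> str:
        worker = self.workers.get(model_name)
        if worker is None:
            raise KeyError(f"no worker for model {model_name!r}")
        return worker.generate(prompt)

controller = ModelController()
controller.register(ModelWorker("vicuna-13b"))
controller.register(ModelWorker("chatglm2-6b"))
print(controller.route("chatglm2-6b", "hello"))
# → [chatglm2-6b] reply to: hello
```

The "serverless" framing in SMMF means callers address models by name, as in `route()` above, while deployment, scaling, and placement of workers are handled behind the registry.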

4.2 Retrieval‑Augmented Generation (MS‑RAG)

MS‑RAG (Multi‑Source Enhanced Retrieval‑Augmented Generation) provides multi‑document, multi‑source retrieval capabilities, covering the full pipeline of knowledge construction, retrieval, and answer generation.
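The retrieve‑then‑generate pipeline can be illustrated with a deliberately tiny sketch. Everything here is a stand‑in: the bag‑of‑words "embedding" replaces a real vector model, and the document names are invented; the point is only the shape of multi‑source retrieval, ranking sources by similarity to the query.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words term counts; real RAG uses a vector model
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# Documents from multiple sources, keyed by a source-prefixed id (hypothetical)
docs = {
    "kb:orders": "monthly order volume by region",
    "wiki:etl": "the ETL pipeline loads orders into the warehouse",
    "faq:login": "how to reset your account password",
}

def retrieve(query: str, k: int = 2) -> list:
    """Return the k document ids most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(docs[d])), reverse=True)[:k]

print(retrieve("order volume"))
# → ['kb:orders', ...]
```

In a full MS‑RAG pipeline the retrieved chunks would then be packed into a prompt and passed to the LLM for answer generation.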

4.3 Data‑Driven Agents

Data‑Driven Multi‑Agents provide a production‑grade framework for building agents that make data‑driven decisions and can be orchestrated within controlled workflows. Core agent modules include Memory, Profile, Planning, and Action, supporting single agents, auto‑plan agents, and AWEL‑orchestrated collaborations.
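The four modules named above can be pictured with a minimal sketch. Class names and method shapes here are hypothetical illustrations of the Memory/Profile/Planning/Action decomposition, not DB‑GPT's real agent classes.

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    """Who the agent is: identity and role."""
    name: str
    role: str

@dataclass
class Memory:
    """What the agent has seen and done."""
    events: list = field(default_factory=list)

    def remember(self, event: str) -> None:
        self.events.append(event)

class Agent:
    def __init__(self, profile: Profile):
        self.profile = profile
        self.memory = Memory()

    def plan(self, goal: str) -> list:
        # Planning: decompose a goal into ordered steps (trivial split here;
        # a real agent would ask an LLM to produce the plan)
        return [f"step {i + 1}: {p.strip()}" for i, p in enumerate(goal.split(","))]

    def act(self, step: str) -> str:
        # Action: execute one step and record the outcome in memory
        self.memory.remember(step)
        return f"done: {step}"

agent = Agent(Profile(name="analyst", role="data analysis"))
results = [agent.act(s) for s in agent.plan("load data, run query")]
print(results)
# → ['done: step 1: load data', 'done: step 2: run query']
```

AWEL orchestration then slots such agents into a controlled workflow instead of letting them loop freely.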

4.4 Fine‑Tuning

DB‑GPT supports Text2SQL and Text2API (DSL) fine‑tuning to improve model accuracy. The project provides a dedicated Python package for Text2SQL fine‑tuning, which can be installed via PyPI.

```python
from dbgpt_hub.data_process import preprocess_sft_data
from dbgpt_hub.train import start_sft
from dbgpt_hub.predict import start_predict
from dbgpt_hub.eval import start_evaluate

# Source datasets (Spider) and how to preprocess them into SFT training format
data_folder = "dbgpt_hub/data"
data_info = [
    {
        "data_source": "spider",
        "train_file": ["train_spider.json", "train_others.json"],
        "dev_file": ["dev.json"],
        "tables_file": "tables.json",
        "db_id_name": "db_id",
        "is_multiple_turn": False,
        "train_output": "spider_train.json",
        "dev_output": "spider_dev.json",
    }
]

# LoRA fine-tuning configuration for CodeLlama-13B-Instruct
train_args = {
    "model_name_or_path": "codellama/CodeLlama-13b-Instruct-hf",
    "do_train": True,
    "dataset": "example_text2sql_train",
    "max_source_length": 2048,
    "max_target_length": 512,
    "finetuning_type": "lora",
    "lora_target": "q_proj,v_proj",
    "template": "llama2",
    "lora_rank": 64,
    "lora_alpha": 32,
    "output_dir": "dbgpt_hub/output/adapter/CodeLlama-13b-sql-lora",
    "overwrite_cache": True,
    "overwrite_output_dir": True,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "lr_scheduler_type": "cosine_with_restarts",
    "logging_steps": 50,
    "save_steps": 2000,
    "learning_rate": 2e-4,
    "num_train_epochs": 8,
    "plot_loss": True,
    "bf16": True,
}

# Generate SQL predictions on the dev set using the fine-tuned LoRA adapter
predict_args = {
    "model_name_or_path": "codellama/CodeLlama-13b-Instruct-hf",
    "template": "llama2",
    "finetuning_type": "lora",
    "checkpoint_dir": "dbgpt_hub/output/adapter/CodeLlama-13b-sql-lora",
    "predict_file_path": "dbgpt_hub/data/eval_data/dev_sql.json",
    "predict_out_dir": "dbgpt_hub/output/",
    "predicted_out_filename": "pred_sql.sql",
}

# Score predictions against gold SQL by execution accuracy ("etype": "exec")
evaluate_args = {
    "input": "./dbgpt_hub/output/pred/pred_sql_dev_skeleton.sql",
    "gold": "./dbgpt_hub/data/eval_data/gold.txt",
    "gold_natsql": "./dbgpt_hub/data/eval_data/gold_natsql2sql.txt",
    "db": "./dbgpt_hub/data/spider/database",
    "table": "./dbgpt_hub/data/eval_data/tables.json",
    "table_natsql": "./dbgpt_hub/data/eval_data/tables_for_natsql2sql.json",
    "etype": "exec",
    "plug_value": True,
    "keep_distict": False,
    "progress_bar_for_each_datapoint": False,
    "natsql": False,
}

# Full pipeline: preprocess -> fine-tune -> predict -> evaluate
preprocess_sft_data(data_folder=data_folder, data_info=data_info)
start_sft(train_args)
start_predict(predict_args)
start_evaluate(evaluate_args)
```

4.5 AWEL (Agentic Workflow Expression Language)

AWEL is a workflow expression language designed specifically for large‑model application development. It abstracts low‑level model and environment details, offering a layered API architecture consisting of the Operator layer (basic primitives like retrieval, vectorization, prompting), the AgentFrame layer (chaining operators, supporting distributed execution), and the DSL layer (structured language for deterministic agent programming).
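The operator‑chaining idea at the heart of the Operator layer can be shown with a self‑contained toy. This is not AWEL's actual API; the `Operator` class and the use of `>>` for composition are illustrative stand‑ins for how small workflow steps (retrieve, build a prompt, call a model) compose into a pipeline.

```python
class Operator:
    """A single workflow step: wraps a function and supports >> chaining (toy)."""
    def __init__(self, fn):
        self.fn = fn

    def __rshift__(self, other: "Operator") -> "Operator":
        # self >> other: compose two steps into one pipeline operator
        return Operator(lambda x: other.fn(self.fn(x)))

    def call(self, x):
        return self.fn(x)

# Hypothetical steps: retrieval produces documents, prompting packs them up
retrieve = Operator(lambda q: [f"doc about {q}"])
build_prompt = Operator(lambda docs: "Context:\n" + "\n".join(docs))

pipeline = retrieve >> build_prompt
print(pipeline.call("sales"))
# → Context:
#   doc about sales
```

The real AWEL adds what this toy lacks: DAGs rather than linear chains, distributed execution in the AgentFrame layer, and a DSL for declaring workflows deterministically.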

5. Conclusion

Applying large models to the data domain is not a simple matter of feeding data into a model; it requires integrating data management, model technology, data security, inference rules, and feedback loops into a complex system that must be continuously optimized and refined.

Tags: AI, large language models, RAG, fine‑tuning, open source, agents, data framework
Written by Data Thinking Notes

Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.