Design and Implementation of a Knowledge-Base Intelligent Q&A System for Database Operations Using Large Models
This article details Baidu Intelligent Cloud's design and deployment of a domain-specific knowledge-base Q&A system for database operations. The system combines prompt-engineered LLMs with hybrid vector search built on LangChain, the BES vector store, and a custom ingestion pipeline, and it addresses recall, token-limit, and hallucination challenges across dashboard and IM-bot interfaces.
This article, originating from Baidu Intelligent Cloud's database operations team, presents a detailed case study of building a knowledge‑base intelligent Q&A system powered by large language models (LLMs). It covers the overall technical solution, module designs, key challenges, and real‑world deployment scenarios.
1. Background – With the rapid development of large models, AI is becoming pervasive. In the database operations domain, the goal is to combine expert systems with native AI to help engineers quickly retrieve knowledge and make accurate operational decisions.
Traditional knowledge‑base systems rely on static rules, keyword search, and predefined tags, requiring users to have professional expertise. Such approaches no longer meet the needs of complex, dynamic operational environments.
2. Architecture Design and Implementation
2.1 Technical Approach Selection – The team evaluated three main approaches: fine-tuning, prompt engineering, and hybrid search plus LLM. They chose a combination of prompt engineering and hybrid search, using a vector database as external memory and LangChain as the development framework.
2.2 Module Design
Knowledge Ingestion – Documents (PDF, CSV, Markdown, web pages) are loaded via LangChain, Selenium, or BeautifulSoup. Text is split into short chunks using RecursiveCharacterTextSplitter and SpacyTextSplitter, preserving context.
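The core idea behind splitters like RecursiveCharacterTextSplitter is fixed-size chunking with overlap, so that context straddling a cut survives in both neighboring chunks. A minimal pure-Python sketch of that idea (not the LangChain API itself; chunk and overlap sizes are illustrative):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size chunks that overlap, so context
    straddling a boundary is preserved in both neighboring chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Real splitters additionally prefer to cut at paragraph and sentence boundaries (the "recursive" part), falling back to character positions only when no separator fits.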
Text Vectorization – Initially tried open‑source embeddings (GanymedeNil, m3e) but settled on Baidu’s Wenxin embeddings for better performance.
Storage – Vectors and metadata are stored in a vector database. After testing ElasticSearch, BES, Milvus, and PGVector, BES (Baidu ElasticSearch) was selected for its HNSW implementation and resource efficiency.
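BES exposes an Elasticsearch-compatible API, so an index for chunk storage would look roughly like a standard dense-vector mapping. The fragment below uses open-source Elasticsearch 8 syntax as a stand-in; BES's exact parameter names, the index name, and the embedding dimension (384) are assumptions, not taken from the article:

```json
PUT /kb_chunks
{
  "mappings": {
    "properties": {
      "title":     { "type": "keyword" },
      "content":   { "type": "text" },
      "embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine",
        "index_options": { "type": "hnsw", "m": 16, "ef_construction": 128 }
      }
    }
  }
}
```

Storing title and content alongside the vector lets retrieval return the original text and metadata in one round trip.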
Data Retrieval – User queries are vectorized, cached with GPTCache, and the top‑10 similar chunks are retrieved from the vector DB.
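The retrieval step can be sketched as cosine-similarity ranking behind a query cache; the class below plays the role GPTCache fills in the article (GPTCache also supports semantic, not just exact-match, cache hits). All names here are illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class CachedRetriever:
    """Top-k similarity search over (chunk_text, embedding) pairs,
    with a per-query cache so repeated questions skip the search."""
    def __init__(self, store: list[tuple[str, list[float]]]):
        self.store = store
        self._cache: dict[str, list[str]] = {}

    def top_k(self, query: str, query_vec: list[float], k: int = 10) -> list[str]:
        if query in self._cache:          # cache hit: no vector search needed
            return self._cache[query]
        ranked = sorted(self.store, key=lambda it: cosine(it[1], query_vec),
                        reverse=True)
        hits = [text for text, _ in ranked[:k]]
        self._cache[query] = hits
        return hits
```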
Result Integration – Retrieved chunks are assembled into a prompt (respecting token limits) and sent to the LLM. The LLM generates the final answer, which is also stored in MySQL for conversation history.
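Assembling retrieved chunks under a token budget is essentially greedy packing of the best-ranked chunks. A sketch of that step, with a character count standing in for a real tokenizer (the prompt wording and `estimate` heuristic are assumptions):

```python
def build_prompt(question: str, ranked_chunks: list[str],
                 max_tokens: int = 2000,
                 estimate=lambda s: len(s)) -> str:
    """Greedily pack the best-ranked chunks into the prompt without
    exceeding the token budget. `estimate` stands in for a tokenizer."""
    header = "Answer using only the context below.\nContext:\n"
    footer = f"\nQuestion: {question}"
    budget = max_tokens - estimate(header) - estimate(footer)
    picked = []
    for chunk in ranked_chunks:
        cost = estimate(chunk) + 1      # +1 for the joining newline
        if cost > budget:
            break                       # chunks are ranked: stop at the first miss
        picked.append(chunk)
        budget -= cost
    return header + "\n".join(picked) + footer
```

Because the chunks arrive ranked by similarity, truncation drops the least relevant context first.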
3. Technical Challenges and Solutions
3.1 Low recall in vector DB – Improved text splitting (using Spacy, hierarchical chunking, title compensation) and combined title+content vectorization to boost recall.
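The "title compensation" trick amounts to prefixing each chunk with its document and section titles before embedding, so a short chunk that never names its own topic still lands near topical queries. A minimal sketch (the `>` separator and function name are illustrative):

```python
def compensate_titles(doc_title: str, sections: dict[str, list[str]]) -> list[str]:
    """Prepend document and section titles to every chunk before
    embedding, so chunk vectors carry the topic even when the chunk
    text itself never mentions it."""
    return [f"{doc_title} > {heading}\n{chunk}"
            for heading, chunks in sections.items()
            for chunk in chunks]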
3.2 Token length limits – Addressed through model selection (ERNIE-Bot-turbo vs. ERNIE-Bot), prompt pruning, and MapReduce-style multi-turn LLM calls to handle long texts.
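The MapReduce-style pattern splits an over-long context into groups, asks the LLM to condense each group (map), then answers from the concatenated condensations (reduce). A sketch with any `prompt -> text` callable standing in for the LLM (prompt wording and group size are assumptions):

```python
def map_reduce_answer(llm, question: str, chunks: list[str],
                      group_size: int = 3) -> str:
    """Answer over context longer than the model window: condense each
    group of chunks (map), then answer from the summaries (reduce).
    `llm` is any callable taking a prompt string and returning text."""
    partials = []
    for i in range(0, len(chunks), group_size):
        group = "\n".join(chunks[i:i + group_size])
        partials.append(llm(f"Summarize what is relevant to '{question}':\n{group}"))
    summaries = "\n".join(partials)
    return llm(f"Answer '{question}' using these summaries:\n{summaries}")
```

The trade-off is extra latency and cost: each map group is a separate model call, plus one final reduce call.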
3.3 Stale knowledge and hallucinations – Integrated keyword extraction, official documentation search, and a hybrid pipeline that combines search results with LLM reasoning to mitigate outdated or fabricated answers.
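One simple way to combine the two sources is to interleave fresh documentation-search hits with vector-store hits while deduplicating, so stale indexed chunks cannot dominate the prompt. This merge strategy is an assumption of mine, not the article's stated algorithm:

```python
from itertools import zip_longest

def hybrid_merge(search_hits: list[str], vector_hits: list[str],
                 k: int = 10) -> list[str]:
    """Interleave fresh documentation-search results with vector-store
    hits, dropping duplicates, and keep the top k for the prompt."""
    merged, seen = [], set()
    for pair in zip_longest(search_hits, vector_hits):
        for item in pair:
            if item is not None and item not in seen:
                seen.add(item)
                merged.append(item)
    return merged[:k]
```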
4. Application Scenarios – The system is deployed in two ways: Database Chat (integrated into the DBSC dashboard) and IM bots (WeChat, Feishu, etc.) for quick knowledge access.
5. Summary – Building a domain‑specific knowledge‑base using vector databases and LLMs is feasible but still faces challenges such as retrieval accuracy and handling long contexts. Continuous model upgrades, better document management, and research on retrieval techniques are essential for future improvements.
Baidu Geek Talk