How Baidu Built an 80% Accurate AI-Powered Database Ops Knowledge Base
This article describes how Baidu Intelligent Cloud's database operations team designed an AI‑driven knowledge‑base Q&A system end to end, covering the background, architecture, technical choices, module implementation, key challenges such as vector‑search recall and token limits, and real‑world deployment scenarios.
1 Background
With the rapid development of large models, AI technology is spreading to more scenarios. In the database operations field, the goal is to combine expert systems with native AI techniques to help DB ops engineers quickly obtain knowledge and make accurate decisions.
Traditional knowledge‑base systems rely on fixed rules, keyword search, and predefined tags, and they assume users already have domain expertise. This falls short in complex, fast‑changing operational environments, which prompted the move to large models for knowledge provision and decision support.
AI applications in databases include knowledge‑base learning (Q&A, management), diagnosis and reasoning (log analysis, fault diagnosis), and work assistance (SQL generation, optimization). This article focuses on the design and implementation of a knowledge‑base intelligent Q&A system.
2 Architecture Design and Implementation
2.1 Technical Solution Selection
Large models can understand natural language and generate coherent answers, but several issues prevent direct use for domain‑specific Q&A:
Insufficient domain expertise leading to hallucinations and inaccurate answers.
Timeliness: training data is outdated and updating incurs high costs.
Security: private internal documents cannot be shared with external models for training or fine‑tuning.
To address these, three techniques are commonly used:
Fine‑tuning with domain data (resource‑intensive).
Prompt engineering to inject domain knowledge (limited by token length).
Combining traditional search with LLM processing for controlled, efficient retrieval.
The chosen approach combines prompt engineering with traditional search: knowledge is stored in a vector database and retrieved via LangChain, and prompts inject the retrieved context into the LLM.
2.2 Module Design and Implementation
The overall workflow includes document loading, splitting, text/question vectorization, answer caching, and LLM generation.
2.2.1 Knowledge Ingestion
Document loading and parsing using LangChain loaders for PDF, CSV, Markdown, as well as Selenium and BeautifulSoup for internal web pages.
Text splitting into short chunks using RecursiveCharacterTextSplitter and SpacyTextSplitter, preserving headings and context.
Vectorization: initial experiments with open‑source embeddings (GanymedeNil, moka‑ai) performed poorly; Baidu’s Wenxin embeddings were selected for superior quality.
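The heading‑preserving splitting mentioned above can be sketched as follows. This is a simplified stand‑in for the LangChain splitters the team used (`split_with_headings` is a hypothetical helper, not part of any library): it carries the nearest heading into every chunk so short chunks keep their topical context.

```python
def split_with_headings(markdown_text, chunk_size=500):
    """Split markdown into chunks, prepending the nearest heading to each
    chunk so heading context survives splitting (hypothetical sketch of the
    'preserve headings and context' idea; the real system used LangChain's
    RecursiveCharacterTextSplitter / SpacyTextSplitter)."""
    chunks, current_heading, buf = [], "", []
    for line in markdown_text.splitlines():
        if line.startswith("#"):
            # New section: flush the buffered chunk under the old heading.
            if buf:
                chunks.append((current_heading + "\n" + "\n".join(buf)).strip())
                buf = []
            current_heading = line.lstrip("#").strip()
            continue
        buf.append(line)
        if sum(len(l) for l in buf) >= chunk_size:
            chunks.append((current_heading + "\n" + "\n".join(buf)).strip())
            buf = []
    if buf:
        chunks.append((current_heading + "\n" + "\n".join(buf)).strip())
    return chunks
```

Each returned chunk starts with its section heading, which also sets up the title‑plus‑content vectorization discussed later.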
2.2.2 Data Retrieval
User query vectorization; cache hit reduces API cost.
Similarity search in the vector database (Baidu Elasticsearch, BES) retrieves the top‑N most relevant chunks (default 10).
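The top‑N retrieval step amounts to ranking stored chunks by cosine similarity against the query vector. A minimal stdlib sketch (the production system delegates this to the vector database; `top_n` and the in‑memory `store` layout are assumptions for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_n(query_vec, store, n=10):
    """Return the texts of the n chunks most similar to the query vector.
    store: list of (chunk_text, chunk_vector) pairs."""
    ranked = sorted(store, key=lambda cv: cosine(query_vec, cv[1]), reverse=True)
    return [text for text, _ in ranked[:n]]
```

The default `n=10` mirrors the top‑N default described above; in practice the vector database performs this search at scale with an approximate nearest‑neighbor index rather than a full sort.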
2.2.3 Result Integration
Prompt generation combines retrieved chunks with the original question, respecting token limits.
LLM generates the final answer; conversation history is stored in MySQL for context.
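Prompt assembly from retrieved chunks, recent conversation turns, and the new question could look like the sketch below. The template wording and `build_prompt` helper are assumptions; the article only specifies that retrieved chunks, the question, and MySQL‑stored history are combined:

```python
def build_prompt(question, chunks, history):
    """Assemble the final LLM prompt: retrieved chunks as numbered
    reference material, recent conversation turns for context, then the
    new question. (Sketch; the real system persists history in MySQL.)"""
    refs = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, 1))
    # Keep only the last few turns so history does not crowd out references.
    turns = "\n".join(f"Q: {q}\nA: {a}" for q, a in history[-3:])
    return (
        "Answer using only the reference material below.\n\n"
        f"Reference material:\n{refs}\n\n"
        f"Conversation so far:\n{turns}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string is what gets sent to the LLM; token‑limit handling for oversized reference material is covered in section 3.2.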
3 Technical Challenges and Solutions
3.1 Low Recall in Vector Database
Initial recall was around 70%; the target is ≥85%. Improvements included precise Chinese‑aware text splitting, title compensation for large chunks, and combined title‑plus‑content vectorization, which significantly boosted recall.
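The title‑plus‑content vectorization can be illustrated with a small sketch (`embedding_inputs` is a hypothetical helper): the text sent to the embedding model has the title prepended, while the stored chunk stays unchanged so the LLM later sees the original text.

```python
def embedding_inputs(title, chunks):
    """Pair each stored chunk with the text actually embedded: the section
    or document title prepended to the chunk ('title plus content'
    vectorization). Short chunks thus still carry topical context, which
    is what lifted recall in the experiments described above."""
    return [{"text": c, "embed_input": f"{title}\n{c}"} for c in chunks]
```

Each record's `embed_input` goes to the embedding API; `text` is what the vector database returns at query time.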
3.2 Token Length Limitation
LLM context windows range from 2K to 100K tokens. Strategies used:
Discard low‑similarity chunks when the prompt exceeds limits.
Select models with larger context windows (ERNIE‑Bot‑turbo supports roughly 10K tokens), while still preferring ERNIE‑Bot, whose QA quality is better, whenever the prompt fits within its 2K‑token window.
Prompt compression attempts were ineffective for Chinese.
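The first strategy above, dropping low‑similarity chunks until the prompt fits, can be sketched like this. The character‑per‑token ratio is a rough assumption (Chinese text runs at roughly one to two characters per token), not a real tokenizer, and `fit_to_window` is a hypothetical helper:

```python
def fit_to_window(scored_chunks, max_tokens=2000, chars_per_token=2):
    """Drop the least-similar chunks until the estimated token count fits
    the model's context window.
    scored_chunks: list of (similarity, text) pairs from the vector search.
    chars_per_token is a crude estimate, not a tokenizer."""
    kept = sorted(scored_chunks, key=lambda sc: sc[0], reverse=True)
    while kept and sum(len(t) for _, t in kept) // chars_per_token > max_tokens:
        kept.pop()  # discard the lowest-similarity chunk first
    return [text for _, text in kept]
```

A production version would count tokens with the model's actual tokenizer, but the trimming order — least similar first — is the point.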
3.3 Stale Knowledge and Hallucination
Large models may provide outdated or fabricated answers. The solution combines search‑based retrieval (official documentation APIs) with LLM reasoning, ensuring up‑to‑date, accurate responses.
4 Application Scenarios
The system is deployed via two main channels:
Database Chat : a ChatGPT‑like interface with knowledge and user management, integrated into Baidu’s DBSC cockpit (public release planned).
IM Bot : integrates with collaboration tools (WeChat, Feishu, etc.) to provide instant knowledge retrieval within chat groups.
5 Conclusion
From an engineering perspective, building a domain‑specific knowledge base with vector databases and large AI models is straightforward, yet challenges remain in retrieval recall, long‑text handling, and maintaining up‑to‑date knowledge. Continuous model upgrades and data freshness are essential for future improvements.
Baidu Intelligent Cloud Tech Hub
