Unlock MindsDB Knowledge Base: Build RAG Pipelines and Data Agents with SQL
This article walks through MindsDB’s Knowledge Base feature, showing how to map data sources, create vector indexes, perform semantic search, combine multiple sources with SQL joins, automate updates via jobs, and construct powerful RAG pipelines and Data Agents for AI‑driven query answering.
Knowledge Base and Hybrid Retrieval
MindsDB lets you expose any data source—relational tables, files, or vector stores—as a virtual SQL table. By creating a KNOWLEDGE_BASE, embeddings are generated automatically, stored in a vector database, and can be queried with plain SQL.
```sql
create table my_chromadb.collection_name (
    select embedding as embeddings, f.content, f.metadata
    from files.table1 f
    join my_emb_openai m
);
```

Typical workflow:
1. Map the data source (RDBMS or file) to expose raw content.
2. Create a vector store to hold embeddings for semantic search.
3. Define an embedding model that converts the content into vectors.
4. Build a vector index linking source rows to their embeddings.
5. Run semantic retrieval against the vector store.
6. Optionally re-rank results with a rerank model.
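Steps 2 and 3 can be sketched in MindsDB SQL. The engine choices (ChromaDB, OpenAI), model names, and parameters below are illustrative assumptions, not part of the original example; check your MindsDB version's handler documentation for the exact options:

```sql
-- Step 2 (sketch): create a vector store using the ChromaDB handler.
CREATE DATABASE my_chromadb
WITH ENGINE = 'chromadb',
PARAMETERS = {"persist_directory": "/tmp/chroma"};

-- Step 3 (sketch): define an embedding model; OpenAI provider assumed.
CREATE MODEL my_emb_openai
PREDICT embedding
USING
    engine = 'openai',
    mode = 'embedding',
    model_name = 'text-embedding-3-small',
    question_column = 'content';
```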
Creating a Knowledge Base
A Knowledge Base bundles the above steps into a single virtual table. Example:
```sql
create KNOWLEDGE_BASE my_sales_question_kb
using (
    content_columns = ['content'],
    preprocessing = {
        "text_chunking_config": {"chunk_size": 1000, "chunk_overlap": 100}
    }
);
```

Key parameters include the embedding and rerank models, column mappings (id_column, content_columns, metadata_columns), and preprocessing options such as chunk size and overlap.
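To show how those parameters fit together, here is a fuller definition sketch. The provider, model names, id/metadata columns, and API key placeholder are assumptions for illustration; the parameter names follow MindsDB's documented knowledge-base options, but verify them against your version:

```sql
-- Sketch: a Knowledge Base with explicit embedding and reranking models.
create KNOWLEDGE_BASE my_sales_question_kb
using (
    embedding_model = {
        "provider": "openai",
        "model_name": "text-embedding-3-small",
        "api_key": "YOUR_OPENAI_API_KEY"
    },
    reranking_model = {
        "provider": "openai",
        "model_name": "gpt-4o",
        "api_key": "YOUR_OPENAI_API_KEY"
    },
    id_column = 'id',
    content_columns = ['content'],
    metadata_columns = ['category'],
    preprocessing = {
        "text_chunking_config": {"chunk_size": 1000, "chunk_overlap": 100}
    }
);
```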
Inserting Data and Semantic Search
Load data with a simple INSERT INTO; MindsDB creates the embedding model, vector store, and populates the index automatically.
```sql
insert into my_sales_question_kb
select * from files.sales_questions;
```

Semantic queries target the content column and can filter by relevance, limit, or metadata:
```sql
select * from my_sales_question_kb
where content = 'What should I do when a customer offers an extremely low price?'
and relevance >= 0.7
limit 5;
```

Cross-Source Hybrid Retrieval
Because a Knowledge Base behaves like a regular table, it can be joined with other relational tables to combine vector‑based and scalar conditions.
```sql
SELECT p.product_id, p.product_name, p.description
FROM my_kb AS k
JOIN my_postgres.products AS p ON k.id = p.product_id
WHERE k.content = 'phones popular with women'
AND k.relevance >= 0.5
AND p.brand IN ('Apple', 'Huawei')
AND p.price < 10000
LIMIT 5;
```

Automatic Knowledge Base Updates
MindsDB can keep a Knowledge Base in sync with its source using the CREATE JOB mechanism (similar to Oracle scheduler jobs or MySQL's event scheduler). Jobs can synchronize:
Different data sources.
Data source ↔ Knowledge Base.
Data source ↔ Model (re‑training).
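A job that periodically pushes new source rows into the Knowledge Base can be sketched as follows. The job name and schedule are assumptions; the LAST keyword is MindsDB's way of selecting only rows added since the previous run:

```sql
-- Sketch: sync new source rows into the Knowledge Base every hour.
CREATE JOB sync_sales_kb (
    INSERT INTO my_sales_question_kb
    SELECT * FROM files.sales_questions
    WHERE id > LAST
)
EVERY 1 hour;
```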
For near‑real‑time sync, a Trigger can be used where the underlying database supports it.
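A trigger-based sync might look like the sketch below. The trigger name and column mapping are assumptions, and only some MindsDB integrations support triggers, so confirm that your connected database is one of them:

```sql
-- Sketch: re-index a row in the Knowledge Base whenever the source changes.
CREATE TRIGGER kb_sync_trigger
ON my_postgres.products
(
    INSERT INTO my_kb
    SELECT product_id AS id, description AS content
    FROM my_postgres.products
);
```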
Data Agents (SQL Agents)
A Data Agent is a LangChain‑based service that combines a Knowledge Base, relational tables, and an LLM to answer natural‑language questions. It is defined with CREATE AGENT and configured with the target model, API key, knowledge bases, tables, and a prompt template.
```sql
CREATE AGENT my_agent
USING
    engine = 'openai',
    model = 'gpt-4o',
    openai_api_key = 'YOUR_OPENAI_API_KEY',
    knowledge_bases = ['my_kb'],
    tables = ['my_postgres.products', 'my_postgres.customers',
              'my_postgres.orders', 'my_postgres.order_items'],
    prompt_template = '- Knowledge base my_kb: vector index of product descriptions
- Table my_postgres.products: structured product information
- Table my_postgres.customers: customer information
- Tables my_postgres.orders & my_postgres.order_items: order history';
```

Query the agent with SQL:
```sql
SELECT answer FROM my_agent
WHERE question = 'YOUR_QUESTION';
```

The agent first performs a semantic lookup in the Knowledge Base, then executes any required SQL joins, and finally feeds the retrieved context to the LLM to generate an answer.
Practical Recommendations
MindsDB acts as an AI‑enabled data middleware: it avoids data movement, lowers the barrier to using LLMs via SQL, and provides a unified interface for heterogeneous sources. It is well‑suited for enterprise environments with multiple legacy systems where a virtual AI data layer is needed.
The Knowledge Base simplifies unstructured data indexing, the Job/Trigger system automates synchronization, and Data Agents enable low‑code conversational interfaces for business users.
AI Large Model Application Practice
Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.