Unlock MindsDB Knowledge Base: Build RAG Pipelines and Data Agents with SQL

This article walks through MindsDB’s Knowledge Base feature, showing how to map data sources, create vector indexes, perform semantic search, combine multiple sources with SQL joins, automate updates via jobs, and construct powerful RAG pipelines and Data Agents for AI‑driven query answering.

AI Large Model Application Practice

Knowledge Base and Hybrid Retrieval

MindsDB lets you expose any data source—relational tables, files, or vector stores—as a virtual SQL table. By creating a KNOWLEDGE_BASE, embeddings are generated automatically, stored in a vector database, and can be queried with plain SQL.

create table my_chromadb.collection_name (
  select m.embedding as embeddings, f.content, f.metadata
  from files.table1 as f
  join my_emb_openai as m
)

Typical workflow:

Map the data source (RDBMS or file) to expose raw content.

Create a vector store to hold embeddings for semantic search.

Define an embedding model that converts the content into vectors.

Build a vector index linking source rows to their embeddings.

Run semantic retrieval against the vector store.

Optional re‑ranking with a Rerank model.
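The first three steps above can be sketched as MindsDB statements. Note that the connection parameters, the ChromaDB persistence path, and the embedding model name below are illustrative placeholders, not values taken from the article:

```sql
-- 1. Map the data source (here a hypothetical PostgreSQL connection)
create database my_postgres
with engine = 'postgres',
parameters = {
  "host": "localhost",
  "port": 5432,
  "database": "sales",
  "user": "mindsdb",
  "password": "YOUR_PASSWORD"
};

-- 2. Create a vector store to hold embeddings
create database my_chromadb
with engine = 'chromadb',
parameters = {"persist_directory": "/tmp/chromadb"};

-- 3. Define an embedding model that converts content into vectors
create model my_emb_openai
predict embedding
using
  engine = 'openai',
  mode = 'embedding',
  model_name = 'text-embedding-3-small',
  question_column = 'content';
```

Once these objects exist, joining a source table against the embedding model (as in the `create table` example above) materializes the vectors into the vector store.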

Creating a Knowledge Base

A Knowledge Base bundles the above steps into a single virtual table. Example:

CREATE KNOWLEDGE_BASE my_sales_question_kb
USING (
  content_columns = ['content'],
  preprocessing = {
    "text_chunking_config": {"chunk_size": 1000, "chunk_overlap": 100}
  }
)

Key parameters include the embedding and Rerank models, column mappings (id_column, content_columns, metadata_columns), and preprocessing options such as chunk size and overlap.
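For reference, a more complete definition that spells these parameters out might look like the following. The embedding and reranking model settings and the `category` metadata column are illustrative assumptions, not values from the article:

```sql
CREATE KNOWLEDGE_BASE my_sales_question_kb
USING (
  embedding_model = {
    "provider": "openai",
    "model_name": "text-embedding-3-small",
    "api_key": "YOUR_OPENAI_API_KEY"
  },
  reranking_model = {
    "provider": "openai",
    "model_name": "gpt-4o",
    "api_key": "YOUR_OPENAI_API_KEY"
  },
  id_column = 'id',
  content_columns = ['content'],
  metadata_columns = ['category'],
  preprocessing = {
    "text_chunking_config": {"chunk_size": 1000, "chunk_overlap": 100}
  }
)
```

When the model parameters are omitted, MindsDB falls back to its configured defaults, which is what the shorter example above relies on.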

Inserting Data and Semantic Search

Load data with a simple INSERT INTO; MindsDB automatically creates the embedding model and vector store and populates the index.

insert into my_sales_question_kb
select * from files.sales_questions

Semantic queries target the content column and can filter by relevance, limit, or metadata:

select * from my_sales_question_kb
where content = 'What should I do when a customer offers an extremely low price?'
  and relevance >= 0.7
limit 5
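Metadata columns declared on the Knowledge Base can also appear as ordinary filter conditions alongside the semantic match. The `category` column below is a hypothetical metadata column, assumed for illustration:

```sql
select chunk_content, relevance
from my_sales_question_kb
where content = 'How should I respond to a lowball offer?'
  and category = 'negotiation'
  and relevance >= 0.7
limit 5
```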

Cross‑Source Hybrid Retrieval

Because a Knowledge Base behaves like a regular table, it can be joined with other relational tables to combine vector‑based and scalar conditions.

SELECT p.product_id, p.product_name, p.description
FROM my_kb AS k
JOIN my_postgres.products AS p ON k.id = p.product_id
WHERE k.content = 'phones popular with women'
  AND k.relevance >= 0.5
  AND p.brand IN ('Apple', 'Huawei')
  AND p.price < 10000
LIMIT 5

Automatic Knowledge Base Updates

MindsDB can keep a Knowledge Base in sync with its source using the CREATE JOB mechanism (similar to Oracle jobs or MySQL event schedulers). Jobs can synchronize:

Different data sources.

Data source ↔ Knowledge Base.

Data source ↔ Model (re‑training).

For near‑real‑time sync, a Trigger can be used where the underlying database supports it.
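As a sketch, a job that periodically pushes new source rows into the Knowledge Base could look like this. The hourly schedule and the `LAST` keyword (which restricts each run to rows added since the previous run) follow MindsDB's job syntax, though exact behavior may vary by version:

```sql
create job sync_sales_kb (
  insert into my_sales_question_kb
  select * from files.sales_questions
  where id > LAST
)
every 1 hour
```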

Data Agents (SQL Agents)

A Data Agent is a LangChain‑based service that combines a Knowledge Base, relational tables, and an LLM to answer natural‑language questions. It is defined with CREATE AGENT and configured with the target model, API key, knowledge bases, tables, and a prompt template.

CREATE AGENT my_agent
USING
  engine = 'openai',
  model = 'gpt-4o',
  openai_api_key = 'YOUR_OPENAI_API_KEY',
  knowledge_bases = ['my_kb'],
  tables = ['my_postgres.products','my_postgres.customers','my_postgres.orders','my_postgres.order_items'],
  prompt_template = '- Knowledge base my_kb: vector index of product descriptions
- Table my_postgres.products: structured product information
- Table my_postgres.customers: customer information
- Tables my_postgres.orders & order_items: order history'

Query the agent with SQL:

SELECT answer FROM my_agent
WHERE question = 'YOUR_QUESTION';

The agent first performs a semantic lookup in the Knowledge Base, then executes any required SQL joins, and finally feeds the retrieved context to the LLM to generate an answer.

Practical Recommendations

MindsDB acts as an AI‑enabled data middleware: it avoids data movement, lowers the barrier to using LLMs via SQL, and provides a unified interface for heterogeneous sources. It is well‑suited for enterprise environments with multiple legacy systems where a virtual AI data layer is needed.

Hybrid retrieval example

The Knowledge Base simplifies unstructured data indexing, the Job/Trigger system automates synchronization, and Data Agents enable low‑code conversational interfaces for business users.

Tags: SQL, AI, RAG, Vector Search, MindsDB, Data Agent
Written by

AI Large Model Application Practice

Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.
