How to Build a High‑Performance Enterprise RAG System with Model Context Protocol (MCP)

This article presents a step‑by‑step guide for constructing a scalable enterprise Retrieval‑Augmented Generation (RAG) solution using the Model Context Protocol (MCP), covering architecture comparison, system design, Milvus‑backed knowledge store, Python client implementation, deployment scripts, code examples, and best‑practice recommendations.


In the wave of enterprise digital transformation, managing internal knowledge assets efficiently has become a critical challenge. Retrieval‑Augmented Generation (RAG) built on large language models (LLMs) bridges corporate knowledge and AI capabilities, but traditional RAG pipelines often suffer from poor retrieval quality and difficulty keeping content up to date.

MCP vs. Traditional RAG

Limitations of Traditional RAG

Tightly Coupled Architecture: Retrieval logic and LLM calls are intertwined, making independent optimization hard.

Single Retrieval Strategy: Usually only vector search is used, lacking hybrid approaches.

Lack of Standardized Interfaces: Different implementations expose divergent APIs, preventing reuse.

High Maintenance Cost: System upgrades require extensive code changes.

Advantages of MCP‑Based Solution

Standardized Tool Calls: MCP defines a unified interface, reducing integration effort.

Decoupled Design: Model invocation is separated from business logic, enabling independent upgrades.

Flexible Extensibility: New data sources and modules (e.g., hybrid search, multimodal content) can be added easily.

Engineering‑Friendly: Aligns with software‑engineering best practices for team collaboration.

Tool‑Driven Implementation: All functionality (knowledge ingestion, retrieval, FAQ handling) is realized through prompts and LLM‑driven tool calls.

Project Background and Requirements

Modern enterprises face four main knowledge‑management pain points:

Knowledge Fragmentation: Documents are scattered across systems without a unified search entry.

Low Retrieval Efficiency: Keyword search cannot understand semantics, leading to inaccurate results.

Slow Knowledge Updates: Manual curation delays the reflection of the latest information.

High Usage Barrier: Technical jargon and complex query syntax hinder ordinary employees.

To address these issues, the system must satisfy four core requirements:

Intelligent Retrieval: Natural‑language queries that understand intent and context.

Automated Knowledge Processing: Automatic document chunking and FAQ extraction.

Flexible Expansion: Support for multiple data sources and model integrations.

Easy Deployment & Maintenance: Simple architecture that teams can quickly adopt and iterate on.

Project Goals

Technical Goals

Build MCP‑compliant knowledge‑store service and client.

Implement document chunking, FAQ extraction, and vector embedding.

Support complex query decomposition and hybrid retrieval.

Application Goals

Provide a unified knowledge‑base management and retrieval portal.

Achieve >90% retrieval accuracy for internal queries.

Reduce knowledge‑base maintenance workload by 70%.

Enable intelligent processing of all corporate documents.

System Design and Implementation

The design references alibabacloud-tablestore-mcp-server, which uses Tablestore and Java. For better extensibility, this implementation switches to Milvus for vector storage and rewrites the server and client in Python.

The MCP‑based RAG system consists of three core components:

Knowledge‑Store Service (MCP Server): Backend built on Milvus, responsible for document storage and vector retrieval.

MCP Client: Communicates with the server to perform knowledge ingestion and query operations.

LLM Integration: Handles document chunking, FAQ extraction, query decomposition, and answer generation.

Architecture diagram

Deployment of MCP Server

Prerequisites: Docker, Docker Compose, at least 4 CPU cores, 4 GB of RAM, and 20 GB of disk space.

# Enter project directory
cd mcp-rag

# Start Milvus and dependencies
docker compose up -d etcd minio standalone

# Create Python virtual environment
python -m venv env-mcp-rag
source env-mcp-rag/bin/activate

# Install dependencies
pip install -r requirements.txt

# Launch the server
python -m app.main

MCP Server Core API

The server exposes four tools:

storeKnowledge: Store raw documents into the knowledge store.

searchKnowledge: Perform similarity search on stored documents.

storeFAQ: Save extracted FAQ pairs into a dedicated FAQ store.

searchFAQ: Retrieve relevant FAQ entries.

Example implementation of storeKnowledge:

async def store_knowledge(self, content: str, metadata: Dict[str, Any] = None) -> Dict[str, Any]:
    """Store knowledge content to Milvus"""
    await self.ready_for_connections()
    try:
        knowledge_content = KnowledgeContent(content=content, metadata=metadata or {})
        self.milvus_service.store_knowledge(knowledge_content)
        return {"status": "success", "message": "Knowledge stored successfully"}
    except Exception as e:
        logger.error(f"Error storing knowledge: {e}")
        return {"status": "error", "message": str(e)}

RAG Client Implementation (MCP Client)

Key steps:

Knowledge‑Base Construction

Text chunking – ensure semantic completeness.

FAQ extraction – generate question‑answer pairs via LLM.

Vectorization – embed chunks and FAQs and store them in Milvus (a sketch follows the FAQ extraction prompt below).

Text chunking code (excerpt):

def _chunk_text(self, text: str) -> List[str]:
    """Split text into chunks while preserving semantics"""
    chunks = []
    if len(text) <= self.chunk_size:
        chunks.append(text)
        return chunks
    start = 0
    while start < len(text):
        end = start + self.chunk_size
        if end < len(text):
            sentence_end = max(
                text.rfind('. ', start, end),
                text.rfind('? ', start, end),
                text.rfind('! ', start, end)
            )
            if sentence_end > start:
                end = sentence_end + 1
        chunks.append(text[start:min(end, len(text))])
        start = end - self.chunk_overlap
        if start >= len(text) or start <= 0:
            break
    return chunks

FAQ extraction prompt (simplified):

system_prompt = """You are a knowledge‑extraction expert. Extract up to 10 FAQ items from the given text. Output a JSON array with \"question\" and \"answer\" fields only."""
user_prompt = f"""Extract FAQs from the following text:
```
{text}
```"""
response = self.llm_client.sync_generate(prompt=user_prompt, system_prompt=system_prompt, temperature=0.3)
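
The third step, vectorization, has no excerpt in the original. The sketch below shows one way chunks could be embedded and written to Milvus, assuming a sentence-transformers model and the pymilvus MilvusClient; the collection name, field names, and embedding model are assumptions, not the repository's actual choices.

from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings
client = MilvusClient(uri="http://localhost:19530")

# Quick-setup collection: an int64 primary key plus a vector field of the given dimension.
if not client.has_collection("knowledge"):
    client.create_collection(collection_name="knowledge", dimension=384)

def store_chunks(chunks: list[str]) -> None:
    """Embed each chunk and insert it together with its original text."""
    vectors = embedder.encode(chunks)
    rows = [
        {"id": i, "vector": vec.tolist(), "text": chunk}
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ]
    client.insert(collection_name="knowledge", data=rows)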

Question decomposition code:

async def _decompose_question(self, question: str) -> List[str]:
    """Break a complex question into simpler sub‑questions"""
    system_prompt = """You are a question‑analysis expert. Split the input into 2‑4 clear sub‑questions covering all aspects. Return a JSON array like [\"sub‑question1\", \"sub‑question2\"]."""
    user_prompt = f"""Decompose the following question:
{question}"""
    response = self.llm_client.sync_generate(prompt=user_prompt, system_prompt=system_prompt, temperature=0.3)
    # Parse JSON array from response
    ...
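
The elided parsing step is generic plumbing. The sketch below shows one common approach, extracting the first JSON-array-looking span and falling back to the original question on failure; it is not necessarily the repository's exact code.

import json
import re

def _parse_subquestions(response: str, question: str) -> list[str]:
    """Extract a JSON array of sub-questions; fall back to the original question."""
    match = re.search(r"\[.*\]", response, re.DOTALL)   # first bracketed span in the reply
    if match:
        try:
            items = json.loads(match.group(0))
            subs = [s.strip() for s in items if isinstance(s, str) and s.strip()]
            if subs:
                return subs
        except json.JSONDecodeError:
            pass
    return [question]   # degrade gracefully to the original question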

Context filtering (simplified):

async def _filter_context(self, question: str, context_items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Select the most relevant context items for the final answer"""
    seen = set()
    filtered = []
    faq_items = [i for i in context_items if i["type"] == "faq"]
    knowledge_items = [i for i in context_items if i["type"] == "knowledge"]
    for item in faq_items + knowledge_items:
        content = item.get("content")
        if content and content not in seen:
            seen.add(content)
            filtered.append(item)
        if len(filtered) >= 6:
            break
    return filtered
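
The retrieval step itself, where the client calls searchKnowledge and searchFAQ for each sub-question, is not excerpted above. A minimal sketch assuming the official MCP Python SDK over a stdio transport; the server launch command and the tools' argument names are assumptions.

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server = StdioServerParameters(command="python", args=["-m", "app.main"])

async def retrieve(sub_questions: list[str]) -> list[dict]:
    """Collect FAQ and knowledge hits for every sub-question."""
    context: list[dict] = []
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            for q in sub_questions:
                for tool, kind in (("searchFAQ", "faq"), ("searchKnowledge", "knowledge")):
                    result = await session.call_tool(tool, {"query": q})
                    # Each returned text block becomes one context item for filtering.
                    context.extend({"type": kind, "content": block.text}
                                   for block in result.content if hasattr(block, "text"))
    return context

# asyncio.run(retrieve(["What is RAG?", "How does MCP standardize tool calls?"]))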

Practical Demonstration

Build the knowledge base from a markdown file:

python -m app.main build --file test.md --title "RAG Basics" --author "Enterprise KB" --tags "LLM,RAG,KnowledgeBase"

Sample log output:

2025-05-11 14:50:16 | INFO | app.knowledge_builder:build_from_text:52 - Split text into 2 chunks
2025-05-11 14:50:59 | INFO | app.knowledge_builder:build_from_text:72 - Extracted 8 FAQs from text
2025-05-11 14:51:00 | INFO | __main__:build_knowledge_base:48 - Stored 2/2 chunks to knowledge base
2025-05-11 14:51:00 | INFO | __main__:build_knowledge_base:50 - Extracted and stored 8 FAQs

Query the system:

python -m app.main query --question "What advantages and drawbacks does RAG have compared to traditional enterprise knowledge bases?"

Result excerpt:

2025-05-11 15:01:46 | INFO | app.knowledge_retriever:query:39 - Decomposed question into 4 sub‑questions
2025-05-11 15:01:47 | INFO | app.knowledge_retriever:query:67 - Filtered 28 context items to 6

================================================================================
Question: What advantages and drawbacks does RAG have compared to traditional enterprise knowledge bases?
--------------------------------------------------------------------------------
Answer: Retrieval‑Augmented Generation (RAG) allows LLMs to dynamically access up‑to‑date internal knowledge, improving relevance, accuracy, and utility while keeping the model lightweight. It also introduces challenges such as system complexity, latency, and the need for robust retrieval pipelines.
================================================================================
Knowledge retrieval demo

Implementation Recommendations & Best Practices

Document Processing Strategy

Set chunk size to 1000-1500 characters with 200-300 characters of overlap (see the configuration sketch after this list).

Adjust chunking rules for technical vs. narrative documents.

Preserve original metadata (e.g., source, format) to improve retrieval precision.
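
Applied to the _chunk_text excerpt shown earlier, these ranges map directly onto the chunker's two attributes. The class name below is a guess, and the repository's actual defaults may differ.

class KnowledgeBuilder:
    """Chunker configuration following the recommended ranges above."""
    def __init__(self, chunk_size: int = 1200, chunk_overlap: int = 250):
        self.chunk_size = chunk_size        # 1000-1500 characters recommended
        self.chunk_overlap = chunk_overlap  # 200-300 characters recommended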

Retrieval Optimization Techniques

Employ hybrid search (semantic vectors + keyword matching); a generic fusion sketch follows this list.

Generate 2‑4 sub‑questions during decomposition.

Limit total context items to 5‑8 to avoid information overload.
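
A common way to fuse the vector and keyword result lists in a hybrid setup is reciprocal rank fusion (RRF). The sketch below is generic and independent of the repository.

from collections import defaultdict

def rrf_merge(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked id lists; earlier ranks contribute larger 1/(k+rank) scores."""
    scores: dict[str, float] = defaultdict(float)
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: ids ranked differently by the two retrievers.
# rrf_merge(["d3", "d1", "d7"], ["d1", "d9", "d3"])  ->  ["d1", "d3", "d9", "d7"]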

System Integration Tips

Choose an appropriate embedding model for the domain.

Design incremental indexing for real-time knowledge updates; a small upsert sketch follows this list.

Enable monitoring and logging to quickly detect failures.
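
Incremental indexing can be built on upserts keyed by a stable document ID, so re-ingesting a changed document overwrites its previous chunks. The sketch assumes the pymilvus MilvusClient and the hypothetical "knowledge" collection from the vectorization sketch earlier.

import hashlib

def upsert_chunk(client, embedder, doc_path: str, chunk_index: int, text: str) -> None:
    """Derive a stable int64 id from (document, chunk index) and upsert the embedding."""
    stable_id = int(hashlib.md5(f"{doc_path}#{chunk_index}".encode()).hexdigest()[:15], 16)
    client.upsert(
        collection_name="knowledge",
        data=[{"id": stable_id, "vector": embedder.encode(text).tolist(), "text": text}],
    )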

Conclusion & Outlook

Using MCP to build a RAG system resolves many pain points of traditional pipelines—tight coupling, single‑strategy retrieval, and high maintenance cost—while offering a standardized, extensible framework for enterprise knowledge management. Future directions include multimodal content support, real‑time knowledge sync mechanisms, and adaptive retrieval tuned by user feedback.

References

Model Context Protocol (MCP) official documentation – https://modelcontextprotocol.io/introduction

Attu, the visual client for the Milvus vector database – https://milvus.io/docs/zh/quickstart_with_attu.md

MCP‑RAG practical code repository – https://github.com/FlyAIBox/mcp-in-action/tree/rag_0.1.1/mcp-rag

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.