How to Build an Agentic RAG System from Scratch Using MCP Architecture

This article walks through the design and full implementation of an Agentic Retrieval‑Augmented Generation (RAG) system built on the MCP standard, covering the conceptual fusion of MCP and RAG, server‑side tool creation with LlamaIndex, client‑side agent construction with LangGraph, configuration files, caching strategies, code examples, and an end‑to‑end demonstration.

Introduction

During the May Day holiday, the author built a complete Agentic RAG system from scratch on the MCP architecture to demonstrate how MCP, RAG, and agents can be combined in practice.

Conceptual Fusion of MCP and Agentic RAG

RAG (Retrieval‑Augmented Generation) supplies external knowledge to LLMs, while MCP (Model Context Protocol) standardizes how external tools are exposed to them. Both aim to extend model capabilities, but they do so differently: MCP offers tools (e.g., a calculator), whereas RAG supplies reference material (e.g., a book). The two are complementary and can be combined without conflict.

Typical Agentic RAG Scenario

A typical Agentic RAG application may need to answer factual, summarization, and multi‑document fusion questions, sometimes invoking a search engine for additional information.

Architecture Overview

The system is divided into an MCP Server that implements the RAG pipeline (using LlamaIndex) and an MCP Client that implements the agent (using LangGraph). The client and server communicate via either SSE or stdio modes, following a clear client/server responsibility split.
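
To make the split concrete, a minimal server entry point could look like the sketch below. It assumes the MCP Python SDK's FastMCP class, which matches the @app.tool() decorators used in the tool examples later; the server name and port are illustrative rather than taken from the article.

from mcp.server.fastmcp import FastMCP

# Server name and port are placeholders; adjust to match your deployment.
app = FastMCP("rag-server", port=5050)

# RAG tools such as create_vector_index and query_document are registered on
# this app with @app.tool() decorators (see the examples below).

if __name__ == "__main__":
    # "sse" exposes an HTTP/SSE endpoint for remote clients;
    # "stdio" runs the server over standard input/output for local use.
    app.run(transport="sse")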

Server‑Side Tools

create_vector_index – creates or loads a document vector index with caching.

query_document – queries factual information from a vector index.

get_document_summary – answers summarization queries using a LlamaIndex SummaryIndex (a sketch of this tool follows the list).

list_indies – lists the available indexes; the server also exposes auxiliary tools such as a simple web‑search tool.
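
Of these, only create_vector_index and query_document are reproduced in full below. For orientation, get_document_summary might be structured roughly as follows, assuming LlamaIndex's SummaryIndex; the body, and the load_cached_nodes helper in particular, are assumptions rather than the author's implementation.

@app.tool()
async def get_document_summary(ctx: Context, index_name: str, query: str = "Summarize this document") -> str:
    """Summarize a document via a SummaryIndex built from its cached nodes (sketch)."""
    from llama_index.core import SummaryIndex
    # Assumes the nodes for this index were cached earlier by create_vector_index.
    nodes = load_cached_nodes(index_name)  # hypothetical helper, not from the article
    summary_index = SummaryIndex(nodes)
    # tree_summarize aggregates all nodes instead of retrieving only top-k matches.
    query_engine = summary_index.as_query_engine(response_mode="tree_summarize")
    response = await query_engine.aquery(query)
    return str(response)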

Caching Mechanism

To avoid repeated parsing and indexing, the server caches document node splits and index metadata. Cache keys are derived from a hash of the document content combined with chunk size and overlap parameters. Re‑creation occurs only when the client forces it, the cache is missing, or the document has changed.
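
The hashing helper itself is not shown in the article, but the get_cache_path function referenced in create_vector_index below could be implemented roughly as follows; the cache directory name and hash algorithm are assumptions.

import hashlib
import os

CACHE_DIR = "cache"  # assumed location; the article does not name the directory

def get_cache_path(file_path: str, chunk_size: int, chunk_overlap: int) -> str:
    """Derive a cache file path from the document content hash plus chunk parameters."""
    with open(file_path, "rb") as f:
        content_hash = hashlib.md5(f.read()).hexdigest()
    key = f"{content_hash}_{chunk_size}_{chunk_overlap}"
    return os.path.join(CACHE_DIR, f"{key}.pkl")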

Tool Implementation Example: create_vector_index

@app.tool()
async def create_vector_index(
    ctx: Context,
    file_path: str,
    index_name: str,
    chunk_size: int = 500,
    chunk_overlap: int = 50,
    force_recreate: bool = False,
) -> str:
    """Create or load a document vector index (using cached nodes).
    Args:
        ctx: context object
        file_path: path to the document
        index_name: name of the index
        chunk_size: size of text chunks
        chunk_overlap: overlap between chunks
        force_recreate: whether to force re‑creation
    Returns:
        Description of the operation result
    """
    storage_path = f"{storage_dir}/{index_name}"
    # Determine if recreation is needed
    need_recreate = (
        force_recreate or
        not os.path.exists(storage_path) or
        not os.path.exists(get_cache_path(file_path, chunk_size, chunk_overlap))
    )
    if os.path.exists(storage_path) and not need_recreate:
        return f"Index {index_name} already exists, no need to create"
    # Delete existing collection if any
    try:
        chroma.delete_collection(name=index_name)
    except Exception as e:
        logger.warning(f"Error deleting collection (may be first creation): {e}")
    collection = chroma.get_or_create_collection(name=index_name)
    vector_store = ChromaVectorStore(chroma_collection=collection)
    # Load and split document
    nodes = await load_and_split_document(ctx, file_path, chunk_size, chunk_overlap)
    logger.info(f"Loaded {len(nodes)} nodes")
    # Build vector index
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    vector_index = VectorStoreIndex(nodes, storage_context=storage_context, embed_model=embedded_model)
    vector_index.storage_context.persist(persist_dir=storage_path)
    return f"Successfully created index: {index_name}, containing {len(nodes)} nodes"
    # Exception handling omitted for brevity
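
create_vector_index delegates parsing and chunking to load_and_split_document, which the article does not reproduce. A minimal sketch of what it might do, assuming LlamaIndex's SimpleDirectoryReader and SentenceSplitter plus the node cache described earlier (the caching details are assumptions):

import os
import pickle

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

async def load_and_split_document(ctx, file_path: str, chunk_size: int, chunk_overlap: int):
    """Load a document, split it into nodes, and cache the nodes for reuse (sketch)."""
    cache_path = get_cache_path(file_path, chunk_size, chunk_overlap)
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)  # reuse previously split nodes
    documents = SimpleDirectoryReader(input_files=[file_path]).load_data()
    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    nodes = splitter.get_nodes_from_documents(documents)
    with open(cache_path, "wb") as f:
        pickle.dump(nodes, f)
    return nodes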

Tool Implementation Example: query_document

@app.tool()
async def query_document(
    ctx: Context,
    index_name: str,
    query: str,
    similarity_top_k: int = 5,
) -> str:
    """Query factual information from a document.
    Args:
        ctx: context object
        index_name: name of the index
        query: query text
        similarity_top_k: number of similar nodes to return
    Returns:
        Query result string
    """
    # Implementation omitted for brevity
    ...
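
The body is elided in the article. One plausible way to fill it in, assuming the index was persisted by create_vector_index into a Chroma collection of the same name, is sketched below; this is an assumption, not the author's code.

# Possible body for query_document (sketch, inserted into the function above)
collection = chroma.get_or_create_collection(name=index_name)
vector_store = ChromaVectorStore(chroma_collection=collection)
index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embedded_model)
query_engine = index.as_query_engine(similarity_top_k=similarity_top_k)
response = await query_engine.aquery(query)
return str(response)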

Client Configuration

The client uses two JSON files:

mcp_config.json – defines server connections, transport mode (e.g., "sse"), URLs, and allowed tools.

doc_config.json – lists documents to be indexed, their descriptions, index names, chunk sizes, and overlaps.

An example mcp_config.json:

{
  "servers": {
    "rag_server": {
      "transport": "sse",
      "url": "http://localhost:5050/sse",
      "allowed_tools": [
        "load_and_split_document",
        "create_vector_index",
        "get_document_summary",
        "query_document"
      ]
    }
    // ...other servers
  }
}
An example doc_config.json:

{
  "data/c-rag.pdf": {
    "description": "c‑rag technical paper",
    "index_name": "c-rag",
    "chunk_size": 500,
    "chunk_overlap": 50
  },
  "data/questions.csv": {
    "description": "Tax question dataset",
    "index_name": "tax-questions",
    "chunk_size": 500,
    "chunk_overlap": 50
  }
  // ...other documents
}

Main Program Flow

The client creates a MultiServerMCPClient from mcp_config.json, connects to the server, builds an AgenticRAGLangGraph instance, processes files to create vector indexes, builds the LangGraph agent, and finally enters an interactive REPL.

client = MultiServerMCPClient.from_config('mcp_config.json')
async with client as mcp_client:
    logger.info(f"Connected to MCP servers: {', '.join(mcp_client.get_connected_servers())}")
    rag = AgenticRAGLangGraph(client=mcp_client, doc_config=doc_config)
    await rag.process_files()          # create indexes, deduplicate
    await rag.build_agent()            # create ReAct agent
    await rag.chat_repl()              # interactive chat
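
The process_files step is not shown in detail in the article. A rough sketch of what it might look like, assuming the MCP tools are exposed to the client as LangChain tool objects with an ainvoke method (the tool lookup and call arguments here are assumptions):

async def process_files(self) -> None:
    """Create a vector index for each configured document (sketch)."""
    tools = await self.client.get_tools_for_langgraph()
    create_index = next(t for t in tools if t.name == "create_vector_index")
    for file_path, cfg in self.doc_config.items():
        result = await create_index.ainvoke({
            "file_path": file_path,
            "index_name": cfg["index_name"],
            "chunk_size": cfg.get("chunk_size", 500),
            "chunk_overlap": cfg.get("chunk_overlap", 50),
        })
        logger.info(result)  # e.g. "Index ... already exists" when the cache is reused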

Agent Construction (build_agent)

async def build_agent(self) -> None:
    mcp_tools = await self.client.get_tools_for_langgraph()
    self.agent = create_react_agent(
        model=llm,
        tools=mcp_tools,
        prompt=SYSTEM_PROMPT.format(
            doc_info_str=doc_info_str,
            current_time=datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        ),
    )
    logger.info("===== Agent construction completed =====")

End‑to‑End Demonstration

The article outlines three test steps:

1. Start the MCP RAG‑Server (SSE mode) – the server prints its tool list on startup.

2. Place knowledge documents in the data/ directory, configure mcp_config.json and doc_config.json, then run the client script python rag_agent_langgraph.py. The first run creates the vector indexes; subsequent runs reuse the cached indexes.

3. Enter the interactive REPL and issue queries such as:

Cross‑document factual queries that trigger multiple RAG pipelines and a web‑search fallback.

Summarization queries that use a SummaryIndex.

Natural‑language commands to rebuild an index (e.g., re‑create the CSV index), demonstrating tool‑driven index management.

Each step includes screenshots (omitted here) showing logs and responses.

Observations and Conclusions

MCP enforces modular, loosely‑coupled design, improving maintainability, extensibility, and deployment flexibility.

The architecture is stack‑agnostic: the server can use LlamaIndex while the client uses LangGraph, and different languages could be swapped in.

Standardized module interaction enables other developers to build agents on top of an existing RAG server without needing to understand its internal implementation.

Future work includes parallel processing for large document collections, progress reporting, multimodal parsing, and further performance optimizations.

Tags: Python, LLM, MCP, LangGraph, LlamaIndex, Agentic RAG
Written by AI Large Model Application Practice

Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.
