Unlocking Modern AI Application Architecture: From RAG to Agents and MCP
This article surveys the evolution of AI applications, explains large language model fundamentals, outlines architectural challenges, and introduces three core patterns—Retrieval‑Augmented Generation (RAG), autonomous Agents, and Model Context Protocol (MCP)—while providing practical LangChain code snippets and integration guidance.
Large Model Application Architecture Basics
Artificial intelligence applications have progressed through several pivotal stages, each marking a major shift in technical paradigms.
AI Application Evolution Overview
Large language models (LLMs) are now the core component of modern AI solutions, but they possess distinct technical characteristics and capability boundaries that must be understood for effective architecture design.
Large Language Model Fundamentals
LLMs serve as the central engine of AI applications; grasping their strengths and limits is essential for building robust systems.
AI Application Architecture Challenges
Despite their power, LLM‑based systems face multiple architectural hurdles, including knowledge staleness, hallucinations, domain‑specific depth, transparency, and private‑knowledge integration.
Emerging Architectural Patterns
These challenges have given rise to three complementary patterns—Retrieval‑Augmented Generation (RAG), Agent‑based decision‑execution, and Model Context Protocol (MCP)—which together form a modern AI application architecture that overcomes the native limitations of LLMs.
Modern AI Application Architecture Framework
The framework is multi‑layered and modular, consisting of the following key tiers:
Document Processing System
Embedding Model
Vector Store
Retrieval Augmentation System
Generative Model
Post‑processing System
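The tiers above compose into a single pipeline. As an orientation aid, here is a toy sketch of that composition; every function is a deliberately simplified placeholder (the "embedding" and "retrieval" are length-based stand-ins, not real semantic search), not an API from any specific framework.

```python
# Minimal sketch of the layered pipeline; all functions are illustrative placeholders.
def process_documents(raw_docs):
    # Document Processing System: clean and drop empty chunks
    return [d.strip() for d in raw_docs if d.strip()]

def embed(chunks):
    # Embedding Model: map chunks to vectors (toy length-based stand-in)
    return {c: [float(len(c))] for c in chunks}

def retrieve(index, query, top_k=2):
    # Vector Store + Retrieval Augmentation: rank chunks by a toy distance
    return sorted(index, key=lambda c: abs(len(c) - len(query)))[:top_k]

def generate_answer(query, context):
    # Generative Model: a template here; a real system calls an LLM
    return f"Q: {query}\nContext: {'; '.join(context)}"

def postprocess(answer):
    # Post-processing System: e.g. append citations, filter content
    return answer + "\n[sources attached]"

chunks = process_documents(["  What is RAG?  ", "", "RAG combines retrieval and generation."])
index = list(embed(chunks).keys())
answer = postprocess(generate_answer("What is RAG?", retrieve(index, "What is RAG?")))
print(answer)
```

A production system swaps each placeholder for a real component (embedding model, vector database, LLM), but the data flow stays the same.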
Subsequent sections dive deeper into RAG, Agent, and MCP.
RAG
Basic Concept
RAG (Retrieval‑Augmented Generation) combines retrieval and generation, pulling relevant information from external knowledge bases to supplement LLM knowledge, thereby producing more accurate and up‑to‑date responses.
Problems Solved by RAG
Knowledge update: connects to real‑time external sources.
Model hallucination: provides factual grounding.
Domain expertise: accesses specialized corpora.
Transparency & traceability: reveals source documents.
Private knowledge: enables proprietary knowledge bases.
Core Components
Document Processing System : cleans, chunks, extracts metadata, and normalizes raw documents. Tools include LangChain loaders, LlamaIndex parsers, Unstructured, PyPDF2, NLTK, spaCy.
Embedding Model : converts text to dense vectors for semantic search. Options: OpenAI text‑embedding‑ada, Cohere Embed, BAAI/bge‑large, Jina embeddings.
Vector Store : stores vectors and provides similarity search. Options: Pinecone, Weaviate, Milvus, ChromaDB, FAISS, Qdrant.
Retrieval Augmentation System : query rewriting, hybrid retrieval, re‑ranking (HyDE, Cohere Rerank, semantic routing).
Generative Model : generates answers using retrieved context. Options: OpenAI GPT‑4, Anthropic Claude, Cohere Command, open‑source Mistral, Llama 3, DeepSeek.
Post‑processing System : fact‑checking, citation, formatting, hallucination detection, content filtering.
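As a small illustration of the post-processing stage, the sketch below attaches numbered source citations to a generated answer so results stay traceable; the function and field names are invented for illustration, not from any particular library.

```python
def attach_citations(answer: str, sources: list) -> str:
    # Post-processing: append numbered citations so answers are traceable
    if not sources:
        return answer
    citations = "\n".join(
        f"[{i + 1}] {s['title']} ({s['url']})" for i, s in enumerate(sources)
    )
    return f"{answer}\n\nSources:\n{citations}"

result = attach_citations(
    "RAG grounds answers in retrieved documents.",
    [{"title": "Internal RAG guide", "url": "https://example.com/rag"}],
)
print(result)
```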
Building RAG Quickly with LangChain
LangChain simplifies connecting LLMs with external data sources and provides reusable components for the entire RAG pipeline.
```python
import sqlite3

import docx
import jieba
import requests
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

DB_FILE = "documents.db"  # SQLite database with an FTS5 table named "documents"

def read_word_document(file_path):
    # Document processing: extract non-empty paragraphs from a .docx file
    doc = docx.Document(file_path)
    return "\n".join(p.text.strip() for p in doc.paragraphs if p.text.strip())

def split_text(text, chunk_size=100, chunk_overlap=10):
    # Chunking: split on newlines and Chinese sentence-ending punctuation
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, separators=["\n", "。", "?", "!"]
    )
    return splitter.split_text(text)

def load_embedding_model(model_name="moka-ai/m3e-base"):
    # Embedding model: normalized sentence embeddings on CPU
    return HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs={"device": "cpu"},
        encode_kwargs={"normalize_embeddings": True},
    )

def store_to_vector_db(docs, db_path="faiss_index"):
    # Vector store: embed documents and persist a FAISS index locally
    vector_db = FAISS.from_documents(docs, load_embedding_model())
    vector_db.save_local(db_path)

def load_vector_db(db_path="faiss_index"):
    # Reload a persisted FAISS index with the same embedding model
    return FAISS.load_local(db_path, load_embedding_model(), allow_dangerous_deserialization=True)

def search_similar_texts(query, vector_db, top_k=3):
    # Semantic retrieval: top-k nearest chunks by vector similarity
    return [r.page_content for r in vector_db.similarity_search(query, k=top_k)]

def fulltext_search(query):
    # Keyword retrieval: BM25 over SQLite FTS5, tokenized with jieba
    conn = sqlite3.connect(DB_FILE)
    cursor = conn.cursor()
    match_expr = " OR ".join(jieba.cut(query))
    cursor.execute(
        "SELECT ori, bm25(documents) AS score FROM documents "
        "WHERE content MATCH ? ORDER BY score DESC LIMIT 3",
        (match_expr,),
    )
    results = cursor.fetchall()
    conn.close()
    return [item[0] for item in results]

def query_knowledge_base(user_query, index_path="faiss_index"):
    # Hybrid retrieval: simple concatenation of vector and keyword hits
    keyword_results = fulltext_search(user_query)
    vector_results = search_similar_texts(user_query, load_vector_db(index_path))
    return vector_results + keyword_results

def generate(prompt: str, model: str = "deepseek-r1:1.5b") -> str:
    # Generation: call a local Ollama server, returning the full non-streamed response
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.7, "num_predict": 8192},
    }
    response = requests.post(url, json=payload)
    return response.json()["response"]
```

Agent
Basic Concept
An Agent is an autonomous software entity that perceives its environment, makes decisions, and takes actions to achieve specific goals, often following a ReAct (Reason‑Act‑Observe) loop.
Core Components
Reasoning Engine (LLM)
Tool Set (APIs)
Memory (interaction history)
Planner (task decomposition)
Executor (tool invocation)
Observer (result parsing)
Prompt Templates
Feedback Loop (strategy adjustment)
Agent Execution Flow (ReAct)
The user query is interpreted by the LLM, the planner splits the task, the LLM generates action commands, the executor calls tools, the observer feeds results back, and the feedback loop refines the plan until a final answer is produced.
```python
# Core loop of an Agent class; collaborators (memory, planner, executor,
# observer, feedback_loop, llm_engine) are defined elsewhere in the class.
def run(self, task: str) -> str:
    # Memory: record the incoming task
    self.memory.add_message("user", task)
    # Planner: decompose the task into steps
    plan = self.planner.create_plan(task)
    self.memory.save_state("plan", plan)
    completed_steps = []
    for step in plan:
        step_id = step["step_id"]
        description = step["description"]
        tool_name = step.get("tool")
        print(f"Executing step {step_id}: {description}")
        if tool_name:
            # LLM decides how to use the tool
            system_msg = self.system_prompt.format(tools_description=self._format_tools_description())
            messages = [
                {"role": "system", "content": system_msg},
                {"role": "user", "content": f"Please help me with this step: {description}. Use a tool if needed."},
            ]
            response = self.llm_engine.generate(messages)
            self.memory.add_message("assistant", response)
            tool_calls = self._parse_tool_calls(response)
            for tool_call in tool_calls:
                try:
                    result = self.executor.execute_tool(tool_call["tool_name"], **tool_call["parameters"])
                    observation = self.observer.process_result(description, result)
                    step_result = {
                        "step_id": step_id,
                        "description": description,
                        "tool_used": tool_call["tool_name"],
                        "parameters": tool_call["parameters"],
                        "result": result,
                        "observation": observation,
                    }
                    completed_steps.append(step_result)
                    self.memory.add_message("system", f"Tool execution result: {result}")
                    # Feedback loop: re-evaluate the remaining plan
                    done_ids = [cs["step_id"] for cs in completed_steps]
                    remaining_steps = [s for s in plan if s["step_id"] not in done_ids]
                    feedback = self.feedback_loop.evaluate_and_adjust(task, completed_steps, observation, remaining_steps)
                    if feedback.get("needs_adjust", False):
                        plan = list(completed_steps) + feedback.get("new_plan", [])
                        self.memory.save_state("plan", plan)
                        print("Plan adjusted")
                except Exception as e:
                    error_msg = f"Error executing step {step_id}: {e}"
                    print(error_msg)
                    self.memory.add_message("system", error_msg)
        else:
            completed_steps.append({"step_id": step_id, "description": description, "completed": True})
    # Final summary generation
    summary_prompt = (
        f"You helped the user complete the task: {task}. Completed steps:\n"
        f"{json.dumps(completed_steps, ensure_ascii=False, indent=2)}\n"
        "Provide a concise summary."
    )
    summary = self.llm_engine.generate([{"role": "user", "content": summary_prompt}])
    self.memory.add_message("assistant", summary)
    return summary
```

Model Context Protocol (MCP)
Basic Concept
MCP standardizes how LLMs interact with external data sources, services, and tools, enabling structured access to contextual information beyond the model's internal knowledge.
Core Components
MCP Host (runtime manager)
MCP Server (tool registration & request handling)
MCP Client (LLM integration layer)
Tool Provider (implements specific tools)
LLM Integration Layer
MCP vs. Function Call
Function Call focuses on generating structured parameters for predefined functions, while MCP provides a full ecosystem for dynamic tool discovery, registration, and execution.
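To make the contrast concrete, here is a hedged sketch: a Function Call setup binds the model to schemas fixed at build time, while an MCP-style client can enumerate whatever a server currently exposes. The registry and schemas below are invented for illustration, not from any real SDK.

```python
# Function Call style: the tool schema is hard-coded at build time.
FUNCTION_SCHEMAS = [
    {"name": "get_weather", "parameters": {"location": "string", "date": "string"}}
]

# MCP style: tools live in a server-side registry discovered at runtime.
TOOL_REGISTRY = {
    "get_weather": {"endpoint": "/api/weather", "params": ["location", "date"]},
    "create_note": {"endpoint": "/api/notes", "params": ["content"]},
}

def discover_tools(registry: dict) -> list:
    # An MCP client enumerates whatever the server currently exposes
    return sorted(registry)

def add_tool(registry: dict, name: str, endpoint: str, params: list) -> None:
    # New tools become visible to clients without rebuilding them
    registry[name] = {"endpoint": endpoint, "params": params}

print(discover_tools(TOOL_REGISTRY))
add_tool(TOOL_REGISTRY, "search_code", "/api/search", ["query"])
print(discover_tools(TOOL_REGISTRY))
```

With static function calling, adding `search_code` would require shipping a new schema to every client; with discovery, clients simply pick it up on the next `list_tools` call.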
Integrating MCP into AI Platforms (Example: Cursor)
Configure a custom MCP server in Cursor to call a GitHub API or a local weather‑forecast tool.
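For a local stdio server such as the FastMCP weather example in the next section, the Cursor-side configuration is typically a small JSON file (commonly `.cursor/mcp.json`); the path and server name below are placeholders, and the exact schema may vary between Cursor versions.

```json
{
  "mcpServers": {
    "weather": {
      "command": "python",
      "args": ["/path/to/weather_server.py"]
    }
  }
}
```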
Building a Simple MCP Server (FastMCP)
```python
import subprocess

from mcp.server.fastmcp import FastMCP

# Initialize FastMCP server
mcp = FastMCP("weather")

@mcp.tool()
async def get_forecast(latitude: float, longitude: float) -> str:
    """Fetch weather forecast for a location"""
    # Mock implementation; a real tool would call a weather API here
    return f"Sunny, 22°C at ({latitude}, {longitude})"

@mcp.tool()
async def create_note(content: str) -> str:
    """Create a new note with the given content (macOS Notes via AppleScript)"""
    subprocess.run(['open', '-a', 'Notes'])
    applescript = f'''\
tell application "Notes"
    activate
    make new note at folder "Notes" with properties {{body:"{content}"}}
end tell
'''
    subprocess.run(['osascript', '-e', applescript])
    return "Note created successfully"

if __name__ == "__main__":
    mcp.run(transport='stdio')
```

HTTP‑Based MCP Example (FastAPI)
```python
from typing import Any, Dict

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class WeatherRequest(BaseModel):
    location: str
    date: str

def generate_mock_weather(location: str, date: str) -> Dict[str, Any]:
    # Placeholder data; swap in a real weather API for production use
    return {"location": location, "date": date, "forecast": "sunny", "temperature": 22}

async def get_weather_data(location: str, date: str) -> Dict[str, Any]:
    return generate_mock_weather(location, date)

@app.post("/api/weather")
async def get_weather(request: WeatherRequest):
    data = await get_weather_data(request.location, request.date)
    return {"status": "success", "data": data}

@app.get("/api/list_tools")
async def list_tools():
    # Tool manifest that LLM clients fetch to discover available tools
    tools = [{
        "name": "get_weather",
        "description": "Get weather forecast for a specific location and date",
        "endpoint": "/api/weather",
        "method": "POST",
        "params": [
            {"name": "location", "type": "string", "description": "The location to get weather for"},
            {"name": "date", "type": "string", "description": "The date to get weather for"},
        ],
    }]
    return {"status": "success", "tools": tools}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

Dynamic Tool Loading for LLMs
```python
import asyncio
from typing import Any, Callable, Dict, List

import httpx
from langchain_core.tools import Tool

MCP_SERVER_URL = "http://localhost:8000"

async def fetch_tools_from_server() -> List[Dict[str, Any]]:
    # Discover available tools from the MCP server's manifest endpoint
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{MCP_SERVER_URL}/api/list_tools")
        return response.json()["tools"]

def create_dynamic_tool_executor(tool_info: Dict[str, Any]) -> Callable:
    async def execute_api_call(*args, **kwargs):
        # Map positional and keyword arguments onto the declared parameters
        payload = {}
        param_names = [p["name"] for p in tool_info["params"]]
        for i, arg in enumerate(args):
            if i < len(param_names):
                payload[param_names[i]] = arg
        for k, v in kwargs.items():
            if k in param_names:
                payload[k] = v
        async with httpx.AsyncClient() as client:
            is_post = tool_info["method"] == "POST"
            method = client.post if is_post else client.get
            response = await method(
                f"{MCP_SERVER_URL}{tool_info['endpoint']}",
                json=payload if is_post else None,
                params=None if is_post else payload,
            )
        if response.status_code == 200:
            return response.json()["data"]
        raise Exception(f"API call failed: {response.status_code} {response.text}")

    def sync_executor(*args, **kwargs):
        # LangChain Tool objects expect a synchronous callable
        return asyncio.run(execute_api_call(*args, **kwargs))

    sync_executor.__name__ = tool_info["name"]
    sync_executor.__doc__ = tool_info["description"]
    return sync_executor

def create_tools_from_server_data(tool_data: List[Dict[str, Any]]) -> List[Tool]:
    # Wrap each discovered tool as a LangChain Tool
    return [
        Tool(name=info["name"], func=create_dynamic_tool_executor(info), description=info["description"])
        for info in tool_data
    ]
```

Acknowledgments
This article builds on numerous open‑source AI projects—including LangChain, Ollama, Open WebUI, Dify, and others—that have significantly advanced the AI tooling ecosystem. Sincere thanks go to the developers and communities behind these contributions.
Didi Tech
Official Didi technology account