Build a PPT‑Powered RAG Engine with Visual Models and MCP Server
This article explains how to construct a Retrieval‑Augmented Generation (RAG) pipeline for multi‑page PPT documents by converting slides to images, extracting content with a vision model, indexing with LlamaIndex and Chroma, and exposing the functionality through an MCP Server with tools for adding, querying, and managing PPTs.
Overview
This article describes a complete Retrieval‑Augmented Generation (RAG) pipeline for PowerPoint (PPT) files. It extracts slide images, parses visual content with a vision model, indexes the resulting Markdown and images, and provides interactive query capabilities through an MCP Server.
System Architecture
The system consists of three components:

(Figure: overall framework)
MCP Server – exposes tools add_ppt, chat_with_ppt, delete_ppt, and index_status for managing PPT documents.
RAG engine – handles indexing and generation phases.
Performance testing – demonstrates end‑to‑end queries.
MCP Server Implementation
Each tool is registered with the @app.tool() decorator. The RAG engine is constructed once at server startup and shared with every tool through the lifespan context, which is what the ctx.request_context.lifespan_context lookups below rely on.
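A minimal sketch of that wiring, assuming the official FastMCP Python SDK; the AppContext and PPTRagEngine names are illustrative stand-ins, not the project's actual classes:

from collections.abc import AsyncIterator
from contextlib import asynccontextmanager
from dataclasses import dataclass

from mcp.server.fastmcp import Context, FastMCP

@dataclass
class AppContext:
    rag_engine: "PPTRagEngine"  # illustrative name for the RAG engine class

@asynccontextmanager
async def app_lifespan(server: FastMCP) -> AsyncIterator[AppContext]:
    # Build the engine once; every tool call reuses it via
    # ctx.request_context.lifespan_context.
    engine = PPTRagEngine()
    yield AppContext(rag_engine=engine)

app = FastMCP("ppt-rag", lifespan=app_lifespan)

With the server in place, the tool implementations look like this: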
import json
from typing import Optional

@app.tool()
async def add_ppt(ctx: Context, file_path: str, force_reprocess: bool = False) -> str:
    """Add the specified PPT document to the RAG index.

    Args:
        ctx: Context object
        file_path: Absolute or relative path to the PPT file
        force_reprocess: Re-process even if the document already exists

    Returns:
        JSON string with the operation result
    """
    try:
        rag_engine = ctx.request_context.lifespan_context.rag_engine
        result = await rag_engine.add_ppt_document(file_path, force_reprocess=force_reprocess)
        return json.dumps(result, indent=2, ensure_ascii=False)
    except Exception as e:
        return json.dumps({"error": str(e)})

@app.tool()
async def chat_with_ppt(ctx: Context, query: str, file_path: Optional[str] = None, doc_id: Optional[str] = None) -> str:
    """Answer a query against indexed PPT content, optionally scoped to one document."""
    try:
        rag_engine = ctx.request_context.lifespan_context.rag_engine
        result = await rag_engine.query(query, file_path=file_path, doc_id=doc_id)
        return json.dumps(result, indent=2, ensure_ascii=False)
    except Exception as e:
        return json.dumps({"error": str(e)})

RAG Engine Design
Indexing Phase
The indexing pipeline follows four steps:
PPT → Images: LibreOffice converts the PPT to PDF; Pdfium splits the PDF into per-slide PNG images.
Images → Markdown: A Doubao vision model parses each image using a detailed prompt and outputs Markdown that includes OCR text, tables, and descriptions.
Prepare Nodes: The Markdown is wrapped into TextNode objects with metadata (source file, page number, image path, document ID). Nodes are cached with pickle to avoid re-parsing. Steps 1–3 are sketched in the snippet after this list.
Embedding & Indexing: Nodes are embedded with an OpenAI embedding model, stored in a Chroma vector store, and persisted for fast reload, as in the second snippet below.
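A minimal sketch of steps 1 through 3, assuming soffice is on the PATH and using pypdfium2 for rendering; the function names and the parse_image_with_vision_model call are illustrative stand-ins for the project's actual code:

import pickle
import subprocess
from pathlib import Path

import pypdfium2 as pdfium
from llama_index.core.schema import TextNode

def ppt_to_slide_images(ppt_path: str, out_dir: str, scale: float = 2.0) -> list[str]:
    """Step 1: PPT -> PDF via LibreOffice, then PDF -> per-slide PNGs via Pdfium."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, ppt_path],
        check=True,
    )
    pdf_path = Path(out_dir) / (Path(ppt_path).stem + ".pdf")
    image_paths = []
    pdf = pdfium.PdfDocument(str(pdf_path))
    for i in range(len(pdf)):
        image_path = Path(out_dir) / f"slide_{i + 1}.png"
        pdf[i].render(scale=scale).to_pil().save(image_path)
        image_paths.append(str(image_path))
    return image_paths

def prepare_nodes(ppt_path: str, doc_id: str, image_paths: list[str], cache_file: str) -> list[TextNode]:
    """Steps 2-3: parse each slide image to Markdown and wrap it in a TextNode."""
    if Path(cache_file).exists():
        # pickle cache: skip the slow vision-model calls on re-runs
        return pickle.loads(Path(cache_file).read_bytes())
    nodes = []
    for page, image_path in enumerate(image_paths, start=1):
        markdown = parse_image_with_vision_model(image_path)  # step 2: Doubao vision call (stub)
        nodes.append(TextNode(
            text=markdown,
            metadata={"source_file": ppt_path, "page": page,
                      "image_path": image_path, "source_file_id": doc_id},
        ))
    Path(cache_file).write_bytes(pickle.dumps(nodes))
    return nodes

Step 4 then only has to embed the prepared nodes and insert them into the index: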
# Create an empty index if not present
if self._index is None:
    vector_store = self._initialize_vector_store()
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    self._index = VectorStoreIndex([], storage_context=storage_context, show_progress=False)

# Insert nodes and persist
self._index.insert_nodes(nodes)
self._persist_index()

Generation Phase
When a query arrives, the engine retrieves the top‑K most relevant nodes, fetches the associated slide images, and builds a prompt that combines the Markdown context and image references. The prompt is sent to the Doubao vision model, which generates a factual answer and cites the source slide and page number.
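A sketch of that multimodal call, assuming the Doubao model is reached through an OpenAI-compatible endpoint; the client configuration, helper name, and model name are assumptions:

import base64
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="...", base_url="...")  # endpoint and key are deployment-specific

async def generate_answer(prompt_text: str, image_paths: list[str]) -> str:
    # Combine the rendered prompt (Markdown context + query) with the retrieved slide images.
    content = [{"type": "text", "text": prompt_text}]
    for path in image_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    resp = await client.chat.completions.create(
        model="doubao-vision",  # placeholder model name
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

The default prompt template that fills {context_str} and {query_str}: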
default_prompt = """
以下是PPT幻灯片中解析的Markdown文本和图片信息。Markdown文本已经尝试将相关图表转换为表格。优先使用图片信息来回答问题。在无法理解图像时使用Markdown文本信息。
---------------------
{context_str}
---------------------
-- 根据上下文信息并且不依赖先验知识, 回答查询。
-- 解释你是从解析的markdown、还是图片中得到答案的, 如果有差异, 请说明最终答案的理由。
-- 详细回答问题。
-- 给出重点参考的图片路径和页码。
查询: {query_str}
答案: """Metadata filters (e.g., source_file_id) enable selective retrieval for a specific PPT.
filters = MetadataFilters(
    filters=[MetadataFilter(key="source_file_id", value=filter_doc_id, operator=FilterOperator.EQ)]
)
retriever = VectorIndexRetriever(index=self._index, similarity_top_k=self.top_k, filters=filters)

Testing and Demo
The MCP Server runs in Server‑Sent Events (SSE) mode. An interactive client can add PPTs, query index_status, and ask fact‑based questions. Responses include the answer and a reference such as “PPT name: slide 3”.
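For reference, a minimal interactive client using the MCP Python SDK over SSE; the server URL and the sample tool arguments are assumptions:

import asyncio
from mcp import ClientSession
from mcp.client.sse import sse_client

async def main() -> None:
    # URL depends on where the server is bound
    async with sse_client("http://localhost:8000/sse") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            print(await session.call_tool("index_status", {}))
            print(await session.call_tool("add_ppt", {"file_path": "slides/demo.pptx"}))
            print(await session.call_tool("chat_with_ppt", {"query": "What does slide 3 report?"}))

asyncio.run(main())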
Future Optimizations
Generate slide summaries or hypothetical questions to enrich the vector store.
Apply relevance re‑ranking and multi‑step retrieval for higher accuracy.
Introduce additional index types (e.g., SummaryIndex) for different query patterns.
Integrate Agentic RAG to allow multiple tool calls within a single query.
Improve performance with asynchronous batch calls and parallel processing.
Resources
Source code: https://github.com/pingcy/app_chatppt
AI Large Model Application Practice
Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.