How to Build a Multimodal RAG Pipeline for PPT Documents with Vision LLMs

This article explains a step‑by‑step implementation of a multimodal Retrieval‑Augmented Generation system that parses PPT/PDF files, extracts rich text and images with vision models, indexes them in a vector store, and generates answers that combine markdown and relevant slide screenshots.

Background and Goal

Retrieval‑Augmented Generation (RAG) over heterogeneous, richly formatted documents such as PowerPoint presentations requires handling text, annotations, charts, and images. This article describes a multimodal RAG pipeline built around a PPT deck from a Chinese LLM benchmark report, enabling combined text‑and‑image answers to slide‑level questions.

Repository

https://github.com/pingcy/multimodal_ppt_rag

Overall Solution and Tools

Document parsing: Doubao vision model or LlamaParse with vision enabled

Vector store: local Chroma

Embedding model: Alibaba Cloud Embedding‑V3

Generation model: Doubao vision model

Framework: LlamaIndex or LangChain (interchangeable)
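
As a concrete starting point, here is a minimal sketch of how the vector store and embedding model might be wired up, assuming the LlamaIndex Chroma and DashScope integrations; the storage path, collection name, and model-name string are illustrative, not taken from the repository:

import chromadb
from llama_index.core import Settings
from llama_index.embeddings.dashscope import DashScopeEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Local, persistent Chroma collection as the vector store
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("ppt_slides")
vector_store = ChromaVectorStore(chroma_collection=collection)

# Alibaba Cloud embedding model (Embedding-V3 via DashScope); the
# model-name string is an assumption for illustration
Settings.embed_model = DashScopeEmbedding(model_name="text-embedding-v3")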

Document Parsing and Indexing

Each PPT slide is rendered to an image (screenshot). The image is fed to a multimodal vision model, which returns a rich Markdown representation that captures text, tables, and chart descriptions—not just raw OCR output. The Markdown chunk and the image file path are stored together as a single node in the vector store, with the image path saved in the node metadata.
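
The indexing step could then look roughly like the following sketch. render_slides_to_images and slide_to_markdown are hypothetical helpers standing in for the screenshot and vision-parsing steps just described; TextNode and VectorStoreIndex are LlamaIndex primitives, and vector_store comes from the setup sketch above:

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.schema import TextNode

nodes = []
for i, image_path in enumerate(render_slides_to_images("benchmark.pptx")):
    # Hypothetical helper: vision model returns rich Markdown, not raw OCR
    markdown = slide_to_markdown(image_path)
    nodes.append(
        TextNode(
            text=markdown,
            # The slide screenshot travels with the chunk via metadata
            metadata={"image_path": image_path, "slide_number": i + 1},
        )
    )

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)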

Optional post‑processing with an LLM can enrich each Markdown chunk by generating a concise summary and five hypothetical questions for the slide.
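
A sketch of that enrichment step, using the generic LlamaIndex LLM complete interface; the prompt wording is illustrative, not the repository's:

ENRICH_PROMPT = (
    "Read the following slide content and produce:\n"
    "1. a one-sentence summary of the slide;\n"
    "2. five hypothetical questions this slide could answer.\n\n"
    "Slide content:\n{chunk}"
)

def enrich_node_text(llm, chunk: str) -> str:
    # Append the generated summary and questions to the original Markdown
    extra = llm.complete(ENRICH_PROMPT.format(chunk=chunk)).text
    return chunk + "\n\n" + extra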

Retrieval and Generation

At query time the pipeline performs the following steps:

Retrieve the most relevant nodes (chunks) from the vector store using the embedding model.

Extract the image_path metadata from each node to locate the corresponding slide screenshots.

Compose a prompt that concatenates the retrieved Markdown texts and references the images, then send the prompt and image list to the multimodal LLM.

Parse the LLM’s JSON response, which contains a Markdown answer and a list of image paths, and render the final answer as Markdown with embedded images.

Custom Query Engine (Python)

from llama_index.core import PromptTemplate
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.schema import MetadataMode

# DoubaoVisionLLM is the repository's custom vision-LLM wrapper;
# recursive_retrieve is its retrieval helper over the vector index
lvm = DoubaoVisionLLM(model_name="your_doubao_model_name")

class MultimodalQueryEngine(CustomQueryEngine):
    # Declared as fields, as CustomQueryEngine (a pydantic model) requires
    multi_modal_llm: DoubaoVisionLLM
    qa_prompt: PromptTemplate

    def custom_query(self, query_str: str):
        # Retrieve the associated chunks (nodes)
        nodes = recursive_retrieve(query_str)
        # Assemble the context, tagging each chunk with its source slide image
        context_str = "\n\n".join(
            r.get_content(metadata_mode=MetadataMode.LLM)
            + f"\n以上来自图片:{r.metadata['image_path']}"  # "The above comes from image: ..."
            for r in nodes
        )
        fmt_prompt = self.qa_prompt.format(context_str=context_str, query_str=query_str)
        # Generate the answer from the text context plus the slide screenshots
        response = self.multi_modal_llm.generate_response(
            prompt=fmt_prompt,
            image_paths=[n.metadata["image_path"] for n in nodes],
        )
        return response

# qa_prompt is the template shown under "Prompt Output Format" below
multi_query_engine = MultimodalQueryEngine(multi_modal_llm=lvm, qa_prompt=qa_prompt)

Prompt Output Format

{
  "response": "# your Markdown answer #",
  "image_path": ["# most relevant image path #"]
}
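
A prompt template in this spirit, filling the qa_prompt field the query engine above expects; the wording is a sketch, not the repository's exact template:

from llama_index.core import PromptTemplate

qa_prompt = PromptTemplate(
    "Answer the question using only the context below and the attached slide images.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Question: {query_str}\n\n"
    'Return strict JSON with two fields: "response" (a Markdown answer) and '
    '"image_path" (a list of the most relevant slide image paths).'
)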

Post‑processing Code

import json

# Parse the model's JSON reply into the answer and its source images
response_json = json.loads(response)
answer = response_json.get("response", "")
image_paths = response_json.get("image_path", [])

# Render the final answer as Markdown with the referenced slides embedded
markdown_output = f"### 答案:\n\n{answer}\n\n### 参考来源:\n"  # "Answer:" / "References:"
for image_path in image_paths:
    markdown_output += f"![Image]({image_path})\n"

Testing the Pipeline

# The query asks: "Which open-source models performed best in this evaluation?"
response = multi_query_engine.query("这次评测中表现最好的开源模型有哪些?")

from IPython.display import Markdown, display
display(Markdown(response.response))

Issues and Optimizations

Vision models occasionally misread ambiguous slide regions (e.g., dense charts or small text), so the extracted Markdown can contain recognition errors.

Multimodal generation is slower and consumes more tokens than pure‑text generation.

Retrieval accuracy can drop with larger PPT collections or vague queries; possible mitigations include finer chunking, hierarchical retrieval, or hybrid keyword‑vector indexing.

Metadata‑based pre‑filtering (see the sketch after this list), agentic RAG for different question types, and experimenting with alternative embedding or vision models can improve performance.

In production, store slide images on shared storage and reference them via URIs.
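
As one example of metadata‑based pre‑filtering, LlamaIndex retrievers accept metadata filters that narrow the candidate set before similarity search runs; the deck_name field here is a hypothetical piece of node metadata:

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Restrict retrieval to a single deck before vector search runs
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="deck_name", value="llm_benchmark_2024")]
)
retriever = index.as_retriever(similarity_top_k=5, filters=filters)
nodes = retriever.retrieve("best-performing open-source models")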

Conclusion

The multimodal RAG pipeline successfully answers questions about complex PPT documents by combining vision‑enhanced text extraction, vector similarity search, and image‑aware generation. The design highlights the importance of preserving slide‑level image context and provides a foundation for further refinements such as hierarchical retrieval, hybrid indexing, and agentic query handling.

Tags: Python, LLM, RAG, Multimodal, Vision
Written by: AI Large Model Application Practice, a team focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.
