How to Build a Multimodal RAG Pipeline for PPT Documents with Vision LLMs
This article explains a step‑by‑step implementation of a multimodal Retrieval‑Augmented Generation system that parses PPT/PDF files, extracts rich text and images with vision models, indexes them in a vector store, and generates answers that combine markdown and relevant slide screenshots.
Background and Goal
Retrieval‑Augmented Generation (RAG) for heterogeneous, richly formatted documents such as PowerPoint presentations requires handling text, annotations, charts, and images. This summary describes a multimodal RAG pipeline built for a Chinese LLM benchmark PPT, enabling combined text‑and‑image answers to slide‑level questions.
Repository
https://github.com/pingcy/multimodal_ppt_rag
Overall Solution and Tools
Document parsing: Doubao vision model or LlamaParse with vision enabled
Vector store: local Chroma
Embedding model: Alibaba Cloud Embedding‑V3
Generation model: Doubao vision model
Framework: LlamaIndex or LangChain (interchangeable)
Document Parsing and Indexing
Each PPT slide is rendered to an image (screenshot). The image is fed to a multimodal vision model, which returns a rich Markdown representation that captures text, tables, and chart descriptions—not just raw OCR output. The Markdown chunk and the image file path are stored together as a single node in the vector store, with the image path saved in the node metadata.
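The per-slide indexing unit described above can be sketched as a plain data structure. This is an illustrative stand-in (the class and function names here are hypothetical; the repository uses LlamaIndex node objects), showing how the Markdown text and the screenshot path travel together:

```python
from dataclasses import dataclass, field

@dataclass
class SlideNode:
    """One vector-store entry: vision-model Markdown plus slide metadata."""
    text: str
    metadata: dict = field(default_factory=dict)

def build_slide_node(slide_no: int, markdown: str, image_dir: str = "slides") -> SlideNode:
    # The screenshot path is stored in metadata so retrieval can recover
    # the image alongside the text chunk later.
    return SlideNode(
        text=markdown,
        metadata={"image_path": f"{image_dir}/slide_{slide_no:03d}.png", "slide_no": slide_no},
    )
```

At index time only `text` is embedded; `metadata` rides along unembedded and is read back at query time.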
Optional post‑processing with an LLM can enrich each Markdown chunk by generating a concise summary and five hypothetical questions for the slide.
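A minimal sketch of such an enrichment prompt (the wording is illustrative, not the repository's actual prompt text):

```python
def enrichment_prompt(slide_markdown: str, n_questions: int = 5) -> str:
    # Ask the LLM for a summary plus hypothetical questions the slide could
    # answer; appending both to the chunk tends to improve recall at query time.
    return (
        "Below is the Markdown content of one presentation slide.\n"
        "1. Write a one-sentence summary of the slide.\n"
        f"2. Write {n_questions} hypothetical questions this slide could answer.\n\n"
        f"Slide content:\n{slide_markdown}"
    )
```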
Retrieval and Generation
At query time the pipeline performs the following steps:
Retrieve the most relevant nodes (chunks) from the vector store using the embedding model.
Extract the image_path metadata from each node to locate the corresponding slide screenshots.
Compose a prompt that concatenates the retrieved Markdown texts and references the images, then send the prompt and image list to the multimodal LLM.
Parse the LLM’s JSON response, which contains a Markdown answer and a list of image paths, and render the final answer as Markdown with embedded images.
Custom Query Engine (Python)
# Imports assume a recent LlamaIndex layout; DoubaoVisionLLM and
# recursive_retrieve are helpers defined elsewhere in the repository.
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.schema import MetadataMode

lvm = DoubaoVisionLLM(model_name='your_doubao_model_name')

class MultimodalQueryEngine(CustomQueryEngine):
    def custom_query(self, query_str: str):
        # Retrieve associated chunks (nodes)
        nodes = recursive_retrieve(query_str)
        # Assemble prompt with image references
        context_str = "\n\n".join(
            r.get_content(metadata_mode=MetadataMode.LLM)
            + f"\n以上来自图片:{r.metadata['image_path']}"  # "The above comes from image: ..."
            for r in nodes
        )
        fmt_prompt = self.qa_prompt.format(context_str=context_str, query_str=query_str)
        # Generate response with images
        response = self.multi_modal_llm.generate_response(
            prompt=fmt_prompt,
            image_paths=[n.metadata["image_path"] for n in nodes],
        )
        return response

multi_query_engine = MultimodalQueryEngine(multi_modal_llm=lvm)

Prompt Output Format
{
    "response": "# your Markdown answer #",
    "image_path": ["# most relevant image path #"]
}

Post‑processing Code
import json

response_json = json.loads(response)
answer = response_json.get("response", "")
image_paths = response_json.get("image_path", [])

# "答案" = "Answer", "参考来源" = "Reference sources"
markdown_output = f"### 答案:\n{answer}\n### 参考来源:\n"
for image_path in image_paths:
    # Embed each referenced slide image ("参考图片" = "reference image");
    # the image-embed line was mangled in the original and is reconstructed here.
    markdown_output += f"![参考图片]({image_path})\n"

Testing the Pipeline
from IPython.display import Markdown, display

# "Which open-source models performed best in this evaluation?"
response = multi_query_engine.query("这次评测中表现最好的开源模型有哪些?")
display(Markdown(response.response))

Issues and Optimizations
Vision models may produce occasional errors on ambiguous regions.
Multimodal generation is slower and consumes more tokens than pure‑text LLMs.
Retrieval accuracy can drop with larger PPT collections or vague queries; possible mitigations include finer chunking, hierarchical retrieval, or hybrid keyword‑vector indexing.
Metadata‑based pre‑filtering, agentic RAG for different question types, and experimenting with alternative embedding or vision models can improve performance.
In production, store slide images on shared storage and reference them via URIs.
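As an illustration of the hybrid keyword-vector idea mentioned above, a retrieval score could blend embedding similarity with keyword overlap. This is a sketch, not the repository's implementation; the blend weight `alpha` is an assumed default, not a tuned value:

```python
def hybrid_score(query_terms: set[str], doc_terms: set[str],
                 vector_sim: float, alpha: float = 0.7) -> float:
    # Blend vector similarity with keyword overlap (Jaccard similarity);
    # alpha=1.0 is pure vector search, alpha=0.0 is pure keyword matching.
    union = query_terms | doc_terms
    overlap = len(query_terms & doc_terms) / len(union) if union else 0.0
    return alpha * vector_sim + (1 - alpha) * overlap
```

Candidates would then be re-ranked by this blended score instead of vector similarity alone, which helps when the query contains rare exact terms (model names, benchmark labels) that embeddings may blur.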
Conclusion
The multimodal RAG pipeline successfully answers questions about complex PPT documents by combining vision‑enhanced text extraction, vector similarity search, and image‑aware generation. The design highlights the importance of preserving slide‑level image context and provides a foundation for further refinements such as hierarchical retrieval, hybrid indexing, and agentic query handling.
AI Large Model Application Practice
Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.