Mastering Multimodal RAG: From PDF Parsing to Advanced Query Rewriting

This article explains how to handle complex multimodal PDFs in RAG systems, outlines extraction, indexing, and multimodal model integration, details four query‑rewriting strategies (HyDE, stepwise, sub‑question, backward), and presents key evaluation metrics and tools for assessing RAG performance.


Multimodal Document Processing

Enterprise knowledge often resides in semi‑structured or unstructured formats such as mixed‑content PDFs. Effective RAG pipelines first use third‑party PDF parsers to separate text, tables, and images. Text is embedded directly; tables are converted to descriptive summaries via LLMs; images are processed with multimodal vision models (e.g., Qwen‑VL, GPT‑4V) combined with OCR, either extracting pure text or generating image summaries.
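
A rough sketch of that extraction step is shown below, using PyMuPDF to separate a mixed-content PDF into text, tables, and raw images. The library choice, the version-dependent table detection, and the chunking omitted here are illustrative assumptions, not a prescribed toolchain.

```python
# Splitting a mixed-content PDF into text, tables, and images (sketch).
import fitz  # PyMuPDF: pip install pymupdf

def split_pdf(path: str):
    """Separate a PDF into page texts, extracted tables, and raw image bytes."""
    texts, tables, images = [], [], []
    doc = fitz.open(path)
    for page in doc:
        # Page text; real pipelines usually chunk this further before embedding.
        texts.append(page.get_text("text"))

        # Table detection requires a recent PyMuPDF release (>= 1.23).
        for table in page.find_tables().tables:
            tables.append(table.extract())  # rows as lists of cell strings

        # Embedded images as raw bytes, ready for OCR or a vision model.
        for xref, *_ in page.get_images(full=True):
            images.append(doc.extract_image(xref)["image"])
    return texts, tables, images

texts, tables, images = split_pdf("report.pdf")
```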

After extraction, each modality uses a suitable indexing strategy: plain text uses standard vector embeddings; tables benefit from LLM‑generated descriptions before embedding; images are indexed either as OCR‑derived text or as embeddings from vision models. During retrieval, associated chunks are linked back to the original table or image for downstream generation.
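
The table branch of this strategy can be sketched as follows, assuming an OpenAI-style chat client; the model name and prompt are illustrative, and `tables` comes from the extraction sketch above.

```python
# "Summarize, then embed" indexing for tables (sketch).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def summarize_table(table_rows: list[list[str]]) -> str:
    """Turn a table into a short natural-language description for embedding."""
    table_text = "\n".join(" | ".join(map(str, row)) for row in table_rows)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{
            "role": "user",
            "content": "Summarize the key facts in this table in 3-5 sentences:\n" + table_text,
        }],
    )
    return resp.choices[0].message.content

# The summaries are embedded; the raw tables are stored separately and linked
# back to their summaries by ID at retrieval time (see the retriever sketch below).
table_summaries = [summarize_table(t) for t in tables]
```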

Query Rewriting Techniques

Four common query-rewriting patterns can improve retrieval relevance:

HyDE (Hypothetical Document Embeddings): generate a hypothetical answer, embed it, retrieve similar knowledge, then combine with the original query for final generation (see the sketch after this list).

Stepwise Question Rewriting: decompose a complex query into a sequence of simpler sub-questions, retrieve answers step by step, and synthesize the final response.

Sub-Question Rewriting: generate multiple related sub-questions, retrieve answers for each, and aggregate them to answer the original query.

Backward Question Rewriting: transform the original query into a more general "backward" question, retrieve knowledge for that, then use both the backward answer and the original query to produce the final answer.
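
As an illustration of the first pattern, here is a minimal HyDE sketch. It assumes an OpenAI-style client for generation and embeddings and a generic `vector_store.search` interface; all names are illustrative rather than a specific library's API.

```python
# HyDE (sketch): answer hypothetically first, then retrieve with that answer.
from openai import OpenAI

client = OpenAI()

def hyde_retrieve(question: str, vector_store, k: int = 4):
    # 1. Ask the LLM for a plausible (possibly wrong) answer to the question.
    hypothetical = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {question}"}],
    ).choices[0].message.content

    # 2. Embed the hypothetical answer instead of the raw question.
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=hypothetical
    ).data[0].embedding

    # 3. Retrieve real chunks that lie close to the hypothetical answer;
    #    they are then combined with the original question for generation.
    return vector_store.search(emb, k=k)  # store interface assumed
```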

For the associative retrieval described in the multimodal section (linking embedded summaries back to their original tables or images), both LangChain (via MultiVectorRetriever) and LlamaIndex (via RecursiveRetriever) provide built-in support.
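
A hedged sketch of that associative pattern with LangChain's MultiVectorRetriever follows; import paths shift between LangChain releases, and the Chroma store, OpenAI embeddings, and in-memory docstore are assumptions. `table_summaries` and `tables` come from the earlier sketches.

```python
# Summaries are embedded for search; originals are returned at retrieval time.
import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(collection_name="summaries",
                     embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()  # holds the original tables / image payloads
retriever = MultiVectorRetriever(vectorstore=vectorstore,
                                 docstore=docstore,
                                 id_key="doc_id")

# One shared ID links each embedded summary to its original table.
ids = [str(uuid.uuid4()) for _ in table_summaries]
summary_docs = [Document(page_content=s, metadata={"doc_id": ids[i]})
                for i, s in enumerate(table_summaries)]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(ids, tables)))  # retrieval yields the originals
```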

RAG Evaluation

Evaluating RAG applications requires metrics beyond traditional software testing because outputs are natural language. Core evaluation inputs include the user question, generated answer, context passages, and a human‑annotated reference answer. Important metrics are:

Correctness: similarity between the answer and the reference answer.

Semantic Similarity: semantic overlap between the answer and the reference.

Faithfulness: consistency of the answer with the retrieved context (avoiding hallucinations).

Context Relevancy: relevance of the retrieved contexts to the question.

Answer Relevancy: how well the answer addresses the question, regardless of correctness.

Context Precision: ranking quality of relevant context entries, judged against the reference answer.

Context Recall: proportion of reference-answer content covered by the retrieved contexts.

Frameworks such as LlamaIndex's Evaluation module, LangChain's LangSmith, and third-party tools like RAGAS and Langfuse provide implementations of these metrics.
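
For instance, a RAGAS run can look like the sketch below, which follows its classic v0.1-style API (newer releases restructure the interface); the sample rows are placeholders, and an LLM API key is assumed to be configured in the environment.

```python
# Scoring a RAG answer with RAGAS (sketch, classic v0.1-style API).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

eval_data = Dataset.from_dict({
    "question":     ["What does the Q3 revenue table show?"],
    "answer":       ["Q3 revenue grew 12% year over year."],                 # generated answer
    "contexts":     [["Table summary: Q3 revenue was 1.2B, up 12% YoY."]],   # retrieved context
    "ground_truth": ["Q3 revenue increased 12% compared to Q3 last year."],  # reference answer
})

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores in [0, 1]
```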

Tags: RAG, Document Parsing, Multimodal, Evaluation, Query Rewriting
Written by AI Large Model Application Practice

Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.
