Multimodal RAG with LangChain: PDF Parsing, Chunking, and Citation Guide
This article walks through building a LangChain‑based multimodal RAG system that parses PDFs (both native and scanned), splits them into semantic chunks, stores embeddings in a vector database, and generates answers with precise source citations, complete with code samples and API integration.
Standard PDF processing pipeline
The workflow consists of document extraction, optional OCR, text chunking, vectorization, semantic retrieval, and citation generation.
Document extraction
Native PDFs: use PyMuPDF or PyPDF2 to extract text, images, and tables. Images are replaced by URL placeholders to preserve document structure (see the sketch after this list).
Scanned PDFs: apply OCR engines such as DeepSeek‑OCR, Paddle‑OCR, or OCRFlux‑3b (recommended for complex tables) to convert page images into Markdown.
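For native PDFs, the extraction step can be as small as the following sketch. It assumes PyMuPDF (imported as fitz); the placeholder naming scheme and the decision to append markers at the end of each page's text are illustrative assumptions, since the article does not specify them.

```python
# Minimal native-PDF extraction with PyMuPDF, replacing embedded images
# with placeholder markers so document structure is preserved.
import fitz  # PyMuPDF

def extract_native_pdf(path: str) -> list[str]:
    pages = []
    with fitz.open(path) as doc:
        for page_index, page in enumerate(doc):
            text = page.get_text("text")  # plain text in reading order
            for img_index, _ in enumerate(page.get_images(full=True)):
                # In a real pipeline the image bytes would be uploaded and the
                # returned URL inserted; here we keep a placeholder marker.
                text += f"\n[image: page{page_index + 1}_img{img_index + 1}.png]"
            pages.append(text)
    return pages
```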
Chunking
After extraction, the document is split into semantically complete fragments either by paragraph or by fixed length (e.g., using RecursiveCharacterTextSplitter with chunk_size=1000 and chunk_overlap=200). All chunks are stored in a unified knowledge base.
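A minimal chunking sketch with the parameters mentioned above, reusing `extract_native_pdf` from the earlier extraction sketch:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # maximum characters per chunk
    chunk_overlap=200,    # overlap preserves context across boundaries
    separators=["\n\n", "\n", " ", ""],  # try paragraphs first, then lines
)
full_document_text = "\n\n".join(extract_native_pdf("example.pdf"))
chunks = splitter.split_text(full_document_text)
```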
Semantic retrieval
When a user query arrives, the system performs a semantic search over the chunk store, returns the most relevant fragments, and combines them with the original query before sending the prompt to the large model.
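One way to assemble the final prompt from retrieved fragments, numbering them so they match the [1], [2] citation scheme used later. The `vectorstore` object is assumed to be built as in the next section, and the prompt layout is an illustrative choice:

```python
def build_prompt(query: str, vectorstore, top_k: int = 4) -> str:
    # Retrieve the most relevant chunks and number them so the model
    # can cite them as [1], [2], ...
    docs = vectorstore.similarity_search(query, k=top_k)
    context = "\n\n".join(f"[{i + 1}] {d.page_content}" for i, d in enumerate(docs))
    return f"=== 参考文档内容 ===\n{context}\n\nUser question: {query}"
```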
Vectorization
Each chunk is converted into a high‑dimensional embedding. A vector database stores the embeddings together with the original text, enabling accurate similarity matching.
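A sketch of the vectorization step using LangChain's FAISS integration; the embedding model is an assumption, and any LangChain-compatible embedding class would work. `chunks` comes from the chunking sketch above.

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # assumed model
# Metadata travels with each vector so citations can later point back
# to the source chunk.
vectorstore = FAISS.from_texts(
    texts=chunks,
    embedding=embeddings,
    metadatas=[{"chunk_id": i} for i in range(len(chunks))],
)
docs = vectorstore.similarity_search("example query", k=4)
```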
Citation implementation
Metadata (document ID, page number, paragraph position) is stored in a relational table. The system prompt forces the model to cite sources using a [1], [2] format. The function extract_references_from_content parses the model’s output, matches citation numbers to stored chunks, and returns structured reference data.
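The article does not show extract_references_from_content itself, so here is a minimal sketch of what it might look like, assuming 1-based [n] markers and chunk dicts carrying source_info, page, and chunk_id keys:

```python
import re
from typing import Any, Dict, List

def extract_references_from_content(content: str, pdf_chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # Collect the distinct [n] markers from the model's answer, in order.
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", content)})
    references = []
    for num in cited:
        index = num - 1  # assume 1-based citations over a 0-based chunk list
        if 0 <= index < len(pdf_chunks):
            chunk = pdf_chunks[index]
            references.append({
                "citation": num,
                "source": chunk.get("source_info"),
                "page": chunk.get("page"),
                "chunk_id": chunk.get("chunk_id"),
                "preview": chunk.get("content", "")[:200],
            })
    return references
```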
Core workflow
OCR extraction – obtain raw content from scanned PDFs.
Chunking – create semantic blocks with metadata (source info, page number, etc.).
Vector storage – embed blocks and save vectors and metadata.
Prompt construction – merge system prompt, relevant chunks, and user query.
Code implementation
Environment setup: pip install PyMuPDF langchain_text_splitters
The key class PDFProcessor provides:
extract_pdf_pages_as_images – converts each PDF page to a base64‑encoded PNG for OCR.
read_pdf_pages – validates file existence and reads raw pages.
process_pdf – reads the PDF, extracts text, splits it with RecursiveCharacterTextSplitter, and builds document chunks that include metadata such as source_info and chunk_id (sketched below).
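The article only describes process_pdf, so the following is a hedged sketch under those assumptions: PyMuPDF for reading, and a chunk schema inferred from the metadata fields named above.

```python
import fitz  # PyMuPDF
from typing import Any, Dict, List
from langchain_text_splitters import RecursiveCharacterTextSplitter

class PDFProcessor:
    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=chunk_overlap
        )

    # Declared async only to match the `await` at the call site in the
    # endpoint below; the body itself is synchronous.
    async def process_pdf(self, file_content: bytes, filename: str) -> List[Dict[str, Any]]:
        chunks: List[Dict[str, Any]] = []
        with fitz.open(stream=file_content, filetype="pdf") as doc:
            for page_number, page in enumerate(doc, start=1):
                for i, text in enumerate(self.splitter.split_text(page.get_text("text"))):
                    chunks.append({
                        "content": text,
                        "source_info": filename,
                        "page": page_number,
                        "chunk_id": f"{filename}-p{page_number}-c{i}",
                    })
        return chunks
```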
Data models (Pydantic BaseModel):
from typing import Any, Dict, List, Optional
from pydantic import BaseModel, Field

class ContentBlock(BaseModel):
    type: str = Field(description="content type: text, image, audio")
    content: Optional[str] = Field(default=None, description="content data")

class MessageRequest(BaseModel):
    content_blocks: List[ContentBlock] = Field(default=[], description="content blocks")
    history: List[Dict[str, Any]] = Field(default=[], description="dialogue history")
    pdf_chunks: List[Dict[str, Any]] = Field(default=[], description="PDF document chunk info for citation")

class MessageResponse(BaseModel):
    content: str
    timestamp: str
    role: str
    references: List[Dict[str, Any]]  # PDF citations

Message creation enforces citation rules:
from fastapi import UploadFile
from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage

def create_multimodal_message(request: MessageRequest, image_file: UploadFile | None, audio_file: UploadFile | None) -> HumanMessage:
    # Build a list of content dicts for text, image_url, and audio_url
    # Append PDF chunks as a formatted reference block
    # Ensure the final text block contains the "=== 参考文档内容 ===" (reference document content) section
    ...

def convert_history_to_messages(history: List[Dict[str, Any]]) -> List[BaseMessage]:
system_prompt = """
你是一个专业的多模态 RAG 助手,具备以下能力:
1. 与用户对话
2. 图像识别(OCR、目标检测、场景理解)
3. 音频转写与分析
4. 知识检索与问答
引用格式要求:当答案基于提供的参考文档时,必须在相关信息后添加引用标记,如 [1]、[2]。
"""
messages.append(SystemMessage(content=system_prompt))
    ...

A streaming response generator integrates citation extraction:
import json
from datetime import datetime
from typing import Any, AsyncGenerator, Dict, List

async def generate_streaming_response(messages: List[BaseMessage], pdf_chunks: List[Dict[str, Any]] | None = None) -> AsyncGenerator[str, None]:
    model = get_chat_model()
    full_response = ""
    async for chunk in model.astream(messages):
        if hasattr(chunk, "content") and chunk.content:
            content = chunk.content
            full_response += content
            event = {"type": "content_delta", "content": content, "timestamp": datetime.now().isoformat()}
            # Each SSE frame is "data: <json>" followed by a blank line
            yield f"data: {json.dumps(event, ensure_ascii=False)}\n\n"
    references = extract_references_from_content(full_response, pdf_chunks) if pdf_chunks else []
    event = {"type": "message_complete", "full_content": full_response, "timestamp": datetime.now().isoformat(), "references": references}
    yield f"data: {json.dumps(event, ensure_ascii=False)}\n\n"

A FastAPI endpoint ties everything together:
@app.post("/api/chat/stream")
async def chat_stream(
image_file: UploadFile | None = File(None),
content_blocks: str = Form("[]"),
history: str = Form("[]"),
audio_file: UploadFile | None = File(None),
pdf_file: UploadFile | None = File(None)
):
content_blocks_data = json.loads(content_blocks)
history_data = json.loads(history)
if pdf_file:
pdf_processor = PDFProcessor()
pdf_content = await pdf_file.read()
pdf_chunks = await pdf_processor.process_pdf(file_content=pdf_content, filename=pdf_file.filename)
request_data = MessageRequest(content_blocks=content_blocks_data, history=history_data, pdf_chunks=pdf_chunks)
else:
request_data = MessageRequest(content_blocks=content_blocks_data, history=history_data)
messages = convert_history_to_messages(request_data.history)
current_message = create_multimodal_message(request_data, image_file=image_file, audio_file=audio_file)
messages.append(current_message)
return StreamingResponse(
generate_streaming_response(messages, pdf_chunks if pdf_file else None),
media_type="text/event-stream",
headers={"Cache-Control": "no-cache", "Connection": "keep-alive", "Content-Type": "text/event-stream"}
)Testing
Using Postman, a PDF about the historical figure Guan Yu was uploaded. The system extracted the relevant paragraph, returned an answer with citation [0], and the reference included full metadata (source, page number, chunk ID).
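The same test can be scripted with the requests library; the filename and question below are hypothetical stand-ins for the Postman setup described above:

```python
import requests

with open("guan_yu.pdf", "rb") as f:  # hypothetical test file
    resp = requests.post(
        "http://localhost:8000/api/chat/stream",
        files={"pdf_file": ("guan_yu.pdf", f, "application/pdf")},
        data={
            "content_blocks": '[{"type": "text", "content": "Who was Guan Yu?"}]',
            "history": "[]",
        },
        stream=True,
    )
for line in resp.iter_lines(decode_unicode=True):
    if line:
        print(line)  # each SSE "data: {...}" event, ending with message_complete
```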
Future improvements
Integrate a dedicated vector database (e.g., Pinecone, Milvus) to replace full‑text matching with semantic similarity search.
Support multiple PDF documents to build a larger, multi‑source knowledge base.
Enhance the OCR pipeline by incorporating Paddle‑OCR or DeepSeek‑OCR for higher accuracy on scanned documents.