Build a Multimodal RAG Pipeline with Kotaemon, Azure Document Intelligence, and VLM
This guide walks through setting up the open‑source Kotaemon framework, configuring Azure Document Intelligence and a visual large model, and implementing code to extract and caption images and tables from PDFs for end‑to‑end multimodal RAG applications.
Overview
Kotaemon is an open‑source RAG framework that combines traditional RAG with GraphRAG and provides a UI front‑end. It supports multimodal document parsing, hybrid retrieval, and multiple generation modes such as complex‑question‑decomposition and Agentic RAG.
Key Features
Hybrid retrieval pipeline that mixes vector and full‑text search with a re‑ranking model.
Multiple generation/inference modes, including complex and Agentic RAG.
Built‑in support for multimodal documents (PDF tables, images) using Azure Document Intelligence.
Extensible pipeline with Microsoft GraphRAG integration.
Getting Started
Clone the repository and set up a conda environment:
# Create virtual environment
conda create -n kotaemon python=3.10
conda activate kotaemon
# Clone the code
git clone https://github.com/Cinnamon/kotaemon
cd kotaemon
# Install dependencies
pip install -e "libs/kotaemon[all]"
pip install -e "libs/ktem"

Create a .env file in the project root with the required model and service credentials (Azure OpenAI endpoint, key, deployment name, API version, etc.). Optionally download pdf.js and place it under libs/ktem/ktem/assets/prebuilt to preview source PDF pages in the UI.
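As an illustration, the .env might look like the following. The variable names below are placeholders assumed for this sketch; check the repository's .env.example for the authoritative keys:

```shell
# Illustrative placeholders only — consult .env.example in the repo
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
OPENAI_API_VERSION=2024-02-15-preview
AZURE_OPENAI_CHAT_DEPLOYMENT=gpt-4o-mini
# Azure Document Intelligence (used by the multimodal loader)
AZURE_DI_ENDPOINT=https://<your-di-resource>.cognitiveservices.azure.com/
AZURE_DI_CREDENTIAL=<your-di-key>
```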
Run the application:
python app.py
The UI opens automatically in a browser.
Configuring Azure Document Intelligence
In the Azure Portal, create an Azure AI Document Intelligence resource (the free tier is sufficient for testing). Record the service Endpoint and Key; both values go into the .env file.
Multimodal Document Loader
The loader libs/kotaemon/kotaemon/loaders/azureai_document_intelligence_loader.py performs three main steps:
Instantiate an azure.ai.documentintelligence.DocumentIntelligenceClient using the endpoint and key.
Call client.begin_analyze_document with a chosen model (e.g., prebuilt-layout for layout extraction or prebuilt-read for plain text) and obtain the result in markdown format.
Post‑process the result to extract figures and tables, optionally sending each figure to a visual large model (VLM) for caption generation.
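The core call behind these steps can be sketched as follows. Keyword names in the azure-ai-documentintelligence SDK have shifted between versions, so treat this as a sketch of the pattern rather than the loader's exact code:

```python
def analyze_pdf_markdown(endpoint: str, key: str, pdf_path: str):
    """Sketch: analyze a PDF and return markdown-formatted results.

    Assumes the azure-ai-documentintelligence SDK; parameter names may
    differ slightly across SDK versions.
    """
    # Deferred imports mirror the loader's optional-dependency pattern
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))
    with open(pdf_path, "rb") as f:
        # "prebuilt-layout" extracts layout; "prebuilt-read" extracts plain text
        poller = client.begin_analyze_document(
            "prebuilt-layout", f, output_content_format="markdown"
        )
    result = poller.result()
    # result.content holds the markdown; figures and tables are
    # post-processed in the sections that follow
    return result
```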
Figure Extraction and Captioning
For each figure the loader crops the image from the PDF, encodes it as a base64 data URL, and calls generate_single_figure_caption which invokes a VLM (by default Azure OpenAI gpt‑4o‑mini) to produce a Chinese description. The caption and image metadata are stored in a Document object.
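The crop-and-encode step amounts to building a base64 data URL from the image bytes. A minimal self-contained sketch (the helper name is illustrative, not Kotaemon's):

```python
import base64

def image_to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL for a VLM chat message."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

# The resulting URL is sent as an image content part together with a text
# prompt asking the model to describe the figure.
url = image_to_data_url(b"\x89PNG\r\n\x1a\n")  # PNG signature bytes as a stand-in
```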
def client_(self):
    # Lazily import the SDK so it remains an optional dependency
    try:
        from azure.ai.documentintelligence import DocumentIntelligenceClient
        from azure.core.credentials import AzureKeyCredential
    except ImportError:
        raise ImportError("Please install azure-ai-documentintelligence")
    return DocumentIntelligenceClient(self.endpoint, AzureKeyCredential(self.credential))

# Simplified figure processing loop
for figure_desc in result.get("figures", []):
    # Skip captioning when no VLM endpoint is configured
    if not self.vlm_endpoint:
        continue
    # crop image, encode to base64, generate caption, build metadata
    ...

Table Extraction
Tables are returned in the tables field as markdown. The loader uses the offset and length information to slice the original markdown text and creates a Document with metadata indicating the page number and type “table”.
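That offset/length slicing can be exercised on a toy markdown string before looking at the loader's own snippet; here Document is stood in for by a simple dataclass so the sketch is self-contained:

```python
from dataclasses import dataclass, field

@dataclass
class Document:  # minimal stand-in for kotaemon's Document class
    text: str
    metadata: dict = field(default_factory=dict)

text_content = "Intro text.\n| a | b |\n|---|---|\n| 1 | 2 |\nOutro."
# A span as Document Intelligence would report it for the table region
table_desc = {"spans": [{"offset": 12, "length": 29}]}

offset = table_desc["spans"][0]["offset"]
length = table_desc["spans"][0]["length"]
table_text = text_content[offset : offset + length]
doc = Document(
    text=table_text,
    metadata={"type": "table", "page_label": 1, "table_origin": table_text},
)
```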
offset = table_desc["spans"][0]["offset"]
length = table_desc["spans"][0]["length"]
table_metadata = {
    "type": "table",
    "page_label": page_number,
    "table_origin": text_content[offset : offset + length],
}
tables.append(Document(text=text_content[offset : offset + length],
                       metadata=table_metadata))

Putting It All Together
After extracting text, figures, and tables, Kotaemon builds a unified list of Document objects. These are then split, embedded, and indexed for vector search, enabling end‑to‑end RAG over multimodal content.
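A toy version of that split-and-collect stage is sketched below. Kotaemon's actual splitters and indexers are configurable; this fixed-size chunker only illustrates how text, table, and figure-caption documents flow through the same path:

```python
def chunk_documents(docs, chunk_size=200):
    """Break each document's text into fixed-size chunks, carrying
    its metadata forward so chunk provenance survives indexing."""
    chunks = []
    for doc in docs:
        for i in range(0, len(doc["text"]), chunk_size):
            chunks.append(
                {"text": doc["text"][i : i + chunk_size], "metadata": doc["metadata"]}
            )
    return chunks

docs = [
    {"text": "Body text " * 50, "metadata": {"type": "text"}},
    {"text": "| a | b |\n|---|---|", "metadata": {"type": "table"}},
    {"text": "Figure caption: quarterly sales bar chart.", "metadata": {"type": "image"}},
]
chunks = chunk_documents(docs)  # these chunks are then embedded and indexed
```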
Conclusion
Kotaemon demonstrates how to combine Azure Document Intelligence and a visual large model to parse complex PDFs, extract images and tables, generate semantic captions, and feed everything into a RAG pipeline. The approach is fully extensible—developers can replace the VLM, adjust the Azure model, or modify the loader code to suit their own applications.
AI Large Model Application Practice
Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.