Build a Multimodal RAG Pipeline with Kotaemon, Azure Document Intelligence, and VLM

This guide walks through setting up the open‑source Kotaemon framework, configuring Azure Document Intelligence and a visual large model, and implementing code to extract and caption images and tables from PDFs for end‑to‑end multimodal RAG applications.

Overview

Kotaemon is an open‑source RAG framework that combines traditional RAG with GraphRAG and provides a UI front‑end. It supports multimodal document parsing, hybrid retrieval, and multiple generation modes such as complex‑question‑decomposition and Agentic RAG.

Key Features

Hybrid retrieval pipeline that mixes vector and full‑text search with a re‑ranking model.

Multiple generation/inference modes, including complex and Agentic RAG.

Built‑in support for multimodal documents (PDF tables, images) using Azure Document Intelligence.

Extensible pipeline with Microsoft GraphRAG integration.

Getting Started

Clone the repository and set up a conda environment:

# Create virtual environment
conda create -n kotaemon python=3.10
conda activate kotaemon

# Clone the code
git clone https://github.com/Cinnamon/kotaemon
cd kotaemon

# Install dependencies
pip install -e "libs/kotaemon[all]"
pip install -e "libs/ktem"

Create a .env file in the project root with the required model and service credentials (Azure OpenAI endpoint, key, deployment name, API version, etc.). Optionally download pdf.js and place it under libs/ktem/ktem/assets/prebuilt to preview source PDF pages in the UI.
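
For example, the Azure OpenAI entries might look like this (variable names are illustrative; compare them against the .env.example shipped with the repository):

# Illustrative .env entries -- match the keys your Kotaemon version reads
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_API_KEY=<your-key>
OPENAI_API_VERSION=2024-06-01
AZURE_OPENAI_CHAT_DEPLOYMENT=gpt-4o-mini
AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT=text-embedding-3-small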

Run the application:

python app.py

The UI opens automatically in a browser.

Configuring Azure Document Intelligence

In the Azure Portal, create an Azure AI Document Intelligence resource (the free tier is sufficient for testing) and record the service endpoint and key; both values are added to the .env file.
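
For instance, the entries might look like this (variable names are illustrative; use whatever keys your Kotaemon version actually reads):

# Illustrative .env entries for Document Intelligence
AZURE_DI_ENDPOINT=https://<your-resource>.cognitiveservices.azure.com/
AZURE_DI_CREDENTIAL=<your-key>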

Multimodal Document Loader

The loader libs/kotaemon/kotaemon/loaders/azureai_document_intelligence_loader.py performs three main steps:

Instantiate an azure.ai.documentintelligence.DocumentIntelligenceClient using the endpoint and key.

Call client.begin_analyze_document with a chosen model (e.g., prebuilt-layout for layout extraction or prebuilt-read for plain text) and obtain the result in markdown format, as sketched after these steps.

Post‑process the result to extract figures and tables, optionally sending each figure to a visual large model (VLM) for caption generation.
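
A minimal sketch of that analysis call, assuming a PDF on disk (parameter names follow the preview azure-ai-documentintelligence SDK and may differ in newer releases):

# Analyze a PDF and request markdown output (preview-SDK parameter names)
with open("sample.pdf", "rb") as f:
    poller = client.begin_analyze_document(
        "prebuilt-layout",                 # or "prebuilt-read" for plain text
        analyze_request=f,
        content_type="application/octet-stream",
        output_content_format="markdown",
    )
result = poller.result()
text_content = result.content             # the full markdown string
figures = result.get("figures", [])       # figure regions and spans
tables = result.get("tables", [])         # table cells and spans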

Figure Extraction and Captioning

For each figure, the loader crops the image from the PDF, encodes it as a base64 data URL, and calls generate_single_figure_caption, which invokes a VLM (by default Azure OpenAI gpt‑4o‑mini) to produce a Chinese description. The caption and image metadata are stored in a Document object.

def client_(self):
    """Build the Document Intelligence client from the configured endpoint and key."""
    try:
        from azure.ai.documentintelligence import DocumentIntelligenceClient
        from azure.core.credentials import AzureKeyCredential
    except ImportError:
        raise ImportError(
            "Please install azure-ai-documentintelligence: "
            "pip install azure-ai-documentintelligence"
        )
    return DocumentIntelligenceClient(
        self.endpoint, AzureKeyCredential(self.credential)
    )

# Simplified figure-processing loop
for figure_desc in result.get("figures", []):
    if not self.vlm_endpoint:
        continue  # no VLM configured, so skip captioning
    # crop the figure from the PDF page, encode it as a base64 data URL,
    # request a caption from the VLM, and build the Document metadata
    ...
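
The caption request itself is not shown above. A minimal sketch of what it might look like, assuming an Azure OpenAI gpt‑4o‑mini deployment and the standard openai SDK (the helper name and api_version value here are illustrative, not Kotaemon's exact code):

import base64

from openai import AzureOpenAI  # pip install openai

def generate_figure_caption(image_bytes: bytes, endpoint: str, api_key: str) -> str:
    """Ask a vision-capable deployment to describe one cropped figure."""
    client = AzureOpenAI(
        azure_endpoint=endpoint,
        api_key=api_key,
        api_version="2024-06-01",  # assumption: any vision-capable API version
    )
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # your Azure deployment name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this figure concisely."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )
    return response.choices[0].message.content or ""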

Table Extraction

Tables are returned in the result's tables field, each carrying span information (offset and length) into the markdown output. The loader uses these spans to slice the original markdown text and creates a Document whose metadata records the page number and the type "table".

# Each table's span locates its slice within the full markdown output
offset = table_desc["spans"][0]["offset"]
length = table_desc["spans"][0]["length"]
table_text = text_content[offset : offset + length]
table_metadata = {
    "type": "table",
    "page_label": page_number,   # page the table appears on
    "table_origin": table_text,  # raw markdown kept for traceability
}
tables.append(Document(text=table_text, metadata=table_metadata))

Putting It All Together

After extracting text, figures, and tables, Kotaemon builds a unified list of Document objects. These are then split, embedded, and indexed for vector search, enabling end‑to‑end RAG over multimodal content.
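
As a usage sketch, the whole flow might be driven like this (the class and parameter names are assumptions inferred from the loader file above; check the repository for the exact interface):

# Hypothetical end-to-end usage; names inferred from the loader module
from pathlib import Path

from kotaemon.loaders import AzureAIDocumentIntelligenceLoader

loader = AzureAIDocumentIntelligenceLoader(
    endpoint="https://<your-di-resource>.cognitiveservices.azure.com/",
    credential="<your-di-key>",
    vlm_endpoint="<azure-openai-vision-endpoint>",  # enables figure captioning
)
documents = loader.load_data(Path("report.pdf"))  # text, figure, and table Documents
# hand `documents` to Kotaemon's indexing pipeline for splitting and embedding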

Conclusion

Kotaemon demonstrates how to combine Azure Document Intelligence and a visual large model to parse complex PDFs, extract images and tables, generate semantic captions, and feed everything into a RAG pipeline. The approach is extensible: developers can replace the VLM, adjust the Azure model, or modify the loader code to suit their own applications.
