Multimodal RAG: A Complete Guide to Ingesting Images, Tables, and PDFs

This article examines the blind spot of pure-text RAG for visual content and compares three multimodal ingestion strategies: CLIP embeddings, image-to-text captioning with a MultiVectorRetriever, and ColPali visual retrieval. It also covers table-specific handling, presents end-to-end TypeScript implementations, and lists common pitfalls to avoid when deploying production-grade multimodal RAG pipelines.

James' Growth Diary

May 13, 2026

Real‑world documents often contain architecture diagrams, financial tables, screenshots, and flowcharts. Pure‑text RAG loaders such as UnstructuredLoader or PDFLoader extract only the textual layer, send tables to OCR (which frequently loses layout), and either discard images or store them without any downstream processing, making visual information invisible to the retriever.

Visual blind spot of pure‑text RAG

When a PDF is loaded, the pipeline is effectively:

PDF → OCR/parse → text chunks → vectorise → retrieve

OCR error rates of 5-15 % and broken table structures cause critical knowledge, such as component relationships in architecture diagrams or quarterly growth numbers in tables, to be omitted from the knowledge base.

Three main solution paths

Multimodal embedding (CLIP) – map images and text into a shared vector space; roughly 60 % accuracy on detail-oriented queries; low cost; suited for image-similarity search.

Image‑to‑text caption (Caption) – use a vision‑language model (VLM) to generate a detailed textual description of each image, index the caption, and retrieve the original image for answer generation; accuracy ≈ 90 %; medium‑high cost; best for precise document Q&A.

ColPali visual retrieval – render each PDF page as an image and index it directly with a late‑interaction visual transformer, eliminating OCR; highest accuracy; medium cost; ideal for complex layouts.

Solution 1 – CLIP multimodal embedding

CLIP (OpenAI 2021) learns a joint embedding where semantically similar text and images are close in a high-dimensional space. The TypeScript sketch below uses LangChain, ChromaDB, and OpenCLIP to store both images and text in the same vector store and retrieve images with a textual query. The `OpenCLIPEmbeddings` import and the `addImages` method appear to mirror LangChain's Python OpenCLIP integration, so confirm that your LangChain.js version provides equivalents before relying on them.

import { Chroma } from "@langchain/community/vectorstores/chroma";
// NOTE: this import path and `addImages` below follow the Python OpenCLIP integration;
// verify that your LangChain.js version ships them, or substitute an image-capable embedder.
import { OpenCLIPEmbeddings } from "@langchain/community/embeddings/openclip";
import { ChatOpenAI } from "@langchain/openai";
import * as fs from "fs";
import * as path from "path";

// 1. Initialise multimodal vector store
const embeddings = new OpenCLIPEmbeddings();
const vectorStore = await Chroma.fromExistingCollection(embeddings, { collectionName: "multimodal_docs" });

// 2. Add images and a sample text document
const imageUris = fs.readdirSync("./docs/images")
  .filter(f => /\.(png|jpe?g)$/i.test(f))
  .map(f => path.join("./docs/images", f));
await vectorStore.addImages(imageUris);
await vectorStore.addDocuments([{ pageContent: "Q3 enterprise sales drove a 15% revenue increase", metadata: { type: "text" } }]);

// 3. Retrieve matching images with a text query
const retriever = vectorStore.asRetriever({ k: 3 });
const results = await retriever.invoke("Q3 revenue growth trend");

// 4. Send retrieved images to a VLM for answer generation
//    (assumes image entries carry metadata.type === "image" and metadata.uri)
const visionModel = new ChatOpenAI({ model: "gpt-4o" });
for (const doc of results) {
  if (doc.metadata.type === "image") {
    const uri: string = doc.metadata.uri;
    const imageBase64 = fs.readFileSync(uri, "base64");
    // Match the data-URL MIME type to the file extension (.jpg files are not PNG)
    const mimeType = uri.toLowerCase().endsWith(".png") ? "image/png" : "image/jpeg";
    const response = await visionModel.invoke([
      { role: "user", content: [
        { type: "text", text: "Analyze this chart and summarize the core trend" },
        { type: "image_url", image_url: { url: `data:${mimeType};base64,${imageBase64}` } }
      ] }
    ]);
    console.log(response.content);
  }
}

Limitation: CLIP produces a single global vector per image, discarding fine-grained details such as numbers in a table. This results in roughly 60 % accuracy for detailed queries.
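To make the single-vector limitation concrete, here is a minimal sketch of how retrieval in a shared embedding space ranks items: one vector per item, compared with the query by cosine similarity. The 3-dimensional vectors and item names are made up for illustration and are not real CLIP outputs.

```typescript
// Toy sketch of single-vector retrieval in a shared embedding space.
// All vectors are invented 3-dimensional stand-ins, not real CLIP embeddings.
type Embedded = { id: string; kind: "image" | "text"; vector: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank every item (image or text) against the query and keep the k best.
function topK(query: number[], items: Embedded[], k: number): Embedded[] {
  return items
    .slice()
    .sort((p, q) => cosineSimilarity(query, q.vector) - cosineSimilarity(query, p.vector))
    .slice(0, k);
}

const items: Embedded[] = [
  { id: "q3_revenue_chart.png", kind: "image", vector: [0.9, 0.1, 0.0] },
  { id: "arch_diagram.png", kind: "image", vector: [0.0, 0.2, 0.9] },
  { id: "q3_sales_note", kind: "text", vector: [0.8, 0.3, 0.1] },
];

// Stand-in for the embedding of the query "Q3 revenue growth trend"
const queryVector = [1.0, 0.2, 0.0];
const ranked = topK(queryVector, items, 2).map(item => item.id);
console.log(ranked); // → ["q3_revenue_chart.png", "q3_sales_note"]
```

Because the whole chart collapses into one vector, the ranking can say that the chart is "about Q3 revenue" but carries no information about which individual numbers appear in it.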

Solution 2 – Image‑to‑text caption (Caption)

The image itself is not indexed. Instead, a VLM (e.g., GPT‑4o or Claude) generates a detailed caption that contains type, key values, trends, and relationships. The caption is indexed for semantic retrieval; the original image is fetched only for final answer generation.

import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { InMemoryStore } from "@langchain/core/stores";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { Document } from "@langchain/core/documents";
import { v4 as uuidv4 } from "uuid";
import * as fs from "fs";

const visionModel = new ChatOpenAI({ model: "gpt-4o" });
const embeddings = new OpenAIEmbeddings();

// Step 1: generate a caption for each image
async function generateImageCaption(imagePath: string): Promise<string> {
  const imageBase64 = fs.readFileSync(imagePath, "base64");
  const response = await visionModel.invoke([
    { role: "user", content: [
      { type: "text", text: "Please describe this image in detail (type, main data, conclusions) within 100-200 words." },
      { type: "image_url", image_url: { url: `data:image/png;base64,${imageBase64}` } }
    ] }
  ]);
  // `content` may be a string or an array of parts; normalise it to a string
  return typeof response.content === "string" ? response.content : JSON.stringify(response.content);
}

const imageFiles = ["./docs/arch_diagram.png", "./docs/q3_revenue.png"];
const captions = await Promise.all(imageFiles.map(generateImageCaption));

// Step 2: build a MultiVectorRetriever (captions in vector store, raw paths in doc store)
const vectorStore = new Chroma(embeddings, { collectionName: "image_captions" });
const byteStore = new InMemoryStore<Uint8Array>();
const retriever = new MultiVectorRetriever({ vectorstore: vectorStore, byteStore, idKey: "doc_id" });

const ids = imageFiles.map(() => uuidv4());
const captionDocs = captions.map((c, i) => new Document({ pageContent: c, metadata: { doc_id: ids[i] } }));
await retriever.vectorstore.addDocuments(captionDocs);
// The retriever wraps the byte store in a document store, so parents are written as Documents
// whose pageContent holds the image path.
const parentDocs: [string, Document][] = ids.map((id, i) => [
  id,
  new Document({ pageContent: imageFiles[i], metadata: { doc_id: id, type: "image" } }),
]);
await retriever.docstore.mset(parentDocs);

// Step 3: query – the retriever returns the parent documents (original image paths)
const results = await retriever.invoke("What are the core components shown in the architecture diagram?");
// Each result's pageContent is an image path: load that image and send it to the VLM for the final answer.

Cost consideration: Generating captions for 1 000 images with GPT-4o costs roughly $20-$50. Low-quality captions (e.g., “image with numbers”) provide no retrieval benefit.
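The cost figure above scales linearly with the number of images, so a rough budget can be sketched in a few lines. The per-image rates ($0.02 to $0.05) are simply the article's $20-$50 per 1 000 images divided out; they are an assumption for illustration, not published pricing, and the function name is hypothetical.

```typescript
// Back-of-the-envelope captioning budget.
// Default rates are derived from the article's "$20-$50 per 1 000 images" estimate.
type CostRange = { lowUsd: number; highUsd: number };

function estimateCaptionCost(
  imageCount: number,
  perImageLowUsd: number = 0.02,
  perImageHighUsd: number = 0.05
): CostRange {
  if (!Number.isInteger(imageCount) || imageCount < 0) {
    throw new Error("imageCount must be a non-negative integer");
  }
  return {
    lowUsd: imageCount * perImageLowUsd,
    highUsd: imageCount * perImageHighUsd,
  };
}

const budget = estimateCaptionCost(1000);
console.log(`Estimated cost: $${budget.lowUsd.toFixed(2)} to $${budget.highUsd.toFixed(2)}`);
```

Replace the default rates with figures measured on a small sample of your own images, since actual cost depends on image resolution and caption length.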

Solution 3 – ColPali visual retrieval

Both previous approaches require extracting images from PDFs, which fails on multi-column layouts, scanned PDFs, or complex tables. ColPali (a PaliGemma vision-language model fine-tuned to emit multi-vector page embeddings) skips extraction: each PDF page is rendered as an image and indexed with a late-interaction mechanism similar to ColBERT, comparing each query token with the page's visual patches.

Traditional pipeline:

PDF → OCR/parse → text chunks → vectorise → retrieve

ColPali pipeline:

PDF → render page image → visual transformer → multi‑vector index → retrieve

This preserves layout and removes OCR, and with it the 5-15 % OCR error rate, from the pipeline. The late-interaction design yields higher accuracy than CLIP’s global vector.
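The late-interaction idea can be sketched as ColBERT-style MaxSim scoring: for each query-token vector, take its best match among the page's patch vectors, then sum those maxima. The 2-dimensional vectors below are toy values chosen for illustration; a real ColPali model produces many higher-dimensional vectors per page.

```typescript
// Minimal sketch of late-interaction (MaxSim) scoring in the style of ColBERT.
// Toy 2-dimensional vectors stand in for real query-token and page-patch embeddings.
type Vec = number[];

function dot(a: Vec, b: Vec): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += (a[i] ?? 0) * (b[i] ?? 0);
  }
  return sum;
}

// For every query token keep its best-matching patch, then add the maxima.
function maxSimScore(queryTokens: Vec[], pagePatches: Vec[]): number {
  if (pagePatches.length === 0) return 0;
  let total = 0;
  for (const token of queryTokens) {
    let best = -Infinity;
    for (const patch of pagePatches) {
      best = Math.max(best, dot(token, patch));
    }
    total += best;
  }
  return total;
}

const queryTokens: Vec[] = [[1, 0], [0, 1]];

// Page A has one patch that matches each query token closely.
const pageA: Vec[] = [[1, 0], [0, 1]];
// Page B has only "blurry" patches that match both tokens moderately.
const pageB: Vec[] = [[0.7, 0.7], [0.6, 0.6]];

const scoreA = maxSimScore(queryTokens, pageA);
const scoreB = maxSimScore(queryTokens, pageB);
console.log({ scoreA, scoreB, best: scoreA > scoreB ? "pageA" : "pageB" });
```

Page A wins because each query token finds its own strongly matching patch, which is exactly the fine-grained signal that a single pooled vector per page cannot express.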

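import { ChatOpenAI } from "@langchain/openai";
import { fromPath } from "pdf2pic";

// The /index and /search endpoints belong to a locally hosted ColPali service
// (GPU ≥ 16 GB VRAM required); adapt the URL and payload shapes to your own deployment.
const COLPALI_URL = "http://localhost:8000";

// Render the PDF pages and index them with the ColPali service
async function indexPdfWithColPali(pdfPath: string): Promise<string[]> {
  // Render every page to a base64 PNG (assumes the pdf2pic v3 API; -1 means "all pages")
  const converter = fromPath(pdfPath, { density: 150, format: "png" });
  const pages = await converter.bulk(-1, { responseType: "base64" });
  const pageImages = pages.map(p => p.base64 ?? "").filter(b => b.length > 0);

  const resp = await fetch(`${COLPALI_URL}/index`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ images: pageImages, collection_id: "my_docs" })
  });
  if (!resp.ok) throw new Error(`ColPali indexing failed: ${resp.status}`);
  const { doc_ids } = (await resp.json()) as { doc_ids: string[] };
  return doc_ids;
}

// Retrieve the best-matching page images and let a VLM answer from them
async function queryWithColPali(query: string) {
  const resp = await fetch(`${COLPALI_URL}/search`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, collection_id: "my_docs", top_k: 3 })
  });
  if (!resp.ok) throw new Error(`ColPali search failed: ${resp.status}`);
  const { results } = (await resp.json()) as { results: string[] }; // array of base64 page images
  const visionModel = new ChatOpenAI({ model: "gpt-4o" });
  const answer = await visionModel.invoke([
    { role: "user", content: [
      { type: "text", text: query },
      ...results.map(img => ({ type: "image_url", image_url: { url: `data:image/png;base64,${img}` } }))
    ] }
  ]);
  return answer.content;
}

await indexPdfWithColPali("./technical_report.pdf");
const answer = await queryWithColPali("What core modules are shown in the architecture diagram of chapter 3?");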

Practical limits: Running PaliGemma needs a GPU with at least 16 GB VRAM. Vector databases such as Vespa and Milvus are adding native ColPali support, which will lower deployment barriers.

Special handling for tables – preserve structural information

Plain‑text extraction flattens rows and columns, e.g.:

Q1 Q2 Q3 Q4 / Revenue 120 145 178 203 / Growth - 20.8% 22.8% 14.0%

Two recommended strategies:

Strategy 1 – use unstructured to extract HTML tables (preferred):

import { UnstructuredLoader } from "@langchain/community/document_loaders/fs/unstructured";

const loader = new UnstructuredLoader("./financial_report.pdf", {
  strategy: "hi_res",
  extractImageBlockTypes: ["Table"],
  pdfInferTableStructure: true, // ask Unstructured to return each table as HTML
});
const docs = await loader.load();
// LangChain.js copies Unstructured's element type into metadata.category;
// the HTML rendering of each table is expected in metadata.text_as_html.
// Verify both field names against the loader version you run.
const tables = docs.filter(d => d.metadata.category === "Table");
// Store tables separately with metadata (source_page, table_index) for filtered retrieval

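Strategy 2 – send scanned-image tables directly to a VLM:

import { ChatOpenAI } from "@langchain/openai";
import * as fs from "fs";

// Ask a VLM to transcribe a table screenshot into Markdown
async function extractTableFromImage(imagePath: string): Promise<string> {
  const visionModel = new ChatOpenAI({ model: "gpt-4o" });
  const imageBase64 = fs.readFileSync(imagePath, "base64");
  const response = await visionModel.invoke([
    { role: "user", content: [
      { type: "text", text: "Output this table in Markdown, preserving all rows, columns, and values." },
      { type: "image_url", image_url: { url: `data:image/png;base64,${imageBase64}` } }
    ] }
  ]);
  // `content` may be a string or an array of parts; normalise it to a string
  return typeof response.content === "string" ? response.content : JSON.stringify(response.content);
}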

Key principle: never mix tables with regular text in the same chunk; store tables separately with their own metadata to improve retrieval precision.
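The separation principle can be sketched as a small routing step that runs after parsing and before chunking. The `ParsedElement` shape below is a hypothetical simplification of a parser's output (category, text, optional HTML, page number), not the exact type returned by any library.

```typescript
// Sketch: keep tables out of regular text chunks and give them their own metadata.
// `ParsedElement` is a hypothetical, simplified stand-in for a parser's output.
type ParsedElement = { category: string; text: string; html?: string; page: number };

type Chunk = {
  content: string;
  metadata: { type: "text" | "table"; source_page: number; table_index?: number };
};

function separateTables(elements: ParsedElement[]): { textChunks: Chunk[]; tableChunks: Chunk[] } {
  const textChunks: Chunk[] = [];
  const tableChunks: Chunk[] = [];
  for (const el of elements) {
    if (el.category === "Table") {
      tableChunks.push({
        // Prefer the HTML rendering because it keeps rows and columns intact
        content: el.html ?? el.text,
        metadata: { type: "table", source_page: el.page, table_index: tableChunks.length },
      });
    } else {
      textChunks.push({ content: el.text, metadata: { type: "text", source_page: el.page } });
    }
  }
  return { textChunks, tableChunks };
}

const parsed: ParsedElement[] = [
  { category: "NarrativeText", text: "Revenue grew steadily through the year.", page: 4 },
  {
    category: "Table",
    text: "Q1 Q2 Q3 Q4 Revenue 120 145 178 203",
    html: "<table><tr><th>Q1</th><th>Q2</th><th>Q3</th><th>Q4</th></tr><tr><td>120</td><td>145</td><td>178</td><td>203</td></tr></table>",
    page: 4,
  },
];

const { textChunks, tableChunks } = separateTables(parsed);
console.log(textChunks.length, tableChunks.length);
```

With `type`, `source_page`, and `table_index` in the metadata, a retriever can filter for tables only when the question asks for specific figures.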

Full multimodal RAG pipeline – unified handling of text, tables, and images

The three solutions can be combined into a production‑grade pipeline. Summaries (captions or extracted text) are indexed in a vector store for high‑quality retrieval; raw images and raw table HTML are kept in a separate document store for answer generation.

import { UnstructuredLoader } from "@langchain/community/document_loaders/fs/unstructured";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { InMemoryStore } from "@langchain/core/stores";
import { Document } from "@langchain/core/documents";
import { v4 as uuidv4 } from "uuid";
import * as fs from "fs";
import * as path from "path";

type Pair = { summary: string; raw: string; type: "image" | "table" | "text" };

async function buildMultimodalKB(pdfPath: string): Promise<MultiVectorRetriever> {
  const visionModel = new ChatOpenAI({ model: "gpt-4o" });
  const embeddings = new OpenAIEmbeddings();

  // Step 1: extract all content types with unstructured
  const loader = new UnstructuredLoader(pdfPath, {
    strategy: "hi_res",
    extractImageBlockTypes: ["Image", "Table"],
    pdfInferTableStructure: true,
  });
  const rawDocs = await loader.load();

  // Step 2: process each doc into (summary, raw) pairs.
  // The metadata fields used here (category, image_base64, image_mime_type, text_as_html)
  // follow Unstructured's API output; verify them against the version you run.
  const imageDir = "./extracted_images";
  fs.mkdirSync(imageDir, { recursive: true });
  const pairs: Pair[] = [];
  for (const doc of rawDocs) {
    const category = doc.metadata.category;
    if (category === "Image" && doc.metadata.image_base64) {
      const imageBase64: string = doc.metadata.image_base64;
      const mimeType: string = doc.metadata.image_mime_type ?? "image/jpeg";
      // Persist the image so that only its path, not the base64 payload, is stored
      const imagePath = path.join(imageDir, `${uuidv4()}.${mimeType === "image/png" ? "png" : "jpg"}`);
      fs.writeFileSync(imagePath, Buffer.from(imageBase64, "base64"));
      const captionResp = await visionModel.invoke([
        { role: "user", content: [
          { type: "text", text: "Provide a detailed description of this image (type, main content, key numbers or conclusions) in 100-200 words." },
          { type: "image_url", image_url: { url: `data:${mimeType};base64,${imageBase64}` } }
        ] }
      ]);
      const caption = typeof captionResp.content === "string" ? captionResp.content : JSON.stringify(captionResp.content);
      pairs.push({ summary: caption, raw: imagePath, type: "image" });
    } else if (category === "Table") {
      // Index the table text, keep the structured HTML as the raw content
      pairs.push({ summary: doc.pageContent, raw: doc.metadata.text_as_html ?? doc.pageContent, type: "table" });
    } else {
      pairs.push({ summary: doc.pageContent, raw: doc.pageContent, type: "text" });
    }
  }

  // Step 3: index summaries into vector store, raw content into doc store
  const vectorStore = new Chroma(embeddings, { collectionName: "multimodal_kb" });
  const byteStore = new InMemoryStore<Uint8Array>();
  const retriever = new MultiVectorRetriever({ vectorstore: vectorStore, byteStore, idKey: "doc_id" });

  const ids = pairs.map(() => uuidv4());
  const summaryDocs = pairs.map((p, i) => new Document({ pageContent: p.summary, metadata: { doc_id: ids[i], type: p.type } }));
  await retriever.vectorstore.addDocuments(summaryDocs);
  // Parents are stored as Documents: pageContent holds the raw content (or the image path)
  const parentDocs: [string, Document][] = pairs.map((p, i) => [
    ids[i],
    new Document({ pageContent: p.raw, metadata: { doc_id: ids[i], type: p.type } }),
  ]);
  await retriever.docstore.mset(parentDocs);

  return retriever;
}

At query time, the retriever returns the raw content for each hit: image paths are loaded and sent to the VLM, while raw text and table HTML are used directly as context.
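That query-time routing can be sketched as a pure function that turns retrieved items into one multimodal message. The `Retrieved` shape and the injected `loadImageBase64` loader are assumptions for illustration; the content-part format follows the `text` / `image_url` parts used in the examples above.

```typescript
// Sketch: route retrieved items by type into a single multimodal prompt.
// `Retrieved` and the injected image loader are hypothetical helpers for illustration.
type Retrieved = { type: "image" | "table" | "text"; raw: string };

type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

function buildAnswerContent(
  question: string,
  retrieved: Retrieved[],
  loadImageBase64: (imagePath: string) => string
): ContentPart[] {
  // Text and tables become plain context, labelled by type
  const context = retrieved
    .filter(item => item.type !== "image")
    .map(item => `[${item.type}]\n${item.raw}`)
    .join("\n\n");

  const parts: ContentPart[] = [
    { type: "text", text: `Context:\n${context}\n\nQuestion: ${question}` },
  ];

  // Images are loaded from their stored paths and attached as image parts
  for (const item of retrieved) {
    if (item.type === "image") {
      parts.push({
        type: "image_url",
        image_url: { url: `data:image/png;base64,${loadImageBase64(item.raw)}` },
      });
    }
  }
  return parts;
}

// Example with a fake loader; in practice pass (p) => fs.readFileSync(p, "base64")
const exampleParts = buildAnswerContent(
  "How did Q3 revenue develop?",
  [
    { type: "table", raw: "<table><tr><td>Q3</td><td>178</td></tr></table>" },
    { type: "image", raw: "./docs/q3_revenue.png" },
  ],
  () => "QUJD"
);
console.log(exampleParts.length);
```

The resulting array can be passed as the `content` of a single user message, so the VLM sees the textual context and the original images together.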

Common pitfalls that can break a multimodal RAG deployment

Pitfall 1: Storing base64‑encoded images in vector‑store metadata creates huge payloads (500 KB‑2 MB per image) and degrades performance. Store only file paths or URLs.

Pitfall 2: Low‑quality captions (e.g., "image with numbers") provide no retrieval benefit. Captions must include type, specific values, trend description, and conclusions.

Pitfall 3: Over‑compressing images (e.g., 200×200) removes critical details. A balanced size such as 1024×768 preserves readability while controlling token usage.

Pitfall 4: Treating scanned PDFs as plain text yields empty documents. Run OCR in high‑precision mode or use ColPali.

Pitfall 5: Using a single retrieval strategy for all three content types. Tables benefit from exact BM25, captions from semantic vector search, and long text from hybrid methods.

Pitfall 6: Generating captions without rate‑limit handling causes API throttling. Batch generation (10‑20 images per batch) with retries is essential.
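For Pitfall 6, a minimal sketch of batched caption generation with retries might look like the following. The function names, the default batch size of 10, and the exponential backoff schedule are illustrative assumptions; `caption` stands for whatever function calls your VLM.

```typescript
// Sketch: caption images in fixed-size batches, retrying each failed call with backoff.
// Names, defaults, and the backoff schedule are illustrative assumptions.
function sleep(ms: number): Promise<void> {
  return new Promise<void>(resolve => setTimeout(resolve, ms));
}

async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Exponential backoff: baseDelayMs, 2x, 4x, ...
        await sleep(baseDelayMs * Math.pow(2, attempt - 1));
      }
    }
  }
  throw lastError;
}

async function captionInBatches(
  imagePaths: string[],
  caption: (imagePath: string) => Promise<string>,
  batchSize: number = 10,
  maxAttempts: number = 3,
  baseDelayMs: number = 500
): Promise<string[]> {
  const captions: string[] = [];
  for (let start = 0; start < imagePaths.length; start += batchSize) {
    const batch = imagePaths.slice(start, start + batchSize);
    // Calls inside a batch run concurrently; batches run one after another
    const results = await Promise.all(
      batch.map(p => withRetry(() => caption(p), maxAttempts, baseDelayMs))
    );
    for (const r of results) captions.push(r);
  }
  return captions;
}
```

Captions come back in the same order as the input paths. If the provider returns a retry-after hint, honouring it is usually better than a fixed backoff schedule.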

Summary of trade‑offs

CLIP embedding: simple, low cost, ~60 % accuracy, suitable for image similarity but not fine-grained Q&A.

Image-to-text caption + MultiVectorRetriever: high accuracy (~90 %), text-based retrieval with original image for answering; higher ingestion cost.

ColPali visual retrieval: no OCR step and therefore no OCR errors, highest accuracy on complex layouts; requires GPU deployment (≥ 16 GB VRAM).

Table handling: extract HTML tables with unstructured or send screenshots to a VLM; never mix tables with regular text chunks.
