How to Extract and Embed Tables and Images from PDFs for Multimodal RAG
This article walks through a practical approach to parsing PDFs that contain text, tables, and images: the open-source Unstructured library extracts each modality, the LLaVA model captions images, and a multi-vector store links captions back to their source content, enabling accurate semantic search in private-knowledge RAG pipelines. LangChain integration is optional.
Background
OpenAI recently released GPT‑4V, a multimodal version of GPT‑4 that understands both text and images. When building LLM applications that rely on private knowledge bases (RAG), we often need to process PDFs that contain not only plain text but also tables and images. Traditional text‑only pipelines cannot handle these modalities effectively.
Typical Private‑Knowledge RAG Flow
In a standard RAG setup, documents are split into chunks, embedded, and stored in a vector database; at query time, semantically similar chunks are retrieved and supplied to the LLM as context.
[Diagram: typical private-knowledge RAG flow]
Problem Statement
When a PDF contains tables or images, naïve splitting either loses important information (tables) or provides no semantic representation for images. We need a pipeline that can extract, caption, embed, and retrieve these multimodal elements.
Proposed Solution Overview
Use the open‑source Unstructured library to parse PDFs and extract three modality streams: plain text, tables, and images.
Generate semantic captions for extracted images using the open-source multimodal model LLaVA, turning visual content into searchable text.
Store all modalities in a multi‑vector store that links raw images, their captions, and associated text/table data, enabling both precise semantic search and context completeness.
Step 1 – Unstructured Document Parsing
The unstructured library supports PDF, Office, HTML, and other formats. For PDFs it requires poppler-utils for page rendering and tesseract-ocr for OCR. Parameters such as the file name, maximum text-block size, a table-detection flag, image-handling options, and output directories control the extraction process. The library returns an array of elements, each with a type (e.g., text, table, image) and corresponding content; images are saved as separate files, while text and tables are kept in the array, as sketched below.
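A minimal parsing sketch using partition_pdf. The file name and output directory are placeholders, and exact option names (notably the image output directory) vary across unstructured versions, so check the version you have installed:

from unstructured.partition.pdf import partition_pdf

# Parse the PDF into typed elements (text blocks, tables, images).
raw_pdf_elements = partition_pdf(
    filename="report.pdf",              # placeholder path to the source PDF
    extract_images_in_pdf=True,         # write embedded images to disk
    infer_table_structure=True,         # detect tables and keep their structure
    chunking_strategy="by_title",       # group text blocks under section titles
    max_characters=4000,                # maximum size of a text chunk
    image_output_dir_path="./figures",  # where extracted images are saved
)

# Separate the element stream by modality for downstream processing.
tables = [str(e) for e in raw_pdf_elements if "Table" in str(type(e))]
texts = [str(e) for e in raw_pdf_elements if "CompositeElement" in str(type(e))]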
Step 2 – Image Captioning and Embedding
Each extracted image is fed to LLaVA (or a similar multimodal LLM) to produce a detailed textual summary, as in the sketch below. The caption can then be embedded with any text embedding model (e.g., OpenAI embeddings) and stored in a vector database. The original image file is kept in a memory store, and its unique ID is recorded in the vector metadata to enable later retrieval.
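One way to run LLaVA locally is through Ollama. This sketch assumes an Ollama server with the llava model pulled, that extraction wrote JPEGs to ./figures, and an illustrative prompt; adjust all three for your setup:

from pathlib import Path

import ollama  # assumes the ollama Python client and a local Ollama server

def caption_image(image_path: Path) -> str:
    # Ask LLaVA for a retrieval-friendly description of one image.
    response = ollama.generate(
        model="llava",
        prompt="Describe this image in detail so it can be found by semantic search.",
        images=[image_path.read_bytes()],
    )
    return response["response"]

# Caption every extracted image; the file extension depends on how
# extraction saved them.
cleaned_img_summary = [caption_image(p) for p in sorted(Path("./figures").glob("*.jpg"))]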
Step 3 – Retrieval Strategies
Two possible query‑time workflows are described:
Method 1 (text‑only): Perform semantic search on image captions; if a relevant caption is found, pass the caption text to a standard LLM for answer generation.
Method 2 (multimodal): Retrieve the caption, use its stored image ID to fetch the original image, and feed both the question and the raw image to a multimodal LLM (e.g., LLaVA or GPT‑4V) for richer responses.
Both methods rely on the same multi-vector store, but Method 2 adds a step: loading the image and invoking a multimodal model. Both flows are sketched below.
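A sketch of the two query-time flows, assuming the MultiVectorRetriever built in the implementation section below; answer_with_text and answer_with_image are hypothetical stand-ins for the actual LLM calls:

question = "What does the architecture diagram show?"

# Both methods start the same way: semantic search over the image captions.
docs = retriever.vectorstore.similarity_search(question, k=1)
caption = docs[0].page_content
doc_id = docs[0].metadata["doc_id"]

# Method 1 (text-only): answer from the caption with a standard text LLM.
answer = answer_with_text(question, context=caption)  # hypothetical helper

# Method 2 (multimodal): resolve the ID back to the raw image and pass both
# the question and the image to a multimodal LLM (e.g., LLaVA or GPT-4V).
raw_image = retriever.docstore.mget([doc_id])[0]
answer = answer_with_image(question, image=raw_image)  # hypothetical helper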
Multi‑Vector Storage Concept
The approach stores “large chunks” (e.g., whole images or full text sections) for context completeness and “small chunks” (e.g., captions or table excerpts) for precise semantic matching. During retrieval, the vector store returns the small chunks; their metadata links back to the large chunks, which are then supplied to the LLM.
Implementation Example (LangChain)
import uuid

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

# vectorstore: stores image captions (the small chunks) for semantic search
vectorstore = Chroma(
    collection_name="summaries",
    embedding_function=OpenAIEmbeddings(),
)

# docstore: keeps original images (or large text blocks)
store = InMemoryStore()
id_key = "doc_id"

# Retriever that searches captions, then resolves hits to the docstore
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

# Assume cleaned_img_summary is a list of image captions
img_ids = [str(uuid.uuid4()) for _ in cleaned_img_summary]
summary_img = [
    Document(page_content=s, metadata={id_key: img_ids[i]})
    for i, s in enumerate(cleaned_img_summary)
]

# Add image captions to the vector store
retriever.vectorstore.add_documents(summary_img)

# Associate each ID with its raw image in the memory store; mset expects
# (key, value) pairs, so pair every ID with its payload (placeholder here)
retriever.docstore.mset(list(zip(img_ids, ["### image ###"] * len(img_ids))))

# Later, retrieve relevant documents for a user question
retriever.get_relevant_documents("### question ###")

Conclusion
The described pipeline demonstrates how to handle multimodal PDF content (text, tables, and images) by leveraging open-source tools (Unstructured, LLaVA) and a multi-vector retrieval architecture. While the example uses LangChain components for convenience, the same principles apply to custom implementations, enabling robust private-knowledge RAG systems that can fully exploit visual information once multimodal LLM APIs (e.g., GPT‑4V) become publicly available.