Advanced RAG with Semi‑Structured Data Using LangChain, Unstructured, and ChromaDB
This tutorial demonstrates how to build an advanced Retrieval‑Augmented Generation (RAG) system for semi‑structured PDF data by leveraging LangChain, the unstructured library, ChromaDB vector store, and OpenAI models, covering installation, PDF partitioning, element classification, summarization, and query execution.
Preface
RAG (Retrieval‑Augmented Generation) is a natural‑language‑processing technique that combines retrieval (vector databases) with generative AI models to improve information‑retrieval quality.
Naive RAG
Naive RAG refers to the most basic retrieve‑and‑generate pipeline, which includes document chunking, embedding, and semantic similarity search based on user queries. While simple, its performance and quality are limited, motivating the move to Advanced RAG.
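The chunk-embed-retrieve loop of naive RAG can be sketched without any libraries. The bag-of-words `embed` function and the sample chunks below are invented purely for illustration; a real system uses a learned embedding model and an approximate-nearest-neighbor index:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline uses a learned model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Chunk the documents (each string stands in for one chunk).
chunks = [
    "Nvidia reported a change in beneficial ownership.",
    "The statement lists shares acquired and disposed.",
    "Unrelated text about something else entirely.",
]

# 2. Embed every chunk up front and keep the vectors in an index.
index = [(chunk, embed(chunk)) for chunk in chunks]

# 3. At query time, embed the query and take the most similar chunk.
query = "how many shares were disposed"
best_chunk = max(index, key=lambda pair: cosine(embed(query), pair[1]))[0]

# 4. A generative model would now answer the query using best_chunk as context.
print(best_chunk)
```

The limits of this approach are already visible here: retrieval quality depends entirely on how well a chunk's surface form matches the query, which is exactly what the advanced techniques below improve on.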
Semi‑Structured Data
Semi‑structured data lies between structured and unstructured data, mixing tabular formats with free‑form text, images, or other media. Examples include PDF statements that contain text, tables, and figures. Handling such data requires both SQL‑like processing for the structured parts and embedding‑based retrieval for the unstructured parts.
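As a dependency-free illustration of that split, a single record can carry both kinds of data; the tabular part answers to filters and aggregates while the free text needs search. All values below are made up for the sketch and do not come from the actual Nvidia statement:

```python
# A semi-structured record: a structured table plus free-form notes.
record = {
    "table": [
        {"date": "2023-09-01", "shares": 1200, "action": "disposed"},
        {"date": "2023-09-02", "shares": 300, "action": "acquired"},
    ],
    "notes": "Shares were sold under a pre-arranged trading plan.",
}

# Structured part: answer with a filter and an aggregate, as SQL would.
disposed_total = sum(
    row["shares"] for row in record["table"] if row["action"] == "disposed"
)

# Unstructured part: fall back to text matching (embeddings, in a real system).
notes_mentions_plan = "trading plan" in record["notes"].lower()

print(disposed_total, notes_mentions_plan)  # 1200 True
```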
The demo uses the unstructured package to create custom pipelines for processing these elements, LangChain to orchestrate the RAG workflow, and ChromaDB as the vector store.
Nvidia Equity Change Statement
The example PDF is an Nvidia equity‑change declaration, chosen for its compact size and mix of structured tables and unstructured text.
Practical Steps
Install the required Python packages (the extras specifier is quoted so the shell does not expand the brackets):

```shell
!pip install langchain "unstructured[all-docs]" pydantic lxml openai chromadb tiktoken -q -U
```
Download the PDF and save it as statement_of_changes.pdf (note the uppercase `-O`; lowercase `-o` would redirect wget's log instead of naming the output file):

```shell
!wget -O statement_of_changes.pdf https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/381953f9-934e-4cc8-b099-144910676bad.pdf
```
Install the system utilities needed for PDF extraction and OCR:

```shell
!apt-get install poppler-utils tesseract-ocr
```
Set the OpenAI API key:

```python
import os

os.environ["OPENAI_API_KEY"] = ""
```
Partition the PDF into elements using unstructured's partition_pdf, with parameters that infer table structure and chunk the text by title:

```python
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="statement_of_changes.pdf",
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3000,
    combine_text_under_n_chars=2000,
    image_output_dir_path=".",
)
```
Count the element categories to understand the composition of the document:

```python
category_counts = {}
for element in raw_pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

unique_categories = set(category_counts.keys())
category_counts
```
Separate table and text elements into distinct lists (note the fully qualified class names, which are easy to misspell):

```python
class Element(BaseModel):
    type: str
    text: Any

table_elements = []
text_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        table_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        text_elements.append(Element(type="text", text=str(element)))

print(len(table_elements))
print(len(text_elements))
```
Summarize each element using a LangChain chain (StrOutputParser must be instantiated with parentheses, otherwise the pipe operator receives the class instead of a parser):

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

prompt_text = """You are responsible for concisely summarizing a table or text chunk.

{element}"""
prompt = ChatPromptTemplate.from_template(prompt_text)
model = ChatOpenAI(temperature=0, model="gpt-4")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

# Summarize tables
tables = [i.text for i in table_elements]
table_summarizes = summarize_chain.batch(tables, {"max_concurrency": 5})

# Summarize texts
texts = [i.text for i in text_elements]
text_summarizes = summarize_chain.batch(texts, {"max_concurrency": 5})
```
Build a MultiVectorRetriever that links the summaries (stored as vectors) to the original documents via a shared ID:

```python
import uuid

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

# The vector store indexes the summaries; the docstore holds the originals.
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key=id_key)

# Text documents
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summarizes)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# Table documents
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summarizes)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))
```
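The core idea of this retriever reduces to two stores joined by a shared ID. The dependency-free sketch below (toy strings standing in for embeddings and summaries) shows why searching over compact summaries can still return the full originals:

```python
import uuid

# Two stores joined by doc_id: searchable summaries and a plain docstore.
originals = {}
summary_index = []  # (summary, doc_id) pairs standing in for embedded summaries

docs = [
    ("full table with every row of the equity statement", "summary: equity table"),
    ("long narrative section of the filing", "summary: narrative text"),
]
for original, summary in docs:
    doc_id = str(uuid.uuid4())
    originals[doc_id] = original          # docstore keeps the original
    summary_index.append((summary, doc_id))  # vector store indexes the summary

def retrieve(query: str) -> str:
    """Match against the compact summary, but return the linked original."""
    for summary, doc_id in summary_index:
        if query in summary:
            return originals[doc_id]
    return ""

print(retrieve("equity"))  # prints the full original, not the summary
```

This is the same trade-off MultiVectorRetriever makes: small, focused summaries give better similarity matches, while the LLM still receives the complete table or text chunk as context.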
Create the final chain that takes a user question, retrieves the relevant context, and generates an answer:

```python
from langchain.schema.runnable import RunnablePassthrough

template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(temperature=0, model="gpt-4")

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
```
Execute a sample query:

```python
chain.invoke("How many stocks were disposed? Who is the beneficial owner?")
```
Summary
MultiVectorRetriever for linking summaries with original documents.
Unstructured library for parsing semi‑structured PDFs.
ChromaDB and InMemoryStore for vector storage and document retrieval.