Advanced RAG with Semi‑Structured Data Using LangChain, Unstructured, and ChromaDB
This tutorial demonstrates how to build an advanced Retrieval‑Augmented Generation (RAG) system for semi‑structured PDF data by leveraging LangChain, the unstructured library, ChromaDB vector store, and OpenAI models, covering installation, PDF partitioning, element classification, summarization, and query execution.
Preface
RAG (Retrieval‑Augmented Generation) is a natural‑language‑processing technique that combines retrieval (typically from a vector database) with a generative AI model, so that generated answers are grounded in the retrieved documents.
Naive RAG
Naive RAG refers to the most basic retrieve‑and‑generate pipeline, which includes document chunking, embedding, and semantic similarity search based on user queries. While simple, its performance and quality are limited, motivating the move to Advanced RAG.
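The chunk, embed, and search steps of a naive pipeline can be sketched with the standard library alone. This is a minimal illustration, not the tutorial's implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the sample text is made up, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 8) -> list[str]:
    """Split text into fixed-size word windows (the simplest chunking strategy)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Made-up sample text, used only to exercise the pipeline.
document = (
    "The reporting person disposed of common stock during the period. "
    "The weather in Santa Clara was sunny for most of the week."
)
question = "Who disposed of common stock?"
chunks = chunk(document)
top = retrieve(question, chunks)

# The retrieved chunk is pasted into the prompt that would be sent to the generative model.
prompt = f"Answer using only this context:\n{top[0]}\n\nQuestion: {question}"
```

Fixed-size chunking like this ignores document structure, so a table can be cut in half or mixed with unrelated prose. That is one of the limits that motivates the structure-aware approach in the rest of the article.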
Semi‑Structured Data
Semi‑structured data lies between structured and unstructured data, mixing tabular formats with free‑form text, images, or other media. Examples include PDF statements that contain text, tables, and figures. Handling such data requires both SQL‑like processing for the structured parts and embedding‑based retrieval for the unstructured parts.
The demo uses the unstructured package to create custom pipelines for processing these elements, LangChain to orchestrate the RAG workflow, and ChromaDB as the vector store.
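To illustrate the split described above, the sketch below routes hypothetical parsed elements by type: table rows go into an in-memory SQLite database for SQL-style queries, while text passages are searched separately. The element dictionaries, table schema, and numbers are invented for illustration, and the keyword match is only a stand-in for embedding-based retrieval.

```python
import sqlite3

# Hypothetical parsed elements: one table (as row tuples) and two text passages.
elements = [
    {"type": "table", "rows": [("common stock", "disposed", 100), ("common stock", "acquired", 40)]},
    {"type": "text", "content": "The reporting person is a director of the issuer."},
    {"type": "text", "content": "Transactions were made under a trading plan."},
]

sample_tables = [e for e in elements if e["type"] == "table"]
sample_texts = [e["content"] for e in elements if e["type"] == "text"]

# Structured part: load the table rows into an in-memory SQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (security TEXT, action TEXT, shares INTEGER)")
for table in sample_tables:
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", table["rows"])

disposed = conn.execute(
    "SELECT SUM(shares) FROM transactions WHERE action = 'disposed'"
).fetchone()[0]

# Unstructured part: a keyword match stands in for embedding-based retrieval.
matches = [t for t in sample_texts if "director" in t.lower()]
```

The walkthrough below takes a different route for the structured part: instead of loading tables into SQL, it summarizes each table with an LLM and retrieves over the summaries.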
Nvidia Equity Change Statement
The example PDF is an Nvidia equity‑change declaration, chosen for its compact size and mix of structured tables and unstructured text.
Practical Steps
Install the required Python packages:
!pip install langchain "unstructured[all-docs]" pydantic lxml openai chromadb tiktoken -q -U
Download the PDF and save it as statement_of_changes.pdf (note the uppercase -O, which sets the output file; lowercase -o would write wget's log there instead):
!wget -O statement_of_changes.pdf https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/381953f9-934e-4cc8-b099-144910676bad.pdf
Install the system utilities for PDF extraction and OCR (poppler‑utils, tesseract‑ocr):
!apt-get install -y poppler-utils tesseract-ocr
Set the OpenAI API key:
import os
os.environ["OPENAI_API_KEY"] = ""  # paste your own key here
Partition the PDF into elements using unstructured.partition_pdf with parameters that infer table structure and chunk by title.
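from typing import Any
from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="statement_of_changes.pdf",
    extract_images_in_pdf=False,      # do not extract embedded images
    infer_table_structure=True,       # detect tables and keep their structure
    chunking_strategy="by_title",     # group text into chunks under each title
    max_characters=4000,              # hard limit on chunk length
    new_after_n_chars=3000,           # soft limit: start a new chunk after this length
    combine_text_under_n_chars=2000,  # merge short text sections up to this length
    image_output_dir_path=".",
)
Count element categories to understand the composition of the document.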
category_counts = {}
for element in raw_pdf_elements:
    category = str(type(element))
    category_counts[category] = category_counts.get(category, 0) + 1

unique_categories = set(category_counts.keys())
category_counts
Separate table and text elements into distinct lists.
class Element(BaseModel):
    type: str
    text: Any

table_elements = []
text_elements = []
for element in raw_pdf_elements:
    # Match on the class path: Table holds tables, CompositeElement holds chunked text
    if "unstructured.documents.elements.Table" in str(type(element)):
        table_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        text_elements.append(Element(type="text", text=str(element)))

print(len(table_elements))
print(len(text_elements))
Summarize each element using a LangChain chain.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

prompt_text = """
You are responsible for concisely summarizing a table or text chunk.
{element}
"""
prompt = ChatPromptTemplate.from_template(prompt_text)
model = ChatOpenAI(temperature=0, model="gpt-4")
# StrOutputParser must be instantiated before it is piped into the chain
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

# Summarize tables
tables = [i.text for i in table_elements]
table_summarizes = summarize_chain.batch(tables, {"max_concurrency": 5})

# Summarize texts
texts = [i.text for i in text_elements]
text_summarizes = summarize_chain.batch(texts, {"max_concurrency": 5})
Build a MultiVectorRetriever that indexes the summaries in the vector store and links each one to its original chunk in the document store via a shared ID.
import uuid
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema.document import Document
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma
from langchain.retrievers import MultiVectorRetriever

vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
id_key = "doc_id"
retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key=id_key)

# Text documents
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summarizes)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# Table documents
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summarizes)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))
Create the final chain that takes a user question, retrieves relevant context, and generates an answer.
from langchain.schema.runnable import RunnablePassthrough

template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(temperature=0, model="gpt-4")
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
Execute a sample query.
chain.invoke("How many stocks were disposed? Who is the beneficial owner?")
Summary
MultiVectorRetriever for linking summaries with original documents.
Unstructured library for parsing semi‑structured PDFs.
ChromaDB and InMemoryStore for vector storage and document retrieval.
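To make the first point concrete, here is a library-free sketch of the linking idea: search runs over the summaries, but the original chunks are what get returned. Plain Python structures stand in for the vector store and document store, the word-overlap score is a toy substitute for vector similarity, and the sample chunks and summaries are invented. It shows the pattern only, not how LangChain implements it.

```python
import uuid

# Invented original chunks and hand-written summaries, for illustration only.
originals = [
    "security=common stock; action=disposed; shares=100",
    "Footnote: the shares are held by a family trust.",
]
summaries = [
    "Table of stock transactions with disposed share counts.",
    "Note explaining that a trust holds the shares.",
]

id_key = "doc_id"
doc_ids = [str(uuid.uuid4()) for _ in originals]

# Stand-in for the vector store: searchable summaries tagged with the ID of their original.
summary_index = [
    {"page_content": s, "metadata": {id_key: doc_ids[i]}} for i, s in enumerate(summaries)
]
# Stand-in for the docstore: ID -> full original chunk.
docstore = dict(zip(doc_ids, originals))


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank the summaries, then return the linked originals instead of the summaries."""
    hits = sorted(summary_index, key=lambda d: overlap(query, d["page_content"]), reverse=True)[:k]
    return [docstore[d["metadata"][id_key]] for d in hits]


result = retrieve("how many disposed share counts")
```

Searching over short summaries while handing the full chunk to the model keeps the embedded text focused without discarding detail from the original table or passage.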
References
"RAG with Semi‑Structured Data" (Episode 01 of the series).
Nvidia equity‑change statement PDF.
Source code repository.
Rare Earth Juejin Tech Community
Juejin, a tech community that helps developers grow.
