Advanced RAG with Semi‑Structured Data Using LangChain, Unstructured, and ChromaDB
This tutorial demonstrates how to build an advanced Retrieval‑Augmented Generation (RAG) system for semi‑structured PDF data by leveraging LangChain, the unstructured library, ChromaDB vector store, and OpenAI models, covering installation, PDF partitioning, element classification, summarization, and query execution.
Preface
RAG (Retrieval‑Augmented Generation) is a natural‑language‑processing technique that combines retrieval (vector databases) with generative AI models to improve information‑retrieval quality.
Naive RAG
Naive RAG refers to the most basic retrieve‑and‑generate pipeline, which includes document chunking, embedding, and semantic similarity search based on user queries. While simple, its performance and quality are limited, motivating the move to Advanced RAG.
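The chunk-embed-retrieve loop of naive RAG can be sketched without any libraries. The bag-of-words `embed` function and the sample chunks below are invented purely for illustration; a real system uses a learned embedding model and an approximate-nearest-neighbor index:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline uses a learned model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Chunk the documents (each string stands in for one chunk).
chunks = [
    "Nvidia reported a change in beneficial ownership.",
    "The statement lists shares acquired and disposed.",
    "Unrelated text about something else entirely.",
]

# 2. Embed every chunk up front and keep the vectors in an index.
index = [(chunk, embed(chunk)) for chunk in chunks]

# 3. At query time, embed the query and take the most similar chunk.
query = "how many shares were disposed"
best_chunk = max(index, key=lambda pair: cosine(embed(query), pair[1]))[0]

# 4. A generative model would now answer the query using best_chunk as context.
print(best_chunk)
```

The limits of this approach are already visible here: retrieval quality depends entirely on how well a chunk's surface form matches the query, which is exactly what the advanced techniques below improve on.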
Semi‑Structured Data
Semi‑structured data lies between structured and unstructured data, mixing tabular formats with free‑form text, images, or other media. Examples include PDF statements that contain text, tables, and figures. Handling such data requires both SQL‑like processing for the structured parts and embedding‑based retrieval for the unstructured parts.
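As a dependency-free illustration of that split, a single record can carry both kinds of data; the tabular part answers to filters and aggregates while the free text needs search. All values below are made up for the sketch and do not come from the actual Nvidia statement:

```python
# A semi-structured record: a structured table plus free-form notes.
record = {
    "table": [
        {"date": "2023-09-01", "shares": 1200, "action": "disposed"},
        {"date": "2023-09-02", "shares": 300, "action": "acquired"},
    ],
    "notes": "Shares were sold under a pre-arranged trading plan.",
}

# Structured part: answer with a filter and an aggregate, as SQL would.
disposed_total = sum(
    row["shares"] for row in record["table"] if row["action"] == "disposed"
)

# Unstructured part: fall back to text matching (embeddings, in a real system).
notes_mentions_plan = "trading plan" in record["notes"].lower()

print(disposed_total, notes_mentions_plan)  # 1200 True
```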
The demo uses the unstructured package to create custom pipelines for processing these elements, LangChain to orchestrate the RAG workflow, and ChromaDB as the vector store.
Nvidia Equity Change Statement
The example PDF is an Nvidia equity‑change declaration, chosen for its compact size and mix of structured tables and unstructured text.
Practical Steps
Install the required Python packages (the extras specifier is quoted so the shell does not expand the brackets):

```shell
!pip install langchain "unstructured[all-docs]" pydantic lxml openai chromadb tiktoken -q -U
```
Download the PDF and save it as statement_of_changes.pdf (note the uppercase `-O`; lowercase `-o` would redirect wget's log instead of naming the output file):

```shell
!wget -O statement_of_changes.pdf https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/381953f9-934e-4cc8-b099-144910676bad.pdf
```
Install the system utilities needed for PDF extraction and OCR:

```shell
!apt-get install poppler-utils tesseract-ocr
```
Set the OpenAI API key:

```python
import os

os.environ["OPENAI_API_KEY"] = ""
```
Partition the PDF into elements using unstructured's partition_pdf, with parameters that infer table structure and chunk the text by title:

```python
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="statement_of_changes.pdf",
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3000,
    combine_text_under_n_chars=2000,
    image_output_dir_path=".",
)
```
Count the element categories to understand the composition of the document:

```python
category_counts = {}
for element in raw_pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

unique_categories = set(category_counts.keys())
category_counts
```
Separate table and text elements into distinct lists (note the fully qualified class names, which are easy to misspell):

```python
class Element(BaseModel):
    type: str
    text: Any

table_elements = []
text_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        table_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        text_elements.append(Element(type="text", text=str(element)))

print(len(table_elements))
print(len(text_elements))
```
Summarize each element using a LangChain chain (StrOutputParser must be instantiated with parentheses, otherwise the pipe operator receives the class instead of a parser):

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

prompt_text = """You are responsible for concisely summarizing a table or text chunk.

{element}"""
prompt = ChatPromptTemplate.from_template(prompt_text)
model = ChatOpenAI(temperature=0, model="gpt-4")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

# Summarize tables
tables = [i.text for i in table_elements]
table_summarizes = summarize_chain.batch(tables, {"max_concurrency": 5})

# Summarize texts
texts = [i.text for i in text_elements]
text_summarizes = summarize_chain.batch(texts, {"max_concurrency": 5})
```
Build a MultiVectorRetriever that links the summaries (stored as vectors) to the original documents via a shared ID:

```python
import uuid

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

# The vector store indexes the summaries; the docstore holds the originals.
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key=id_key)

# Text documents
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summarizes)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# Table documents
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summarizes)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))
```
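The core idea of this retriever reduces to two stores joined by a shared ID. The dependency-free sketch below (toy strings standing in for embeddings and summaries) shows why searching over compact summaries can still return the full originals:

```python
import uuid

# Two stores joined by doc_id: searchable summaries and a plain docstore.
originals = {}
summary_index = []  # (summary, doc_id) pairs standing in for embedded summaries

docs = [
    ("full table with every row of the equity statement", "summary: equity table"),
    ("long narrative section of the filing", "summary: narrative text"),
]
for original, summary in docs:
    doc_id = str(uuid.uuid4())
    originals[doc_id] = original          # docstore keeps the original
    summary_index.append((summary, doc_id))  # vector store indexes the summary

def retrieve(query: str) -> str:
    """Match against the compact summary, but return the linked original."""
    for summary, doc_id in summary_index:
        if query in summary:
            return originals[doc_id]
    return ""

print(retrieve("equity"))  # prints the full original, not the summary
```

This is the same trade-off MultiVectorRetriever makes: small, focused summaries give better similarity matches, while the LLM still receives the complete table or text chunk as context.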
Create the final chain that takes a user question, retrieves the relevant context, and generates an answer:

```python
from langchain.schema.runnable import RunnablePassthrough

template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(temperature=0, model="gpt-4")

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
```
Execute a sample query:

```python
chain.invoke("How many stocks were disposed? Who is the beneficial owner?")
```
Summary
MultiVectorRetriever for linking summaries with original documents.
Unstructured library for parsing semi‑structured PDFs.
ChromaDB and InMemoryStore for vector storage and document retrieval.