Step‑by‑Step Guide: Building a PDF‑Based RAG Knowledge Base with LangChain, Streamlit, DashScope & DeepSeek

This tutorial shows how to create a lightweight Retrieval‑Augmented Generation (RAG) system that indexes multiple PDF files, stores their embeddings in a FAISS vector database, and answers user queries through a LangChain agent powered by DashScope embeddings and the DeepSeek‑Chat model, all wrapped in a Streamlit UI.


1. Environment setup

Create an Anaconda virtual environment named langchainenv and install the required packages (the LangChain libraries are needed by the code in section 2; DashScopeEmbeddings and the FAISS wrapper live in langchain-community):

pip install streamlit PyPDF2 langchain langchain-community dashscope faiss-cpu

Run the Streamlit app with streamlit run langchain搭建pdf解析rag系统.py (listening on port 8501).

2. Core LangChain logic

The script imports Streamlit, PyPDF2, the LangChain components, FAISS, the DashScope embeddings, and the DeepSeek chat model. It defines functions to read PDFs, split the text into 1,000-character chunks with a 200-character overlap (RecursiveCharacterTextSplitter counts characters, not tokens), embed the chunks using DashScopeEmbeddings(model="text-embedding-v1"), and store the vectors in a local FAISS index.
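The article does not reproduce the import block, so here is a minimal sketch of the setup the snippets below assume. Reaching DeepSeek through LangChain's OpenAI-compatible ChatOpenAI wrapper, the endpoint URL, and the environment-variable key handling are all assumptions, since the article only names the DeepSeek-Chat model:

import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_openai import ChatOpenAI  # assumption: DeepSeek via its OpenAI-compatible API

embeddings = DashScopeEmbeddings(model="text-embedding-v1")  # reads DASHSCOPE_API_KEY from the environment
llm = ChatOpenAI(model="deepseek-chat", base_url="https://api.deepseek.com")  # assumed endpoint; needs an API key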

def pdf_read(pdf_doc):
    """Concatenate the text of every page of every uploaded PDF."""
    text = ""
    for pdf in pdf_doc:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            # extract_text() returns None for pages with no extractable text (e.g., scans)
            text += page.extract_text() or ""
    return text


def get_chunks(text):
    # Split on recursive separators into 1,000-character chunks with a 200-character overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    return text_splitter.split_text(text)


def vector_store(text_chunks):
    # Embed each chunk with DashScope and persist the FAISS index to disk
    db = FAISS.from_texts(text_chunks, embedding=embeddings)
    db.save_local("faiss_db")

After the vector store is built, a LangChain retriever is created from the FAISS index and wrapped with create_retriever_tool. A ChatPromptTemplate supplies system instructions, chat history, and an agent_scratchpad. The agent is assembled with create_tool_calling_agent and executed via AgentExecutor to produce answers.
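The retriever-tool step is described but not shown verbatim; the following is a minimal sketch consistent with that description. The tool name pdf_search and its description string are illustrative, and allow_dangerous_deserialization is required by recent langchain-community releases when loading a locally pickled index:

from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain.tools.retriever import create_retriever_tool

# Reload the persisted index with the same embedding model used to build it
db = FAISS.load_local("faiss_db", embeddings, allow_dangerous_deserialization=True)
retriever = db.as_retriever()
retrieval_tool = create_retriever_tool(
    retriever,
    "pdf_search",  # illustrative name
    "Search the uploaded PDF documents for passages relevant to the question.",
)
tools = [retrieval_tool]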

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an AI assistant; answer based on the provided context. If the answer is not in the context, say \"The answer is not in the context.\""),
    ("placeholder", "{chat_history}"),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}")
])
agent = create_tool_calling_agent(llm, tools, prompt)  # tools is the list built above, not [tools]
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
response = agent_executor.invoke({"input": query})
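AgentExecutor.invoke returns a dictionary, so the final answer text is read from response["output"].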

3. UI with Streamlit

The front‑end displays a title, the database status, a PDF uploader (multiple files allowed), and a submit button that triggers the processing pipeline: read PDFs → split → embed → store. Once the database is ready, an input box lets the user ask questions; the app calls user_input (sketched after the snippet below), which loads the FAISS index, rebuilds the retriever tool, and runs the conversational agent.

if process_button:
    raw_text = pdf_read(pdf_doc)
    text_chunks = get_chunks(raw_text)
    vector_store(text_chunks)
    st.success("✅ PDF processing complete")
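The user_input helper is only described above, not shown; here is a minimal sketch under the same assumptions as the earlier snippets (reload the index, rebuild the tool, run the agent, display the answer):

def user_input(query):
    # Reload the persisted FAISS index built during PDF processing
    db = FAISS.load_local("faiss_db", embeddings, allow_dangerous_deserialization=True)
    tool = create_retriever_tool(
        db.as_retriever(),
        "pdf_search",
        "Search the uploaded PDF documents for passages relevant to the question.",
    )
    agent = create_tool_calling_agent(llm, [tool], prompt)
    executor = AgentExecutor(agent=agent, tools=[tool], verbose=True)
    response = executor.invoke({"input": query})
    st.write("Reply: ", response["output"])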

The UI also provides a sidebar for database management (clear database) and displays helpful messages and progress spinners.
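The article does not show the sidebar code; one plausible sketch follows, in which the shutil-based cleanup and the widget labels are assumptions rather than the article's exact implementation:

import os
import shutil

with st.sidebar:
    st.header("Database management")
    if st.button("Clear database"):
        # Delete the persisted FAISS index directory, if present
        if os.path.isdir("faiss_db"):
            shutil.rmtree("faiss_db")
        st.success("Database cleared")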

4. Running the system

After uploading a PDF (e.g., the DeepSeek deployment guide) the system reports the number of generated chunks (e.g., 37). Querying “What are the deployment options for DeepSeek‑R1?” shows the retrieved context and the model’s answer, confirming that the RAG pipeline works end‑to‑end.

5. Conclusion

The article demonstrates that LangChain dramatically reduces the effort required to build a RAG agent by providing ready‑made wrappers for text splitting, embedding, vector storage, and tool‑calling agents. The complete code can be run locally and extended to multiple PDFs or other document types.

Tags: Python · LangChain · RAG · FAISS · DeepSeek · Streamlit · DashScope
Written by Fun with Large Models

A Master's graduate of Beijing Institute of Technology with four papers in top journals, formerly a developer at ByteDance and Alibaba, now researching large models at a major state-owned enterprise. Committed to sharing concise, practical AI large-model development experience, in the belief that large models will become as essential as the PC. Let's start experimenting now!
