How to Build a Full‑Stack RAG Chatbot Using LangChain, FAISS & Langfuse

This guide walks through an end‑to‑end RAG implementation with LangChain, covering multi‑format document loading, recursive text splitting, embedding selection, FAISS vector storage, ConversationalRetrievalChain setup, prompt engineering, source citation, Langfuse observability, and best‑practice configuration management.

Overall Architecture

The project demonstrates a complete Retrieval‑Augmented Generation (RAG) pipeline built on the LangChain framework, from document ingestion to multi‑turn conversational answering, with LLM‑Ops monitoring via Langfuse.

1. Document Loading and Splitting

1.1 Multi‑format Loading

Different loaders preserve original structure and metadata (e.g., page number, source path):

# document_processor.py
import os
from langchain_community.document_loaders import PyPDFLoader, UnstructuredMarkdownLoader

def load_document(file_path: str):
    ext = os.path.splitext(file_path)[1].lower()
    if ext == ".pdf":
        loader = PyPDFLoader(file_path)  # Load page by page, automatically record page number to metadata
    elif ext == ".md":
        loader = UnstructuredMarkdownLoader(file_path)  # Preserve Markdown hierarchy
    else:
        raise ValueError(f"Unsupported file format: {ext}")
    return loader.load()

Note: Always verify that the returned list is non‑empty to avoid feeding empty documents into later stages.
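
A minimal sketch of that guard (the ingest_file helper is illustrative, not part of the project code):

# ingest.py — illustrative helper wrapping load_document
from document_processor import load_document

def ingest_file(file_path: str):
    docs = load_document(file_path)
    if not docs:  # guard against loaders that return nothing
        raise ValueError(f"No content extracted from {file_path}")
    return docs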

1.2 Recursive Text Splitting

Effective chunking is crucial for retrieval quality. The RecursiveCharacterTextSplitter attempts splitting at semantic boundaries (paragraph → sentence → word) and creates overlapping chunks to retain context.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # maximum characters per chunk
    chunk_overlap=100,  # overlapping characters to preserve continuity
)
chunks = splitter.split_documents(docs)
# Remove pure‑whitespace chunks to prevent noise in the vector store
chunks = [c for c in chunks if c.page_content.strip()]

Parameter trade‑offs:

chunk_size: too small → fragmented semantics and lost context; too large → more noise per chunk and lower retrieval precision.

chunk_overlap: too small → information gets cut off at chunk boundaries; too large → redundant data and vector store bloat.

Recommendation for Chinese technical documents: chunk_size of 800–1200 and chunk_overlap of 80–150. For structured documents (tables, lists), a smaller chunk_size may be appropriate. Always filter out empty chunks to avoid zero‑vector pollution.
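
As a sketch, the defaults above can be tuned for Chinese technical documents by passing explicit separators; the separator list here is an assumption, not the project's configuration:

# Sketch: splitter tuned for Chinese technical documents (separator list is an assumption)
from langchain_text_splitters import RecursiveCharacterTextSplitter

zh_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=120,
    separators=["\n\n", "\n", "。", "；", " ", ""],  # fall back from paragraph to sentence to character
)
chunks = [c for c in zh_splitter.split_documents(docs) if c.page_content.strip()]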

2. Vectorization and Storage

2.1 Embeddings Selection

The project uses DashScope's text-embedding-v3 model, which handles Chinese content well.

# config.py
import os

from langchain_community.embeddings import DashScopeEmbeddings

DASHSCOPE_API_KEY = os.getenv("DASHSCOPE_API_KEY")  # read the key from the environment

def get_embeddings():
    return DashScopeEmbeddings(
        model="text-embedding-v3",
        dashscope_api_key=DASHSCOPE_API_KEY,
    )

Note: The embedding model's language coverage should match the document language, and vectors from different embedding models cannot be mixed in one index; switching embedding models requires rebuilding the entire vector store.

2.2 FAISS Vector Store

# vector_store.py
from langchain_community.vectorstores import FAISS

def create_vector_store(documents):
    embeddings = get_embeddings()
    vector_store = FAISS.from_documents(documents, embeddings)
    return vector_store

def get_retriever(vector_store, k=3):
    return vector_store.as_retriever(search_kwargs={"k": k})

FAISS builds an in‑memory index suitable for small‑to‑medium knowledge bases (up to tens of thousands of chunks). The k parameter controls how many top‑similar chunks are returned per query.

Recommended k: 3–5. Smaller values may miss relevant passages; larger values introduce noise and dilute LLM attention.
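
A sketch of a configurable retriever, assuming k is driven by the TOP_K variable from .env; the MMR search type is an optional alternative to plain similarity search, not the project's default:

# Sketch: configurable retriever (env-driven k is an assumption; MMR is optional)
import os

def get_retriever(vector_store, k=None):
    k = k or int(os.getenv("TOP_K", "3"))
    return vector_store.as_retriever(
        search_type="mmr",              # maximal marginal relevance reduces near-duplicate chunks
        search_kwargs={"k": k},
    )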

Persist the index to disk in production to avoid rebuilding after restarts:

vector_store.save_local("faiss_index")  # save the index to disk
vector_store = FAISS.load_local("faiss_index", embeddings)  # reload on startup
# Newer langchain_community releases may also require allow_dangerous_deserialization=True when loading.

When documents are updated, use vector_store.add_documents(new_chunks) for incremental addition instead of full reconstruction.
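
A minimal sketch of that incremental path (the helper name and the re-save step are illustrative):

# Sketch: incremental update instead of a full rebuild (helper name is illustrative)
from langchain_text_splitters import RecursiveCharacterTextSplitter

def update_vector_store(vector_store, new_documents):
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    new_chunks = [c for c in splitter.split_documents(new_documents) if c.page_content.strip()]
    vector_store.add_documents(new_chunks)  # embeds and appends only the new chunks
    vector_store.save_local("faiss_index")  # persist the updated index
    return vector_store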

3. Conversational Retrieval Chain

3.1 Chain Workflow

The ConversationalRetrievalChain combines multi‑turn dialogue with vector retrieval in two steps:

Question condensation: transform follow‑up queries into independent, self‑contained questions.

Retrieval + generation: fetch relevant chunks and let the LLM produce the final answer.

# Simplified workflow diagram
User question + chat history
│
▼ Step 1: Condense Question (Condense Question LLM)
│
▼ Step 2: Retrieve → Generate (Retriever → QA LLM)

3.2 Prompt Design

Question‑condensation prompt:

from langchain_core.prompts import PromptTemplate

_CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(
    """Based on the following conversation history and the follow-up question, rewrite the question as a standalone query.
Conversation history:
{chat_history}
Follow-up question: {question}
Standalone question:"""
)

QA prompt: the answer‑generation prompt should constrain the LLM as follows (a minimal sketch appears after this list):

Explicitly forbid hallucinations.

Require source attribution in the answer.

Provide a default reply such as "No relevant information found" when retrieval yields nothing.
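
A minimal sketch of such a QA prompt (wording is illustrative; the project's actual template may differ):

# Sketch: QA prompt implementing the three rules above (wording is illustrative)
from langchain_core.prompts import PromptTemplate

_QA_PROMPT = PromptTemplate.from_template(
    """Answer the question using only the context below. If the context does not contain the answer,
reply "No relevant information found" and nothing else. Do not invent facts, and name the source
document(s) your answer is based on.

Context:
{context}

Question: {question}
Answer:"""
)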

3.3 Chain Assembly

# qa_chain.py
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory

def create_qa_chain(retriever):
    llm = get_llm()
    memory = ConversationBufferWindowMemory(
        k=5,  # keep recent 5 rounds of dialogue
        memory_key="chat_history",
        return_messages=True,  # return ChatMessage objects for compatibility with Chat models
        output_key="answer",  # only write answer field to memory, exclude source_documents
    )
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
        condense_question_prompt=_CONDENSE_QUESTION_PROMPT,
        combine_docs_chain_kwargs={"prompt": _QA_PROMPT},
        return_source_documents=True,  # return original documents for citation
    )
    return chain

Key notes: output_key="answer" must be set when return_source_documents=True so the memory knows where to store the answer. return_messages=True makes the memory compatible with Chat‑type LLMs; set to False for text‑completion models.

The memory window k should be 3–8; larger values may exceed LLM context limits.

3.4 Source Attribution

After retrieval, the source file names are appended to the answer if they are not already mentioned.

def ask(chain, question: str, callbacks=None) -> str:
    result = chain({"question": question}, callbacks=callbacks)
    answer = result["answer"]
    source_docs = result.get("source_documents", [])
    sources = set()
    for doc in source_docs:
        sources.add(doc.metadata.get("source", "unknown source"))
    if sources:
        source_text = ", ".join(sorted(sources))  # stable ordering for repeatable output
        if source_text not in answer:
            answer += f"\n\n📄 Source: {source_text}"
    return answer
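
Putting the pieces together, a minimal interactive loop might look like this (the file path and script structure are illustrative; the Langfuse handler comes from section 5):

# Sketch: end-to-end wiring (file path and loop structure are illustrative)
docs = load_document("docs/manual.pdf")
chunks = [c for c in splitter.split_documents(docs) if c.page_content.strip()]
vector_store = create_vector_store(chunks)
chain = create_qa_chain(get_retriever(vector_store, k=3))

handler = get_langfuse_handler()
callbacks = [handler] if handler else None
while True:
    question = input("You: ").strip()
    if not question:
        break
    print("Bot:", ask(chain, question, callbacks=callbacks))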

4. LLM Integration

4.1 OpenAI‑compatible Interface (Recommended)

# config.py
from langchain_openai import ChatOpenAI

def get_llm():
    return ChatOpenAI(
        model=LLM_MODEL_NAME,  # e.g., "glm-5"
        api_key=LLM_API_KEY,
        base_url=LLM_API_BASE,  # e.g., "https://api.modelarts-maas.com/openai/v1"
    )

Best practices:

Prefer ChatOpenAI (a chat model) over a plain completion LLM: it natively supports the message format and integrates better with the chain's memory mechanism.

Manage credentials via environment variables; never hard‑code API keys.

Switching models only requires updating three variables in .env without code changes.

5. Observability with Langfuse

5.1 Why Observe RAG?

RAG systems can suffer from silent quality issues such as irrelevant retrieval results, question‑rephrasing drift, or LLM hallucinations. Monitoring each step helps pinpoint failures.

5.2 Integration (Langfuse 3.x)

# config.py — the Langfuse SDK automatically reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment
from langfuse.langchain import CallbackHandler

def get_langfuse_handler():
    if not LANGFUSE_ENABLED:
        return None
    return CallbackHandler(update_trace=True)

# qa_chain.py — inject handler into chain call
handler = get_langfuse_handler()
callbacks = [handler] if handler else None
result = chain({"question": question}, callbacks=callbacks)

When LANGFUSE_ENABLED is false or missing, the handler is omitted, providing zero‑intrusion fallback.

6. Configuration Management Best Practices

# config.py
import os

from dotenv import load_dotenv

load_dotenv()  # load variables from the .env file

# Credentials and endpoints (never hard-coded)
LLM_API_KEY = os.getenv("LLM_API_KEY")
LLM_API_BASE = os.getenv("LLM_API_BASE")
LLM_MODEL_NAME = os.getenv("LLM_MODEL_NAME")

# Tunable pipeline parameters with sensible defaults
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "100"))
TOP_K = int(os.getenv("TOP_K", "3"))
MEMORY_ROUNDS = int(os.getenv("MEMORY_ROUNDS", "5"))

All runtime parameters are centralized in environment variables; the code contains no hard‑coded secrets.

.env example (essential variables only):

LLM_API_KEY=your_api_key
LLM_API_BASE=https://api.modelarts-maas.com/openai/v1
LLM_MODEL_NAME=glm-5
EMBEDDINGS_PROVIDER=modelarts
EMBEDDINGS_API_BASE=https://api.modelarts-maas.com/v1
EMBEDDINGS_MODEL=bge-m3
CHUNK_SIZE=1000
CHUNK_OVERLAP=100
TOP_K=3
MEMORY_ROUNDS=5
LANGFUSE_PUBLIC_KEY=pk-lf-xxx
LANGFUSE_SECRET_KEY=sk-lf-xxx
LANGFUSE_HOST=http://localhost:3000
LANGFUSE_ENABLED=True