Build a Complete Private Knowledge Base with RAG: A Hands‑On Guide
This article walks through a complete, production‑ready Retrieval‑Augmented Generation (RAG) pipeline that lets an AI answer questions over a company’s private documents, covering chunking strategies, embedding model choices, vector‑database selection, retrieval methods, full LangChain chain assembly, and common pitfalls to avoid.
Retrieval‑Augmented Generation (RAG)
RAG = retrieve first, then generate. The model answers a question by first fetching relevant passages from a knowledge base and then composing the answer based on those passages.
Why RAG instead of fine‑tuning?
Knowledge update: Updating a vector store takes seconds; fine‑tuning requires hours‑to‑days of retraining.
Cost: RAG uses API calls + a vector DB (low cost); fine‑tuning needs GPU compute (high cost).
Hallucination risk: RAG can cite sources, making results traceable; fine‑tuned models may “mis‑remember”.
Suitable scenarios: RAG fits private, frequently‑updated knowledge; fine‑tuning is for a fixed‑format, brand‑voice output.
RAG workflow
Two stages:
Indexing (offline, run once or on update):
Document → Chunking → Embedding → Store in vector DB
Query (online, per conversation):
User question → Embedding → Similarity search → Retrieve Top‑K chunks → Insert into Prompt → LLM generates answer
Stage 1 – Document chunking
Chunking determines retrieval quality.
Fixed‑length chunking (common but error‑prone)
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200, # 20% overlap to avoid cutting sentences
    separators=["\n\n", "\n", "。", "!", "?", " ", ""]
)
docs = splitter.split_text(raw_text)
Common mistake: chunk_overlap=0 lets chunk boundaries cut sentences in half, producing unintelligible chunks.
Correct practice: set chunk_overlap to 10‑20% of chunk_size.
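To see concretely what overlap buys you, here is a minimal illustrative chunker (plain Python, not the LangChain implementation) showing that consecutive chunks share a window of text:

```python
def naive_chunk(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size chunking: each chunk starts `chunk_size - overlap`
    characters after the previous one, so neighbours share `overlap` chars."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 1,500 characters of toy text
text = "A" * 500 + "B" * 500 + "C" * 500
chunks = naive_chunk(text, chunk_size=1000, overlap=200)

# The tail of chunk 0 reappears at the head of chunk 1, so a sentence
# straddling the boundary survives intact in at least one chunk.
assert chunks[0][-200:] == chunks[1][:200]
```

With overlap=0 the boundary falls wherever character 1000 happens to land, and a sentence spanning it exists complete in neither chunk.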
Semantic chunking (better for structured texts)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
splitter = SemanticChunker(
OpenAIEmbeddings(),
breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85  # split where inter-sentence distance exceeds the 85th percentile
)
docs = splitter.create_documents([raw_text])
Semantic chunking yields semantically complete chunks but is slower, because it embeds every sentence to find breakpoints; it is best suited to offline batch processing.
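The breakpoint idea can be sketched with toy sentence embeddings (pure Python, illustrating the general idea rather than the library's exact algorithm; the vectors here are hypothetical):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def breakpoints(sentence_vecs: list[list[float]], percentile: int = 85) -> list[int]:
    """Return indices i where the distance between sentence i and i+1
    is at or above the given percentile of all consecutive distances."""
    dists = [cosine_distance(a, b) for a, b in zip(sentence_vecs, sentence_vecs[1:])]
    ranked = sorted(dists)
    threshold = ranked[min(len(ranked) - 1, int(len(ranked) * percentile / 100))]
    return [i for i, d in enumerate(dists) if d >= threshold]

# Toy vectors: sentences 0-1 discuss one topic, sentences 2-3 another.
vecs = [[1.0, 0.0], [0.98, 0.05], [0.0, 1.0], [0.05, 0.99]]
print(breakpoints(vecs))  # [1] – the topic jump sits between sentences 1 and 2
```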
Stage 2 – Embedding
Embedding maps text to a numeric vector (e.g., 1,536 dimensions for text-embedding-3-small). Similar texts yield vectors that are close under cosine or Euclidean distance.
Embedding model choices
# Option A – OpenAI text‑embedding‑3‑small (cost‑effective)
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Option B – Local model (zero API cost, slightly lower quality)
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
model_name="BAAI/bge-m3",
model_kwargs={"device": "cpu"}
)
# Quick sanity check – two synonymous Chinese sentences should have cosine similarity > 0.9
import numpy as np
vec1 = embeddings.embed_query("如何重置密码")    # "How do I reset my password?"
vec2 = embeddings.embed_query("忘记密码怎么办")  # "What if I forgot my password?"
similarity = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
# Expected: similarity > 0.9
Key principle: the same embedding model must be used for both indexing and querying; mixing models produces vectors in incompatible spaces.
Vector‑database options
Chroma – local development, zero‑config Python store.
Qdrant – production‑grade, high performance, supports metadata filtering.
Pinecone – managed cloud service, pay‑as‑you‑go.
pgvector – leverages an existing PostgreSQL instance, no extra infrastructure.
# Example: Chroma (local prototype)
from langchain_chroma import Chroma
vectorstore = Chroma.from_documents(
documents=docs,
embedding=embeddings,
persist_directory="./chroma_db",
collection_name="my_knowledge_base"
)
# Example: Qdrant (production)
from langchain_qdrant import Qdrant
import qdrant_client
client = qdrant_client.QdrantClient(url="http://localhost:6333")
vectorstore = Qdrant(
client=client,
collection_name="my_knowledge_base",
embeddings=embeddings
)
Stage 3 – Retrieval strategies
Retrieval quality often dominates overall performance.
Basic similarity search
# Return the top-4 most similar chunks
results = vectorstore.similarity_search(query="如何申请年假", k=4)  # "How do I apply for annual leave?"
# Retrieve with scores – note: depending on the store, the score may be a
# distance (lower = more similar; Chroma's default) or a similarity score
results_with_score = vectorstore.similarity_search_with_score(query="如何申请年假", k=4)
for doc, score in results_with_score:
    print(f"Score: {score:.3f} | Content: {doc.page_content[:50]}...")
Maximum Marginal Relevance (MMR)
MMR keeps relevance while maximizing diversity, avoiding repeated information.
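Conceptually, MMR greedily picks the candidate with the highest `lambda * relevance - (1 - lambda) * redundancy`, where redundancy is the candidate's maximum similarity to anything already selected. A toy sketch over pre-computed similarity scores (illustrative, not LangChain's implementation):

```python
def mmr_select(query_sims: list[float], doc_sims: list[list[float]],
               k: int, lambda_mult: float) -> list[int]:
    """Greedy MMR over precomputed query-doc and doc-doc similarities."""
    selected: list[int] = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sims[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; doc 2 is less relevant but adds new info.
query_sims = [0.95, 0.94, 0.80]
doc_sims = [
    [1.0, 0.99, 0.10],
    [0.99, 1.0, 0.12],
    [0.10, 0.12, 1.0],
]
print(mmr_select(query_sims, doc_sims, k=2, lambda_mult=0.7))  # [0, 2]
```

With `lambda_mult=1.0` the near-duplicate doc 1 would be chosen instead of doc 2; lowering lambda trades relevance for diversity.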
# MMR retrieval (k=4, fetch 20 candidates, lambda=0.7 balances relevance vs diversity)
results = vectorstore.max_marginal_relevance_search(
query="如何申请年假",
k=4,
fetch_k=20,
lambda_mult=0.7
)
Hybrid retrieval – vector + BM25 keyword search
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
# Keyword retriever (effective for proper nouns, model numbers, etc.)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 4
# Vector retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# Ensemble: 50% weight each (adjustable)
ensemble_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, vector_retriever],
weights=[0.5, 0.5]
)
results = ensemble_retriever.invoke("iPhone 14 的电池容量是多少")  # "What is the battery capacity of the iPhone 14?"
# BM25 matches the exact model number; the vector retriever finds semantically related paragraphs.
Stage 4 – Full RAG chain assembly
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# 1. Initialise components
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
persist_directory="./chroma_db",
embedding_function=embeddings,
collection_name="my_knowledge_base"
)
retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={"k": 4, "fetch_k": 20}
)
# 2. Prompt that forces answer to be based only on retrieved context
rag_prompt = ChatPromptTemplate.from_template("""
You are a professional knowledge‑base assistant. Answer the user question based on the retrieved context below.
**Rules:**
- Respond only using the provided context; if the context lacks the answer, say "Based on the available data, I cannot find an answer."
- Keep the answer concise and cite the original text with quotes.
**Retrieved context:**
{context}
**User question:**
{question}
""")
# 3. Helper to format multiple chunks
def format_docs(docs):
    return "\n---\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    )
# 4. Assemble the chain (LCEL style)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| rag_prompt
| llm
| StrOutputParser()
)
# 5. Example invocation
answer = rag_chain.invoke("我们公司的年假政策是什么?")  # "What is our company's annual-leave policy?"
print(answer)
Version that also returns source documents
from langchain_core.runnables import RunnableParallel
rag_chain_with_source = RunnableParallel(
{
"answer": rag_chain,
"source_documents": retriever # keep original chunks
}
)
result = rag_chain_with_source.invoke("年假怎么申请?")  # "How do I apply for annual leave?"
print("Answer:", result["answer"])
print("\nCited sources:")
for doc in result["source_documents"]:
    print(f" - {doc.metadata.get('source', 'unknown')}: {doc.page_content[:80]}...")
Stage 5 – Engineering document ingestion
import os
from pathlib import Path
from langchain_community.document_loaders import (
PyPDFLoader,
UnstructuredWordDocumentLoader,
TextLoader,
UnstructuredMarkdownLoader,
)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
def load_documents(docs_dir: str) -> list:
    """Load PDF, Word, TXT, and Markdown files, attaching source metadata."""
    documents = []
    loaders = {
        ".pdf": PyPDFLoader,
        ".docx": UnstructuredWordDocumentLoader,
        ".txt": TextLoader,
        ".md": UnstructuredMarkdownLoader,
    }
    for file_path in Path(docs_dir).rglob("*"):
        suffix = file_path.suffix.lower()
        if suffix in loaders:
            loader = loaders[suffix](str(file_path))
            docs = loader.load()
            for doc in docs:
                doc.metadata["source"] = file_path.name
                doc.metadata["file_path"] = str(file_path)
            documents.extend(docs)
            print(f"✅ Loaded: {file_path.name} ({len(docs)} fragments)")
    return documents
def build_knowledge_base(docs_dir: str, persist_dir: str):
    raw_docs = load_documents(docs_dir)
    print(f"\nTotal loaded fragments: {len(raw_docs)}")
    # Chunking (800 chars, 150 overlap)
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=150,
        separators=["\n\n", "\n", "。", "!", "?"]
    )
    chunks = splitter.split_documents(raw_docs)
    print(f"After chunking: {len(chunks)} chunks")
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    # Write in batches to respect embedding-API rate limits
    batch_size = 100
    vectorstore = None
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        if vectorstore is None:
            vectorstore = Chroma.from_documents(
                batch, embeddings,
                persist_directory=persist_dir,
                collection_name="knowledge_base"
            )
        else:
            vectorstore.add_documents(batch)
        print(f"Progress: {min(i + batch_size, len(chunks))}/{len(chunks)}")
    print(f"\n✅ Knowledge base built: {len(chunks)} vectors")
    return vectorstore

# Usage example
vectorstore = build_knowledge_base("./docs", "./chroma_db")
Common pitfalls
Pitfall 1 – Chunk size too large
Using chunk_size=3000 creates noisy chunks that contain unrelated content, leading to off‑topic retrieval.
Recommended: chunk_size=600‑1000. For simple questions keep chunks small; for answers that need more context increase k (e.g., to 6).
Pitfall 2 – Duplicate ingestion
# ❌ Re‑ingest on every start → vector count grows indefinitely
vectorstore = Chroma.from_documents(docs, embeddings)
# ✅ Load existing store if present
if os.path.exists(persist_dir) and os.listdir(persist_dir):
    vectorstore = Chroma(persist_directory=persist_dir, embedding_function=embeddings)
    print("Loaded existing vector store")
else:
    vectorstore = Chroma.from_documents(docs, embeddings, persist_directory=persist_dir)
    print("Created new vector store")
Pitfall 3 – Language mismatch between query and documents
Querying English against Chinese documents yields poor similarity scores.
Solution: use a multilingual embedding model such as BAAI/bge-m3 or translate the query into the document language before retrieval.
Pitfall 4 – Too small k
k=2 may miss relevant paragraphs when the answer spans multiple chunks.
Production recommendation: k=4‑6, increasing further if token budget permits.
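A quick back-of-envelope check on how far k can grow before blowing the token budget (all numbers here are illustrative assumptions; the 4-chars-per-token heuristic is rough for English and varies by language and tokenizer):

```python
chunk_size_chars = 800          # chunk size from the ingestion script
chars_per_token = 4             # rough heuristic for English; Chinese is closer to 1-2
context_budget_tokens = 8_000   # how much of the prompt we allow for retrieved context

tokens_per_chunk = chunk_size_chars / chars_per_token
max_k = int(context_budget_tokens // tokens_per_chunk)
print(max_k)  # 40 – so k=4-6 leaves plenty of headroom
```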
Pitfall 5 – Prompt lacks “answer only from context” constraint
Without the constraint the model mixes its own knowledge with retrieved text, causing hallucinations.
Adding the explicit rule reduces hallucinations by roughly 80%.
Pre‑deployment checklist
Embedding model used for indexing and querying is identical.
chunk_overlap ≥ 10% of chunk_size.
Each document chunk includes source metadata.
Prompt contains the “answer only from context” rule.
Retrieval k ≥ 4.
Ingestion process is idempotent (no duplicate vectors).
Hybrid retrieval (BM25 + vector) for domains with many proper nouns.
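One way to satisfy the idempotency item above is to derive a stable ID from each chunk's source and content, so re-running ingestion maps the same chunk to the same entry instead of inserting a duplicate. A minimal sketch (the `ids=` argument at the end follows the general LangChain vector-store pattern; exact duplicate-handling behaviour depends on the store):

```python
import hashlib

def chunk_id(source: str, content: str) -> str:
    """Deterministic ID: the same (source, content) pair always hashes
    to the same value, across runs and machines."""
    return hashlib.sha256(f"{source}::{content}".encode("utf-8")).hexdigest()

chunks = [
    {"source": "handbook.pdf", "content": "Annual leave: 15 days per year."},
    {"source": "handbook.pdf", "content": "Annual leave: 15 days per year."},  # accidental re-ingest
    {"source": "faq.md", "content": "Reset your password via the self-service portal."},
]

# Keying by content hash collapses duplicates before they reach the store.
unique = {chunk_id(c["source"], c["content"]): c for c in chunks}
print(len(unique))  # 2

# Then pass the stable IDs along, e.g.:
# vectorstore.add_documents(documents, ids=list(unique.keys()))
```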
Summary of key findings
Chunking sets the upper bound. Using chunk_size=800 and overlap=150 works well; semantic chunking improves relevance by 20‑30% over fixed length.
Embedding selection. text-embedding-3-small offers the best cost‑performance for English; for Chinese content bge-m3 provides strong multilingual performance.
Layered retrieval. Start with basic similarity, add MMR for diversity, and combine with BM25 when handling product codes or brand names.
Prompt constraint. Explicitly requiring the model to answer only from the provided context cuts hallucinations by ~80%.
Engineering essentials. Ensure idempotent ingestion, attach source metadata, batch vector writes to respect API rate limits.
The central insight is that retrieval quality outweighs generation quality: the answer already exists in the documents; the challenge is locating the correct piece.
James' Growth Diary
I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.