Build a Complete Private Knowledge Base with RAG: A Hands‑On Guide

This article walks through a complete, production‑ready Retrieval‑Augmented Generation (RAG) pipeline that lets an AI assistant answer questions over a company’s private documents, covering chunking strategies, embedding model choices, vector‑database selection, retrieval methods, full LangChain chain assembly, and common pitfalls to avoid.

Retrieval‑Augmented Generation (RAG)

RAG = retrieve first, then generate. The model answers a question by first fetching relevant passages from a knowledge base and then composing the answer based on those passages.

Why RAG instead of fine‑tuning?

Knowledge update: Updating a vector store takes seconds; fine‑tuning requires hours‑to‑days of retraining.

Cost: RAG uses API calls + a vector DB (low cost); fine‑tuning needs GPU compute (high cost).

Hallucination risk: RAG can cite sources, making results traceable; fine‑tuned models may “mis‑remember”.

Suitable scenarios: RAG fits private, frequently updated knowledge; fine‑tuning suits fixed output formats and a consistent brand voice.

RAG workflow

Two stages:

Indexing (offline, run once or on update):
Document → Chunking → Embedding → Store in vector DB

Query (online, per conversation):
User question → Embedding → Similarity search → Retrieve Top‑K chunks → Insert into Prompt → LLM generates answer
[Figure: RAG overall architecture diagram]

Stage 1 – Document chunking

Chunking determines retrieval quality.

Fixed‑length chunking (common but error‑prone)

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,  # 20% overlap to avoid cutting sentences
    separators=["\n\n", "\n", "。", "!", "?", " ", ""]
)

docs = splitter.split_text(raw_text)

Common mistake: chunk_overlap=0 leaves no shared context between adjacent chunks, so a sentence or idea cut at a chunk boundary becomes unintelligible.

Correct practice: set chunk_overlap to 10‑20% of chunk_size.

Semantic chunking (better for structured texts)

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85  # split where the embedding distance between adjacent sentences exceeds the 85th percentile
)

docs = splitter.create_documents([raw_text])

Semantic chunking yields semantically complete chunks but is slower because it must embed every sentence to find breakpoints; it is best suited to offline batch processing.

[Figure: Chunking strategy comparison diagram]

Stage 2 – Embedding

Embedding maps text to a numeric vector (e.g., 1536‑dimensional). Texts with similar meaning produce vectors that are close together, typically measured with cosine similarity or Euclidean distance.

Embedding model choices

# Option A – OpenAI text‑embedding‑3‑small (cost‑effective)
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Option B – Local model (zero API cost, slightly lower quality)
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={"device": "cpu"}
)

# Quick sanity check – two synonymous Chinese sentences should have cosine similarity > 0.9
vec1 = embeddings.embed_query("如何重置密码")    # "How do I reset my password?"
vec2 = embeddings.embed_query("忘记密码怎么办")  # "What do I do if I forgot my password?"
# Expected: similarity > 0.9
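
The snippet above stops at producing the vectors. Below is a minimal sketch of the similarity check itself, assuming NumPy is installed; the 0.9 threshold is a rough rule of thumb and the exact value varies by model.

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"Similarity: {cosine_similarity(vec1, vec2):.3f}")  # reuses vec1 / vec2 from the snippet above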

Key principle: use the same embedding model for indexing and querying; vectors produced by different models live in different spaces and are not comparable.

Vector‑database options

Chroma – local development, zero‑config Python store.

Qdrant – production‑grade, high performance, supports metadata filtering.

Pinecone – managed cloud service, pay‑as‑you‑go.

pgvector – leverages an existing PostgreSQL instance, no extra infrastructure.

# Example: Chroma (local prototype)
from langchain_chroma import Chroma
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="my_knowledge_base"
)

# Example: Qdrant (production)
from langchain_qdrant import Qdrant
import qdrant_client
client = qdrant_client.QdrantClient(url="http://localhost:6333")
vectorstore = Qdrant(
    client=client,
    collection_name="my_knowledge_base",
    embeddings=embeddings
)
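
Metadata filtering (mentioned above for Qdrant) is also available on the local Chroma store. A minimal sketch, assuming the Chroma store from the first example and that each chunk carries a source metadata field as in the ingestion code later in this article; the filename is hypothetical.

# Restrict the search to chunks whose metadata matches the filter
results = vectorstore.similarity_search(
    "如何申请年假",  # "How do I request annual leave?"
    k=4,
    filter={"source": "employee_handbook.pdf"}  # hypothetical filename
)
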
[Figure: Vector database selection diagram]

Stage 3 – Retrieval strategies

Retrieval quality often dominates overall performance.

Basic similarity search

# Return the top-4 most similar chunks ("如何申请年假" = "How do I request annual leave?")
results = vectorstore.similarity_search(query="如何申请年假", k=4)

# Retrieve with raw scores. Note: the score's meaning depends on the store; Chroma returns a distance (lower = more similar)
results_with_score = vectorstore.similarity_search_with_score(query="如何申请年假", k=4)
for doc, score in results_with_score:
    print(f"Score: {score:.3f} | Content: {doc.page_content[:50]}...")

Maximum Marginal Relevance (MMR)

MMR balances relevance with diversity among the returned chunks, avoiding near‑duplicate information.

# MMR retrieval: return 4 chunks chosen from 20 candidates; lambda_mult=0.7 leans toward relevance (1.0 = relevance only, 0.0 = maximum diversity)
results = vectorstore.max_marginal_relevance_search(
    query="如何申请年假",
    k=4,
    fetch_k=20,
    lambda_mult=0.7
)

Hybrid retrieval – vector + BM25 keyword search

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Keyword retriever (effective for proper nouns, model numbers, etc.)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 4

# Vector retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Ensemble: 50% weight each (adjustable)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5]
)

results = ensemble_retriever.invoke("iPhone 14 的电池容量是多少")  # "What is the battery capacity of the iPhone 14?"
# BM25 matches the exact model name; the vector retriever finds semantically related paragraphs.
[Figure: Retrieval strategy comparison diagram]

Stage 4 – Full RAG chain assembly

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# 1. Initialise components
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="my_knowledge_base"
)
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20}
)

# 2. Prompt that forces answer to be based only on retrieved context
rag_prompt = ChatPromptTemplate.from_template("""
You are a professional knowledge‑base assistant. Answer the user question based on the retrieved context below.

**Rules:**
- Respond only using the provided context; if the context lacks the answer, say "Based on the available data, I cannot find an answer."
- Keep the answer concise and cite the original text with quotes.

**Retrieved context:**
{context}

**User question:**
{question}
""")

# 3. Helper to format multiple chunks
def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    )

# 4. Assemble the chain (LCEL style)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

# 5. Example invocation ("我们公司的年假政策是什么?" = "What is our company's annual-leave policy?")
answer = rag_chain.invoke("我们公司的年假政策是什么?")
print(answer)

Version that also returns source documents

from langchain_core.runnables import RunnableParallel

rag_chain_with_source = RunnableParallel(
    {
        "answer": rag_chain,
        "source_documents": retriever  # keep original chunks
    }
)

result = rag_chain_with_source.invoke("年假怎么申请?")  # "How do I apply for annual leave?"
print("Answer:", result["answer"])
print("\nCited sources:")
for doc in result["source_documents"]:
    print(f"  - {doc.metadata.get('source', 'unknown')}: {doc.page_content[:80]}...")
[Figure: Full RAG chain flow diagram]

Stage 5 – Engineering document ingestion

import os
from pathlib import Path
from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredWordDocumentLoader,
    TextLoader,
    UnstructuredMarkdownLoader,
)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

def load_documents(docs_dir: str) -> list:
    """Load PDF, Word, TXT, and Markdown files, attaching source metadata."""
    documents = []
    loaders = {
        ".pdf": PyPDFLoader,
        ".docx": UnstructuredWordDocumentLoader,
        ".txt": TextLoader,
        ".md": UnstructuredMarkdownLoader,
    }
    for file_path in Path(docs_dir).rglob("*"):
        suffix = file_path.suffix.lower()
        if suffix in loaders:
            loader = loaders[suffix](str(file_path))
            docs = loader.load()
            for doc in docs:
                doc.metadata["source"] = file_path.name
                doc.metadata["file_path"] = str(file_path)
            documents.extend(docs)
            print(f"✅ Loaded: {file_path.name} ({len(docs)} chunks)")
    return documents

def build_knowledge_base(docs_dir: str, persist_dir: str):
    raw_docs = load_documents(docs_dir)
    print(f"
Total loaded fragments: {len(raw_docs)}")

    # Chunking (800 chars, 150 overlap)
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=150,
        separators=["\n\n", "\n", "。", "!", "?"]
    )
    chunks = splitter.split_documents(raw_docs)
    print(f"After chunking: {len(chunks)} chunks")

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    batch_size = 100
    vectorstore = None
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i+batch_size]
        if vectorstore is None:
            vectorstore = Chroma.from_documents(
                batch, embeddings,
                persist_directory=persist_dir,
                collection_name="knowledge_base"
            )
        else:
            vectorstore.add_documents(batch)
        print(f"Progress: {min(i+batch_size, len(chunks))}/{len(chunks)}")
    print(f"
✅ Knowledge base built: {len(chunks)} vectors")
    return vectorstore

# Usage example
vectorstore = build_knowledge_base("./docs", "./chroma_db")
[Figure: Document ingestion engineering diagram]

Common pitfalls

Pitfall 1 – Chunk size too large

Using chunk_size=3000 creates noisy chunks that contain unrelated content, leading to off‑topic retrieval.

Recommended: chunk_size=600‑1000. For simple questions keep chunks small; for answers that need more context increase k (e.g., to 6).

Pitfall 2 – Duplicate ingestion

import os

# ❌ Re-ingesting into the same persisted store on every start → vector count grows indefinitely
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory=persist_dir)

# ✅ Load the existing store if one is already persisted
if os.path.exists(persist_dir) and os.listdir(persist_dir):
    vectorstore = Chroma(persist_directory=persist_dir, embedding_function=embeddings)
    print("Loaded existing vector store")
else:
    vectorstore = Chroma.from_documents(docs, embeddings, persist_directory=persist_dir)
    print("Created new vector store")

Pitfall 3 – Language mismatch between query and documents

Querying English against Chinese documents yields poor similarity scores.

Solution: use a multilingual embedding model such as BAAI/bge-m3 or translate the query into the document language before retrieval.
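
As a sketch of the translate-first option: the query can be rewritten into the document language before retrieval, reusing the llm and retriever objects from Stage 4; the prompt wording and example question are illustrative.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Pre-retrieval translation step: English question -> Chinese query
translate_prompt = ChatPromptTemplate.from_template(
    "Translate the following question into Chinese. Return only the translation.\n\n{question}"
)
translate_chain = translate_prompt | llm | StrOutputParser()

chinese_query = translate_chain.invoke({"question": "How do I request annual leave?"})
docs = retriever.invoke(chinese_query)  # retrieval now runs in the documents' language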

Pitfall 4 – Too small k

Setting k=2 may miss relevant paragraphs when the answer spans multiple chunks.

Production recommendation: k=4‑6, increasing further if token budget permits.
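
A minimal sketch of widening the retrieval window, reusing the MMR retriever configuration from Stage 4; the exact numbers are illustrative.

# Pull more chunks into the prompt when answers span multiple passages
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 6, "fetch_k": 30}
)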

Pitfall 5 – Prompt lacks “answer only from context” constraint

Without the constraint the model mixes its own knowledge with retrieved text, causing hallucinations.

Adding the explicit rule reduces hallucinations by roughly 80%.

Pre‑deployment checklist

Embedding model used for indexing and querying is identical.

chunk_overlap is at least 10% of chunk_size.

Each document chunk includes source metadata.

Prompt contains the “answer only from context” rule.

Retrieval k ≥ 4.

Ingestion process is idempotent (no duplicate vectors).

Hybrid retrieval (BM25 + vector) for domains with many proper nouns.

Summary of key findings

Chunking sets the upper bound. Using chunk_size=800 and overlap=150 works well; semantic chunking improves relevance by 20‑30% over fixed length.

Embedding selection. text-embedding-3-small offers the best cost‑performance for English; for Chinese content bge-m3 provides strong multilingual performance.

Layered retrieval. Start with basic similarity, add MMR for diversity, and combine with BM25 when handling product codes or brand names.

Prompt constraint. Explicitly requiring the model to answer only from the provided context cuts hallucinations by ~80%.

Engineering essentials. Ensure idempotent ingestion, attach source metadata, batch vector writes to respect API rate limits.

The central insight is that retrieval quality outweighs generation quality: the answer already exists in the documents; the challenge is locating the correct piece.
