How Retrieval‑Augmented Generation (RAG) Supercharges LLM Answers – Complete Guide & Code

This article explains Retrieval‑Augmented Generation (RAG), detailing its offline knowledge‑base construction and online retrieval‑enhanced generation workflow, comparing it with traditional and fine‑tuned models, and providing step‑by‑step LangChain implementations, advanced techniques, and practical use‑case demos.

Retrieval‑Augmented Generation (RAG) lets large language models (LLMs) answer questions by first looking up relevant information in an external knowledge base, turning a "closed‑book" model into an "open‑book" one.

What RAG Is

RAG works by retrieving documents that match the user query, feeding those documents as context to the LLM, and generating an answer that can cite the source. It does not permanently store new knowledge in the model; instead, it dynamically pulls up‑to‑date facts at inference time.

Core Value

The main benefit is solving the knowledge‑staleness, domain‑specificity, and hallucination problems of pure LLMs. By accessing the latest or private documents, RAG avoids costly re‑training while improving factual accuracy.

RAG Workflow

The process consists of two key stages:

Offline stage – Knowledge‑base construction

Extract raw text from PDFs, Word files, web pages, etc.

Split the text into manageable chunks ("Chunking").

Encode each chunk with an embedding model.

Store the vectors in a vector database (e.g., Chroma, Pinecone, Milvus).

Online stage – Retrieval‑augmented generation

Convert the user question into an embedding.

Perform semantic similarity search in the vector DB to retrieve the top‑K relevant chunks.

Combine the retrieved chunks with the original question to build a prompt.

Feed the prompt to the LLM, which generates an answer together with source citations.

Core Components

1. Knowledge‑base Construction

Document processing pipeline:

Raw documents → Text extraction → Chunking → Embedding → Store in a vector database

Document loading: supports PDF, Word, HTML, Markdown, etc.

Chunking: splits long documents into smaller pieces for efficient retrieval.

Embedding: transforms each chunk into a dense vector using models such as OpenAIEmbeddings.

Vector storage: persists vectors in a specialized vector DB.

2. Retrieval System

Semantic search: finds documents based on vector similarity.

Hybrid search: combines keyword matching with semantic similarity.

Reranking: applies a second-stage model (e.g., CohereRerank) to reorder results for higher relevance.

3. Generation System

Prompt engineering: designs a template that injects retrieved context and the user query.

Context management: controls the amount of retrieved text fed to the LLM to stay within token limits (see the sketch after this list).

Answer generation: the LLM produces the final response, optionally citing the source documents.
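
As a minimal sketch of context management (the character budget and helper name below are illustrative assumptions, not a library API), retrieved chunks can be packed into the prompt only until a rough size limit is reached:

# Pack retrieved chunk texts into one context string without exceeding a rough budget.
# 6000 characters is an illustrative limit; in practice derive it from the model's
# context window, ideally with a real tokenizer instead of character counts.
def build_context(chunk_texts, max_chars=6000):
    parts, used = [], 0
    for text in chunk_texts:
        if used + len(text) > max_chars:
            break  # drop the remaining, lower-ranked chunks
        parts.append(text)
        used += len(text)
    return "\n\n".join(parts)

# Usage with any retriever that returns LangChain documents:
# context = build_context(doc.page_content for doc in retriever.get_relevant_documents(query))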

RAG vs. Other Approaches

Compared with a plain LLM and a fine‑tuned model:

Knowledge update: traditional models need full re-training; fine-tuning requires another training round; RAG only needs to add or modify documents.

Cost: inference cost for a plain model is low; fine-tuning incurs high training cost; RAG adds moderate retrieval cost on top of inference.

Accuracy: plain models may hallucinate; fine-tuned models are more accurate but fixed; RAG provides traceable answers grounded in real documents.

Professional knowledge: fine-tuned models rely on generic data; RAG can inject domain-specific manuals, regulations, or research papers.

Explainability: RAG can show the exact source chunk, making the answer explainable.

Practical Use‑Cases

Case 1 – Enterprise Knowledge‑Base Q&A

Goal: let employees quickly find policies, technical manuals, or procedures.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Document preparation (offline)
documents = [
    "公司年假政策:员工入职满一年享有5天年假...",
    "报销流程:员工需在费用发生后30天内提交...",
    "技术栈规范:前端统一使用 React + TypeScript...",
]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.create_documents(documents)

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 2. User query (online)
question = "I've been with the company for six months. Can I take annual leave?"
relevant_docs = vectorstore.similarity_search(question, k=3)

# Concatenate all retrieved chunks so the model sees the full retrieved context
context = "\n\n".join(doc.page_content for doc in relevant_docs)

prompt = f"""Answer the question based on the following documents:

Documents:
{context}

User question: {question}

Answer accurately based on the documents; if they contain no relevant information, say so explicitly."""

from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)
# Output: According to the company's annual leave policy, employees become eligible for annual leave after one full year of service. Since you joined six months ago, you are not yet eligible to request annual leave.

Case 2 – Intelligent Customer Service

Goal: automatically answer e-commerce queries about products, logistics, and after-sales support.

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Assume `vectorstore` already contains product FAQs, logistics policies, etc.
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
)

result = qa_chain({"query": "What colors does the iPhone 15 Pro come in?"})
print(f"Answer: {result['result']}")
print(f"Source: {result['source_documents'][0].metadata.get('source')}")

Case 3 – Academic Paper Assistant

Goal: let researchers query large collections of PDFs and retrieve specific sections.

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

loader = PyPDFLoader("research_paper.pdf")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", "。", "!", "?", " ", ""],
)
chunks = text_splitter.split_documents(documents)

# Add metadata for citation
for i, chunk in enumerate(chunks):
    chunk.metadata.update({
        "source": "research_paper.pdf",
        "page": chunk.metadata.get("page", 0),
        "chunk_id": i,
    })

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

End‑to‑End RAG System (LangChain)

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI

print("正在加载文档...")
loader = DirectoryLoader('./knowledge_base/', glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()
print(f"已加载 {len(documents)} 个文档")

print("正在分块处理...")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
print(f"已分割为 {len(chunks)} 个文本块")

print("正在向量化...")
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
print("向量数据库已创建")

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0, model_name="gpt-4"),  # gpt-4 is a chat model, so use ChatOpenAI
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3}),
    return_source_documents=True,
)

print("
知识库问答系统已启动!输入 'quit' 退出
")
while True:
    question = input("请输入问题:")
    if question.lower() == 'quit':
        break
    result = qa_chain({"query": question})
    print(f"
答案:{result['result']}
")
    print("参考来源:")
    for i, doc in enumerate(result['source_documents'], 1):
        print(f"{i}. {doc.metadata.get('source', '未知来源')}")
    print("-" * 50 + "
")

Advanced Techniques

1. Hybrid Search (Keyword + Semantic)

from langchain.retrievers import BM25Retriever, EnsembleRetriever

bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 3
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, semantic_retriever], weights=[0.5, 0.5])
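
Because EnsembleRetriever implements the standard retriever interface, it can be passed directly wherever a plain retriever is expected, for example as retriever=ensemble_retriever in RetrievalQA.from_chain_type. Note that BM25Retriever depends on the rank_bm25 package being installed.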

2. Reranking

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

compressor = CohereRerank(model="rerank-english-v2.0")
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=vectorstore.as_retriever())
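
The compression_retriever is likewise a drop-in replacement for the plain retriever in the chains above. CohereRerank calls Cohere's hosted reranking API, so a Cohere API key (typically supplied via the COHERE_API_KEY environment variable) and the cohere package are required.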

3. Multi‑Query Retrieval

from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=OpenAI(temperature=0)
)
results = multi_query_retriever.get_relevant_documents("What is RAG?")

Practical Tips

Chunk size matters: too small loses context, too large adds noise. A chunk size of 500-1000 characters works for most documents; Chinese text often needs smaller chunks.

Overlap proportion: set chunk_overlap to 10-20% of chunk_size to preserve cross-chunk information without excessive duplication.

Choosing a vector DB:

Chroma – lightweight, great for prototypes.

Pinecone – managed cloud service for production.

Milvus – open‑source, suitable for large‑scale deployments.

Weaviate – supports hybrid search out of the box.

Embedding model impact: OpenAI text-embedding-3-large yields strong retrieval quality but at higher cost; Chinese models like Zhipu embedding-2 offer a better price-performance trade-off.
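
As a small sketch of that trade-off (whether the text-embedding-3-* names are available depends on your langchain and openai versions):

from langchain.embeddings import OpenAIEmbeddings

# Stronger retrieval quality, higher cost per token
embeddings_large = OpenAIEmbeddings(model="text-embedding-3-large")

# Cheaper option that is often good enough for prototypes
embeddings_small = OpenAIEmbeddings(model="text-embedding-3-small")

# Everything downstream stays the same; only the stored vectors change
# vectorstore = Chroma.from_documents(chunks, embeddings_large)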

Hallucination mitigation: high-quality retrieval is the core guard against hallucinations; irrelevant retrieved chunks can still cause false answers.
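
One simple guard, sketched below on top of the Chroma store built earlier, is to check retrieval scores and refuse to answer when nothing sufficiently relevant was found. The 0.8 cut-off is an illustrative assumption, and whether lower or higher means "more relevant" depends on the vector store (Chroma returns a distance, so smaller is better):

# Retrieve with scores; for Chroma the score is a distance (smaller = more similar)
docs_and_scores = vectorstore.similarity_search_with_score(question, k=3)

# Illustrative threshold: treat anything with a distance above 0.8 as irrelevant
relevant = [doc for doc, score in docs_and_scores if score < 0.8]

if not relevant:
    answer = "No relevant content was found in the knowledge base for this question."
else:
    context = "\n\n".join(doc.page_content for doc in relevant)
    # ...build the prompt from context and call the LLM, as in the earlier examples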

Relation to Function Calling: RAG can be viewed as a specialized function call where the function performs a knowledge-base lookup; many applications combine explicit function calls with RAG for conditional retrieval.
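
The sketch below illustrates that framing with the OpenAI tools API: a hypothetical search_knowledge_base function wraps the vector-store lookup from the earlier examples, and the model decides whether to call it before answering.

import json
from openai import OpenAI

client = OpenAI()

# Hypothetical tool that wraps the vector-store lookup built in the examples above
def search_knowledge_base(query: str) -> str:
    docs = vectorstore.similarity_search(query, k=3)
    return "\n\n".join(doc.page_content for doc in docs)

tools = [{
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Look up internal company documents relevant to the query",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "How many days of annual leave do I get?"}]
response = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)
message = response.choices[0].message

# If the model chose to call the tool, run the retrieval and send the result back
if message.tool_calls:
    call = message.tool_calls[0]
    result = search_knowledge_base(**json.loads(call.function.arguments))
    messages.append(message)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    final = client.chat.completions.create(model="gpt-4", messages=messages)
    print(final.choices[0].message.content)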

Emerging GraphRAG: Microsoft's GraphRAG builds a knowledge graph from the corpus, enabling richer entity-level reasoning and more complex queries.

References

LangChain RAG official tutorial – https://python.langchain.com/docs/use_cases/question_answering/

OpenAI Embeddings documentation – https://platform.openai.com/docs/guides/embeddings

Pinecone RAG best practices – https://www.pinecone.io/learn/retrieval-augmented-generation/

Microsoft GraphRAG paper – https://arxiv.org/abs/2404.16130

LlamaIndex RAG framework – https://docs.llamaindex.ai/en/stable/
