
Retrieval Augmented Generation (RAG): Concepts, Workflow, and LangChain Implementation

This article outlines common LLM issues such as hallucination, outdated knowledge, and data privacy; explains how Retrieval-Augmented Generation addresses them through a data-preparation stage and a query-time retrieval stage; walks through a full LangChain implementation; and contrasts RAG with fine-tuning as complementary strategies for up-to-date, grounded responses.

Sohu Tech Products

Introduction

With the rapid development of large language models (LLMs), Retrieval Augmented Generation (RAG) has become a key technique for improving model reliability, reducing hallucinations, and ensuring data security. This article first outlines the main challenges of LLMs and then explains how RAG addresses them.

LLM Problems

Hallucination: Because generation is probabilistic, the model can produce plausible-sounding but false information when no correct answer is available to it.

Timeliness: Models trained on a static data snapshot (e.g., data up to 2021) cannot answer queries about current events, such as today's movie listings.

Data Security: Uploading proprietary documents to public LLM services raises privacy and confidentiality concerns.

RAG mitigates these issues by retrieving external knowledge at query time, providing more accurate and current responses.

What is RAG?

RAG (Retrieval Augmented Generation) combines information retrieval with LLM prompting. The retrieved documents are injected into the prompt as context, allowing the model to generate answers grounded in up‑to‑date data.

RAG Workflow

The process consists of two main stages:

Data Preparation: Extraction → Chunking → Embedding → Storage.

Retrieval & Generation: Query embedding → Similarity search → Context injection → LLM answer generation.
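The two stages can be sketched end to end with a toy example (all names here are hypothetical: a bag-of-characters vector stands in for a real embedding model, and a plain Python list stands in for a vector database):

```python
def embed(text):
    # Stand-in for a real embedding model: a bag-of-characters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Stage 1: data preparation (offline) -- embed each chunk and store it.
store = [(embed(chunk), chunk) for chunk in ["cats chase mice", "stocks rose today"]]

# Stage 2: retrieval (query time) -- embed the query, return the closest chunk.
query = "cats and mice"
query_vec = embed(query)
best = max(store, key=lambda item: dot(query_vec, item[0]))[1]
print(best)  # -> cats chase mice
```

The retrieved chunk would then be injected into the prompt as context, as described below.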

Data Preparation Stage

This offline stage converts private data into vector embeddings and stores them in a vector database.

1. Data Extraction: Convert PDF, Word, and Markdown files, databases, APIs, etc., into a unified plain-text format.

2. Chunking: Split documents into semantically coherent chunks (e.g., 500 characters with a 10-character overlap).

3. Embedding: Transform text chunks into dense vectors using models such as moka-ai/m3e-base or other Hugging Face embedding models.

4. Vector Store: Persist the vectors in a vector database such as FAISS (local), Chroma, Elasticsearch, or Milvus.
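The chunking step can be sketched with a minimal fixed-size splitter (a hypothetical helper for illustration only; practical splitters such as LangChain's CharacterTextSplitter, used later in this article, also try to break at natural separators):

```python
def chunk_text(text, chunk_size=500, overlap=10):
    """Split text into chunks of at most chunk_size characters,
    where consecutive chunks share `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, len(text), step)
            if text[i:i + chunk_size]]

doc = "x" * 1200
chunks = chunk_text(doc)
print(len(chunks), [len(c) for c in chunks])  # -> 3 [500, 500, 220]
```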

Application Stage

During query time, the system retrieves relevant chunks and feeds them to the LLM.

1. Data Retrieval: A similarity search (cosine or Euclidean distance) or full-text search retrieves the top-k most relevant chunks.

2. Prompt Injection: The retrieved context is concatenated with a task description and the user question to form the final prompt.
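The similarity search in step 1 can be sketched with cosine similarity over toy 3-dimensional vectors (real embeddings have hundreds of dimensions, and vector databases accelerate this with approximate nearest-neighbor indexes):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

query_vec = [1.0, 0.0, 1.0]
chunk_vecs = {
    "chunk about movie listings": [0.9, 0.1, 0.8],
    "chunk about cooking":        [0.0, 1.0, 0.1],
}

# Rank stored chunks by similarity to the query and keep the top-k (k=1 here).
ranked = sorted(chunk_vecs, key=lambda c: cosine(query_vec, chunk_vecs[c]), reverse=True)
print(ranked[0])  # -> chunk about movie listings
```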

Example prompt:

prompt = f"""
  Give the answer to the user query delimited by triple backticks ```{query}```
  using the information given in context delimited by triple backticks ```{context}```.
  If there is no relevant information in the provided context, try to answer yourself,
  but tell the user that you did not have any relevant context to base your answer on.
  Be concise and keep the answer under 80 tokens.
"""

Practical Example with LangChain

The following code demonstrates a complete RAG pipeline using LangChain.

Environment Setup

# Environment setup: install the required dependencies
pip install langchain sentence_transformers chromadb

Load Local Data

from langchain.document_loaders import TextLoader
loader = TextLoader("./data/paul_graham_essay.txt")
documents = loader.load()

Document Splitting

# Document splitting
from langchain.text_splitter import CharacterTextSplitter
# Create the splitter
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=10)
# Split the documents
documents = text_splitter.split_documents(documents)

Embedding

from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.vectorstores import Chroma
# embedding model: m3e-base
model_name = "moka-ai/m3e-base"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}
embedding = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

Persist Vectors

# Specifying persist_directory stores the embeddings on disk.
persist_directory = 'db'
db = Chroma.from_documents(documents, embedding, persist_directory=persist_directory)

Retriever

retriever = db.as_retriever()

Prompt Template

from langchain.prompts import ChatPromptTemplate
template = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)

RAG Chain Construction

from langchain_community.chat_models import ChatOllama
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
llm = ChatOllama(model='llama3')
rag_chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
)
query = "What did the author do growing up?"
response = rag_chain.invoke(query)
print(response)

Running the above pipeline with a locally hosted llama3 model yields an answer such as:

Before college, Paul Graham worked on writing and programming outside school. He didn't write essays, but instead focused on writing short stories. His stories were not very good, having little plot and just characters with strong feelings.

RAG vs. Fine‑Tuning

RAG is comparable to an open‑book exam: the model can look up a reference book at inference time. Fine‑tuning is akin to memorizing knowledge through extensive training. The two techniques complement each other: RAG provides up‑to‑date factual grounding, while fine‑tuning improves style, domain adaptation, and instruction following.

Conclusion

The article introduced the challenges of LLMs, explained the concept and workflow of Retrieval Augmented Generation, and provided a concrete LangChain implementation. It also compared RAG with fine‑tuning to help practitioners choose the appropriate strategy for their use cases.

LLM · Prompt Engineering · LangChain · RAG · Vector Database · Embedding · Retrieval-Augmented Generation
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
