How RAG Architecture Evolves: From Simple Chains to Flexible RAG Flows
This article examines the evolution of Retrieval‑Augmented Generation (RAG) architectures for large language models, outlines the challenges they face, introduces the modular RAG Flow concept with four workflow paradigms, and provides a step‑by‑step implementation using LangChain and LlamaIndex with code examples.
LLM Application Architecture Evolution
As large language models (LLMs) mature, enterprise-facing (B2B) applications built on them, such as knowledge-intensive RAG search and AI agents, are moving beyond simple prompt engineering and linear chain architectures. These applications now need more flexible, orchestrated workflows to handle more demanding tasks.
Challenges of RAG Applications
RAG faces several practical hurdles. Knowledge recall precision depends on effective indexing and retrieval, which involves document loading, chunking, embedding selection, and handling multimodal data. The LLM’s own generation ability is critical; model size, parameter tuning, and anti‑hallucination capabilities affect output quality. Additional technical limits include context‑window constraints, latency from many processing steps, and the need to merge multi‑turn dialogue context for retrieval.
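To make the context-window constraint concrete, the sketch below greedily keeps the highest-ranked chunks that fit a token budget. It is illustrative only: `pack_context` is a hypothetical helper, and whitespace word counts stand in for a real tokenizer.

```python
def pack_context(chunks: list[str], max_tokens: int) -> list[str]:
    """Keep ranked chunks, in order, as long as they fit the budget."""
    packed: list[str] = []
    used = 0
    for chunk in chunks:
        cost = len(chunk.split())  # assumption: word count approximates token count
        if used + cost > max_tokens:
            continue  # skip a chunk that would overflow; a smaller later one may still fit
        packed.append(chunk)
        used += cost
    return packed


# Chunks are assumed to arrive already sorted by relevance.
ranked = ["alpha beta gamma", "one two three four five", "x y"]
selected = pack_context(ranked, max_tokens=6)
print(selected)
```

A production system would count tokens with the model's own tokenizer and might truncate a chunk instead of skipping it.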
From RAG to RAG Flow
Recent research (e.g., the survey "Retrieval-Augmented Generation for Large Language Models") proposes a modular RAG architecture that decomposes the pipeline into reusable modules, module classes, and operators/algorithms. Developers can freely compose these components, forming a flexible workflow called a RAG Flow.
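One way to picture this decomposition is a registry that groups interchangeable operators under module names and builds a flow from a list of (module, operator) pairs. This is a minimal sketch, not the survey's reference design: `REGISTRY`, `register`, `build_flow`, and the toy operators are all hypothetical names.

```python
from typing import Callable

Operator = Callable[[dict], dict]  # each operator reads and returns a shared state dict

REGISTRY: dict[str, dict[str, Operator]] = {}


def register(module: str, name: str):
    """Decorator that files an operator under a module type."""
    def wrap(fn: Operator) -> Operator:
        REGISTRY.setdefault(module, {})[name] = fn
        return fn
    return wrap


@register("pre_retrieval", "normalize")
def normalize(state: dict) -> dict:
    return {**state, "query": state["query"].strip().lower()}


@register("retrieval", "keyword")
def keyword_retrieve(state: dict) -> dict:
    words = set(state["query"].split())
    hits = [d for d in state["corpus"] if words & set(d.lower().split())]
    return {**state, "docs": hits}


@register("generation", "template")
def template_generate(state: dict) -> dict:
    answer = f"{len(state['docs'])} passage(s) found for '{state['query']}'"
    return {**state, "answer": answer}


def build_flow(spec: list[tuple[str, str]]) -> Operator:
    """Look up each (module, operator) pair and chain the operators in order."""
    ops = [REGISTRY[module][name] for module, name in spec]

    def run(state: dict) -> dict:
        for op in ops:
            state = op(state)
        return state
    return run


flow = build_flow([
    ("pre_retrieval", "normalize"),
    ("retrieval", "keyword"),
    ("generation", "template"),
])
out = flow({"query": "  Tax Deduction ", "corpus": ["Tax rules for 2024", "Weather report"]})
print(out["answer"])
```

Swapping the retrieval operator then means registering another function under "retrieval" and changing one entry in the spec, which is the kind of recomposition the modular view is meant to allow.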
RAG Flow Paradigms
The inference stage of a RAG Flow can follow four basic patterns:
Sequential paradigm: classic retrieve-then-generate with added pre-retrieval (e.g., query rewrite) and post-retrieval (e.g., rerank) modules.
Conditional paradigm: routing decisions based on keywords or semantics direct the request to different knowledge bases, models, or prompts.
Branch paradigm: parallel branches execute simultaneously, such as multiple retrievals or generation paths, and their results are later merged.
Loop paradigm: iterative or recursive cycles perform repeated retrieval-generation steps, often with adaptive stopping criteria; Self-RAG is a notable example.
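The four patterns differ only in control flow, which a toy example can show without any LLM. In the sketch below, `KB`, `retrieve`, and `generate` are hypothetical stand-ins for a knowledge base, a retriever, and a model call, and the loop uses a deliberately simple stopping rule rather than anything resembling Self-RAG.

```python
from concurrent.futures import ThreadPoolExecutor

KB = {
    "tax": ["deduction rules", "filing deadlines"],
    "hr": ["leave policy"],
}


def retrieve(query: str, kb: str) -> list[str]:
    # Toy retriever: documents in the chosen knowledge base sharing a word with the query.
    words = set(query.lower().split())
    return [d for d in KB[kb] if words & set(d.split())]


def generate(query: str, docs: list[str]) -> str:
    # Toy generator standing in for an LLM call.
    return f"{query} -> {docs}"


def sequential(query: str) -> str:
    rewritten = query.strip().lower()      # pre-retrieval: query rewrite
    docs = retrieve(rewritten, "tax")
    docs = sorted(docs, key=len)           # post-retrieval: stand-in for a reranker
    return generate(rewritten, docs)


def conditional(query: str) -> str:
    kb = "hr" if "leave" in query.lower() else "tax"   # keyword-based routing
    return generate(query, retrieve(query, kb))


def branch(queries: list[str]) -> list[str]:
    with ThreadPoolExecutor() as pool:     # retrieval branches run in parallel
        results = list(pool.map(lambda q: retrieve(q, "tax"), queries))
    merged: list[str] = []
    for docs in results:                   # merge branches, dropping duplicates
        for doc in docs:
            if doc not in merged:
                merged.append(doc)
    return merged


def loop(query: str, max_rounds: int = 3) -> tuple[list[str], int]:
    docs: list[str] = []
    for round_no in range(1, max_rounds + 1):
        docs = retrieve(query, "tax")
        if docs:                           # adaptive stop: evidence was found
            return docs, round_no
        query += " rules"                  # otherwise expand the query and retry
    return docs, max_rounds


print(sequential("  Deduction "))
print(conditional("annual leave"))
print(branch(["deduction", "filing deduction"]))
print(loop("tax"))
```

Real flows usually mix these patterns, for example a conditional router whose branches each run a sequential pipeline.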
Implementing a RAG Flow
Two mainstream frameworks, LangChain and LlamaIndex, support modular RAG construction. LangChain offers broader LLM capabilities, while LlamaIndex provides out-of-the-box modules for retrieval, embedding, and fine-tuning.
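Below is a concise implementation using LangChain:

# Import paths assume the split LangChain packages
# (langchain-community, langchain-openai, langchain-text-splitters).
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Load knowledge documents
loader = TextLoader(file_path="test.txt")
document = loader.load()
# Split documents into chunks on newlines
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=500, chunk_overlap=0)
data = text_splitter.split_documents(document)
# Embed and store in a vector store (from_documents takes `embedding`, not `embedding_function`)
db = Chroma.from_documents(documents=data, embedding=OpenAIEmbeddings(), persist_directory="./chrom_db")
# Create a retriever that returns the top 5 chunks per query
retriever = db.as_retriever(search_kwargs={"k": 5})

Query expansion is built with a simple LLM chain: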
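
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0.0, model_name='gpt-3.5-turbo-1106')
template = """
You are a smart AI assistant that can generate multiple related queries from a single input query.
Generate 5 related search queries based on the following query: {query}
"""
prompt = ChatPromptTemplate.from_template(template)
# Split the model output into one query per line, dropping blank lines
generate_queries = (
    prompt
    | llm
    | StrOutputParser()
    | (lambda x: [q.strip() for q in x.split("\n") if q.strip()])
)

Reranking uses Reciprocal Rank Fusion (RRF):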
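
from langchain_core.load import dumps, loads

def reciprocal_rank_fusion(results: list[list], k=60):
    # Map each serialized document to its accumulated RRF score
    fused_scores = {}
    for docs in results:
        # `rank` is 0-based here, so the top hit in a list contributes 1 / k
        for rank, doc in enumerate(docs):
            doc_str = dumps(doc)  # serialize so the document can be used as a dict key
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            fused_scores[doc_str] += 1 / (rank + k)
    # Sort by fused score, highest first, and restore the Document objects
    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]
    return reranked_results

The final step composes query expansion, retrieval, and RRF into the full RAG Flow chain.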
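To see what the fusion step does to scores, here is a framework-free toy run. It follows the same convention as the function above, a 0-based rank and k = 60, so a document's fused score is the sum of 1 / (rank + k) over every list it appears in; the RRF formula is commonly stated with 1-based ranks, which shifts the scores slightly. The name `rrf` and the document IDs are made up for illustration.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):  # rank starts at 0
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (rank + k)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Three retrieval runs (for example, one per expanded query) over the same corpus.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c"],
]
fused = rrf(rankings)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Document "b" ends up first because it is ranked at or near the top of all three lists, while "d" appears only once, in last place. A large k flattens the gap between neighboring ranks, so appearing in many lists matters more than winning any single one.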
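
from langchain_core.runnables import RunnablePassthrough

# Fusion chain: map the question to {query} → generate queries → retrieve per query → RRF
ragfusion_chain = (
    {"query": RunnablePassthrough()}
    | generate_queries
    | retriever.map()
    | reciprocal_rank_fusion
)
# Final generation chain
template = """
Answer the question based on the following context:
{context}
===
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
final_chain = (
    {"context": ragfusion_chain, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
# Test invocation: pass the question as a plain string so both branches receive it unchanged
result = final_chain.invoke("Please describe the policy on special additional deductions for individual income tax")
print(result)

Conclusion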
The shift from simple sequential RAG pipelines to modular, workflow‑driven RAG Flows enables more complex business scenarios, improves task accuracy, and leverages the full potential of LLMs. By combining multiple models, retrieval strategies, and iterative reasoning, developers can build robust AI agents and knowledge‑enhanced applications.
Reference links (GitHub, Medium, and the original survey) provide further details for readers who wish to explore the implementations.
AI Large Model Application Practice
Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.