How RAG Architecture Evolves: From Simple Chains to Flexible RAG Flows

This article examines the evolution of Retrieval-Augmented Generation (RAG) architectures for large language models, outlines the challenges they face, introduces the modular RAG Flow concept with four workflow paradigms, and walks through a step-by-step implementation in LangChain with code examples.

AI Large Model Application Practice

LLM Application Architecture Evolution

As large language models (LLMs) mature, enterprise-facing (B2B) applications built on them, such as knowledge-intensive RAG search and AI agents, are moving beyond simple prompt engineering and linear chain architectures. These applications now need more flexible, orchestrated workflows to handle more demanding tasks.

Challenges of RAG Applications

RAG faces several practical hurdles. First, retrieval precision depends on effective indexing and retrieval, which covers document loading, chunking, embedding model selection, and handling multimodal data. Second, the LLM's own generation ability matters: model size, parameter tuning, and resistance to hallucination all affect output quality. Third, there are engineering limits: context-window constraints, latency that accumulates across many processing steps, and the need to merge multi-turn dialogue context into the retrieval query.
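The context-window constraint can be made concrete with a small sketch. It assumes retrieved chunks arrive already ranked and uses a crude whitespace word count as a stand-in for a real tokenizer; the function name `pack_context` is hypothetical and not part of any library.

```python
from typing import List


def pack_context(chunks: List[str], max_tokens: int) -> List[str]:
    """Keep ranked chunks, in order, until the token budget is exhausted."""
    selected: List[str] = []
    used = 0
    for chunk in chunks:
        # Assumption: one whitespace-separated word counts as one token.
        cost = len(chunk.split())
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected


picked = pack_context(
    ["one two three", "four five", "six seven eight nine"], max_tokens=6
)
print(picked)
```

A production system would count tokens with the tokenizer of the target model, but the budgeting logic stays the same.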

From RAG to RAG Flow

Recent research (e.g., the survey "Retrieval-Augmented Generation for Large Language Models") proposes a modular RAG architecture that decomposes the pipeline into reusable modules, module classes, and operators/algorithms. Developers can freely compose these components, forming a flexible workflow called a RAG Flow.
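One way to picture this decomposition is a registry that maps each module to interchangeable operators, with a flow assembled from a list of choices. The sketch below is illustrative only: the registry layout, the operator names, and `build_flow` are assumptions made for this example, not an API from the survey or from any framework.

```python
from typing import Any, Callable, Dict, List, Tuple

# A state dict flows through the pipeline; each operator reads and updates it.
State = Dict[str, Any]
Operator = Callable[[State], State]

# Hypothetical registry: module name -> operator name -> implementation.
REGISTRY: Dict[str, Dict[str, Operator]] = {
    "pre_retrieval": {
        "lowercase": lambda s: {**s, "query": s["query"].lower()},
        "identity": lambda s: s,
    },
    "retrieval": {
        "keyword": lambda s: {
            **s,
            "docs": [d for d in s["corpus"] if s["query"] in d.lower()],
        },
    },
    "generation": {
        "echo": lambda s: {**s, "answer": f"Found {len(s['docs'])} document(s)."},
    },
}


def build_flow(spec: List[Tuple[str, str]]) -> Operator:
    """Compose the chosen (module, operator) pairs into one callable flow."""
    operators = [REGISTRY[module][name] for module, name in spec]

    def run(state: State) -> State:
        for operator in operators:
            state = operator(state)
        return state

    return run


flow = build_flow(
    [("pre_retrieval", "lowercase"), ("retrieval", "keyword"), ("generation", "echo")]
)
result = flow({"query": "RAG", "corpus": ["RAG basics", "Agents", "Modular rag flows"]})
print(result["answer"])
```

Swapping an operator (for example, `"identity"` in place of `"lowercase"`) changes the flow without touching the other modules, which is the point of the modular design.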

RAG Flow Paradigms

The inference stage of a RAG Flow can follow four basic patterns:

Sequential paradigm: classic retrieve-then-generate with added pre-retrieval (e.g., query rewrite) and post-retrieval (e.g., rerank) modules.

Conditional paradigm: routing decisions based on keywords or semantics direct the request to different knowledge bases, models, or prompts.

Branch paradigm: parallel branches execute simultaneously, such as multiple retrievals or generation paths, and their results are later merged.

Loop paradigm: iterative or recursive cycles perform repeated retrieval-generation steps, often with adaptive stopping criteria; Self-RAG is a notable example.
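The four patterns above can be sketched with toy stand-ins for the real modules. Everything here is hypothetical: the in-memory knowledge bases, the word-overlap `retrieve`, the string-formatting `generate`, and the synonym table used by the loop are placeholders for a vector store, an LLM, and an LLM-driven query rewrite.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Toy knowledge bases and synonym table (assumptions for illustration).
KB = {
    "tax": ["Tax deductions cover education and housing."],
    "hr": ["Annual leave is 10 days."],
}
SYNONYMS = {"vacation": "leave"}


def rewrite(query: str) -> str:
    return query.strip().lower()


def retrieve(query: str, kb: str = "tax") -> List[str]:
    # Return documents that share at least one word with the query.
    words = set(query.split())
    return [d for d in KB[kb] if words & set(d.lower().rstrip(".").split())]


def generate(query: str, docs: List[str]) -> str:
    return f"{query} -> {' '.join(docs) if docs else 'no context'}"


def sequential(query: str) -> str:
    # Pre-retrieval rewrite, then retrieve, then generate.
    q = rewrite(query)
    return generate(q, retrieve(q))


def conditional(query: str) -> str:
    # Keyword routing picks the knowledge base.
    q = rewrite(query)
    kb = "hr" if "leave" in q else "tax"
    return generate(q, retrieve(q, kb))


def branch(query: str) -> str:
    # Query every knowledge base in parallel, then merge the results.
    q = rewrite(query)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda kb: retrieve(q, kb), KB))
    merged = [doc for docs in results for doc in docs]
    return generate(q, merged)


def loop(query: str, kb: str = "hr", max_rounds: int = 3) -> str:
    # Retry with a rewritten query until documents are found or rounds run out.
    q = rewrite(query)
    docs = retrieve(q, kb)
    rounds = 1
    while not docs and rounds < max_rounds:
        q = " ".join(SYNONYMS.get(word, word) for word in q.split())
        docs = retrieve(q, kb)
        rounds += 1
    return generate(q, docs)


print(sequential("  Tax deductions "))
print(conditional("Annual leave"))
print(branch("tax leave"))
print(loop("Vacation policy"))
```

The control flow is the part that carries over to a real system: a straight pipeline, a router, a fan-out with a merge step, and a bounded retry loop with a stopping condition.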

Implementing a RAG Flow

Two mainstream frameworks, LangChain and LlamaIndex, support modular RAG construction. LangChain offers broader LLM capabilities, while LlamaIndex provides out-of-the-box modules for retrieval, embedding, and fine-tuning.

Below is a concise implementation using LangChain:

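# Import paths assume the split-package LangChain layout (langchain-community,
# langchain-openai, langchain-text-splitters); adjust them for your installed version.
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Load knowledge documents
loader = TextLoader(file_path="test.txt")
document = loader.load()

# Split documents into chunks of roughly 500 characters, breaking on newlines
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=500, chunk_overlap=0)
data = text_splitter.split_documents(document)

# Embed the chunks and store them in a Chroma vector store
db = Chroma.from_documents(documents=data, embedding=OpenAIEmbeddings(), persist_directory="./chrom_db")

# Create a retriever that returns the top 5 chunks per query
retriever = db.as_retriever(search_kwargs={"k": 5})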
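To show what separator-based chunking with a size limit does to the text, here is a simplified, dependency-free sketch. It is not LangChain's implementation; `split_text` is a hypothetical helper that greedily packs separator-delimited pieces into chunks of at most `chunk_size` characters, with no overlap.

```python
from typing import List


def split_text(text: str, separator: str = "\n", chunk_size: int = 500) -> List[str]:
    """Split on `separator`, then pack pieces into chunks of at most `chunk_size` chars."""
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty lines
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than chunk_size is kept whole.
            current = piece
    if current:
        chunks.append(current)
    return chunks


chunks = split_text("aaaa\nbbbb\ncccc\ndddd", chunk_size=9)
print(chunks)
```

Smaller chunks tend to make retrieval more precise but carry less context each, so the chunk size is usually tuned against the documents and the embedding model in use.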

Query expansion is built with a simple LLM chain:

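from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0.0, model_name='gpt-3.5-turbo-1106')

template = """
You are a helpful AI assistant that generates multiple related search queries from a single input query.
Generate 5 search queries related to the following query: {query}
"""
prompt = ChatPromptTemplate.from_template(template)

# Prompt -> LLM -> plain string -> list of queries (one per non-empty line)
generate_queries = (
    prompt | llm | StrOutputParser() | (lambda x: [q for q in x.split("\n") if q.strip()])
)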
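LLM output for this kind of prompt often arrives as a numbered or bulleted list, so splitting on newlines alone can leave list markers and duplicates in the queries. The sketch below shows one possible cleanup step; `parse_queries` is a hypothetical helper, and keeping the original query at the front of the list is a design assumption, not something the chain above does.

```python
import re
from typing import List


def parse_queries(llm_output: str, original: str, limit: int = 5) -> List[str]:
    """Strip list markers, drop blanks and duplicates, and keep the original query first."""
    queries = [original]
    for line in llm_output.splitlines():
        # Remove leading markers such as "1.", "2)", "-", "*" or a bullet character.
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned and cleaned not in queries:
            queries.append(cleaned)
    return queries[: limit + 1]


raw = "1. What is RAG?\n\n2) How does RAG work?\n- What is RAG?\n"
queries = parse_queries(raw, original="RAG basics")
print(queries)
```

A function like this could replace the lambda at the end of the expansion chain, since LangChain coerces plain callables into runnables when they are piped.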

Reranking uses Reciprocal Rank Fusion (RRF):

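from langchain_core.load import dumps, loads

def reciprocal_rank_fusion(results: list[list], k=60):
    # Fuse several ranked lists: each document earns 1 / (rank + k) per list it appears in
    fused_scores = {}
    for docs in results:
        for rank, doc in enumerate(docs):
            # Serialize the Document so it can serve as a dictionary key
            doc_str = dumps(doc)
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            fused_scores[doc_str] += 1 / (rank + k)
    # Sort by fused score, highest first, and restore the Document objects
    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]
    return reranked_results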
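A small worked example makes the scoring easier to follow. The sketch below applies the same formula, 1 / (rank + k) with zero-based ranks from `enumerate`, to plain string IDs instead of LangChain documents; the rankings are made-up data for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Reciprocal Rank Fusion over lists of document IDs (zero-based ranks)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1 / (rank + k)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Three hypothetical result lists, one per generated query.
rankings = [["A", "B", "C"], ["B", "A", "D"], ["B", "C", "A"]]
fused = rrf(rankings)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Document B ranks first in two lists and second in one, so it scores 2/60 + 1/61 and comes out on top; A follows, then C, then D, which appears only once. The constant k dampens the gap between adjacent ranks, so a document that appears in many lists can outrank one that tops a single list.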

Compose the full RAG Flow chain:

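from langchain_core.runnables import RunnablePassthrough

# Fusion chain: generate queries → retrieve for each query → fuse with RRF
ragfusion_chain = generate_queries | retriever.map() | reciprocal_rank_fusion

# Final generation chain
template = """
Answer the question based on the following context:
{context}
===
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
final_chain = (
    {
        # Wrap the raw question so it fills the {query} variable of the expansion prompt
        "context": (lambda question: {"query": question}) | ragfusion_chain,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

# Test invocation: pass the question as a plain string
result = final_chain.invoke("Please explain the policy on special additional deductions for individual income tax")
print(result)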
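The fusion step returns a list of (document, score) pairs, which the chain above inserts into the prompt as-is. One possible refinement is to keep only the top results and format them as readable text first. The sketch below uses plain strings in place of LangChain documents (with real documents the text would come from `page_content`); `format_context` and its parameters are hypothetical.

```python
from typing import List, Tuple


def format_context(
    fused: List[Tuple[str, float]], top_n: int = 3, sep: str = "\n---\n"
) -> str:
    """Keep the top_n fused results and join their text into one context string."""
    top = fused[:top_n]
    return sep.join(f"[{i}] {text}" for i, (text, _score) in enumerate(top, start=1))


fused_results = [("a", 0.5), ("b", 0.4), ("c", 0.3), ("d", 0.1)]
context = format_context(fused_results, top_n=2)
print(context)
```

Numbering the passages also gives the model a simple way to cite which passage supports an answer, if the prompt asks for that.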

Conclusion

The shift from simple sequential RAG pipelines to modular, workflow-driven RAG Flows lets applications handle more complex business scenarios and can improve task accuracy. By combining multiple models, retrieval strategies, and iterative reasoning, developers can build more robust AI agents and knowledge-enhanced applications.

Reference links (GitHub, Medium, and the original survey) provide further details for readers who wish to explore the implementations.

Written by

AI Large Model Application Practice

Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.
