RAGFlow Primer Part 1: Introduction and Concept Deep Dive
This article gives a technical overview of RAGFlow, an industrial‑grade Retrieval‑Augmented Generation platform: its architecture, its core components (DeepDoc, intelligent chunking, embedding integration, multi‑stage retrieval, and the agent workflow), and how it addresses the shortcomings of traditional RAG.
Introduction
RAGFlow is an enterprise‑focused Retrieval‑Augmented Generation (RAG) framework that addresses limitations of traditional RAG systems by providing high‑precision document parsing, layout‑aware chunking, traceable citations, and multi‑step reasoning.
Traditional RAG pain points
- Document parsing quality: table structure is lost and images are ignored → Solution: the DeepDoc engine with layout‑aware recognition.
- Chunking strategy: fixed‑length cuts break semantic continuity → Solution: layout‑aware chunking that preserves headings, tables, and paragraphs.
- Lack of explainability: answers cannot be verified → Solution: detailed, clickable citations.
- No support for complex reasoning: only simple Q&A → Solution: an Agent workflow supporting multi‑step reasoning.
RAG concept recap
RAG consists of three steps: (1) Retrieval – fetch relevant documents; (2) Augmentation – provide retrieved evidence to the LLM; (3) Generation – produce an answer grounded in the evidence.
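The three steps can be sketched as a minimal loop. This is an illustrative sketch, not RAGFlow code; `search_index` and `llm` are stand‑in callables for a retriever and a language model.

```python
# Minimal sketch of the retrieve -> augment -> generate loop.
# `search_index` and `llm` are hypothetical stand-ins, not RAGFlow APIs.

def rag_answer(question: str, search_index, llm, top_k: int = 5) -> str:
    # 1. Retrieval: fetch the most relevant chunks for the question
    chunks = search_index(question, top_k=top_k)
    # 2. Augmentation: pack the retrieved evidence into the prompt
    context = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(chunks))
    prompt = (
        "Answer using only the evidence below; cite sources as [n].\n\n"
        f"Evidence:\n{context}\n\nQuestion: {question}"
    )
    # 3. Generation: the LLM grounds its answer in the evidence
    return llm(prompt)
```

Everything RAGFlow adds (layout‑aware parsing, multi‑recall, citations) refines one of these three stages.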
Concrete comparison
Question: “What was our company’s Q3 2024 sales?”
Traditional LLM response:
Sorry, I cannot provide your company's specific sales data because I have no access to real‑time or private information. Please refer to the financial statements.
RAGFlow response:
According to the 2024 Q3 financial report (page 3), sales were 12.45 million CNY, a 15.2 % YoY increase.
- Product A: 4.56 M (36.7 %)
- Product B: 2.34 M (18.8 %)
- Product C: 5.55 M (44.5 %)
[Source: 2024 Q3 financial report.pdf, page 3]
Core concepts and implementation
Knowledge Base
A Knowledge Base (KB) groups documents that share the same processing pipeline (chunking method, embedding model, retrieval strategy). The KB API is implemented in api/apps/kb_app.py and metadata is stored in a relational database.
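The settings a KB groups together can be pictured as a small config object. This is a hypothetical sketch; the field names are illustrative, not RAGFlow's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of the per-KB settings described above;
# field names are illustrative, not RAGFlow's actual schema.
@dataclass
class KnowledgeBaseConfig:
    name: str
    chunk_method: str = "general"          # key into CHUNKER_FACTORY
    embedding_model: str = "bge-large-zh-v1.5"
    similarity_threshold: float = 0.2      # minimum retrieval score
    top_n: int = 6                         # chunks returned per query

# Every document added to this KB inherits the same pipeline settings
kb = KnowledgeBaseConfig(name="finance-reports", chunk_method="paper")
```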
Text chunking
RAGFlow defines multiple chunking strategies in CHUNKER_FACTORY:

```python
CHUNKER_FACTORY = {
    "general": general_chunker,         # generic documents
    "naive": naive_chunker,             # fast plain-text
    "manual": manual_chunker,           # technical manuals
    "paper": paper_chunker,             # academic papers
    "book": book_chunker,               # long books
    "laws": laws_chunker,               # legal texts
    "presentation": ppt_chunker,        # slides
}
```

Unlike fixed‑length chunking, layout‑aware chunking respects semantic and structural boundaries, preserving headings, tables, and paragraphs.
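To make the contrast concrete, here is a minimal sketch (not RAGFlow's implementation) of fixed‑length cutting versus a heading‑aware split:

```python
# Sketch contrasting fixed-length chunking with a heading-aware split.
# `split_by_headings` is an illustrative stand-in for layout-aware chunking.

def fixed_length_chunks(text: str, size: int = 40) -> list[str]:
    # Cuts every `size` characters, even mid-sentence or mid-table
    return [text[i:i + size] for i in range(0, len(text), size)]

def split_by_headings(text: str) -> list[str]:
    # Starts a new chunk at each heading so sections stay intact
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "# Revenue\nQ3 sales rose 15.2%.\n# Costs\nCOGS fell 3%."
```

The fixed‑length version happily slices through a sentence or a table row; the heading‑aware version keeps each section whole, which is the property layout‑aware chunking generalizes to tables, lists, and paragraphs.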
Embedding
Chunks are transformed into high‑dimensional vectors via embedding models (e.g., bge-large-zh-v1.5) using the MCP protocol. The embedding logic resides in rag/utils/mcp_tool_call_conn.py, enabling seamless model swapping.
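Batching is the key operational detail: chunks are embedded in groups rather than one at a time. The sketch below illustrates the batching pattern with a deterministic stand‑in for the model; the real call goes through the MCP layer.

```python
import hashlib

# Sketch of batching chunks through an embedding model. `embed` is a
# deterministic stand-in; in RAGFlow the real call goes through MCP.

def embed(texts: list[str], dim: int = 8) -> list[list[float]]:
    vectors = []
    for t in texts:
        digest = hashlib.sha256(t.encode()).digest()
        # Map bytes into [0, 1) floats to simulate a dense vector
        vectors.append([b / 256 for b in digest[:dim]])
    return vectors

def embed_in_batches(chunks: list[str], batch_size: int = 16) -> list[list[float]]:
    out = []
    # Group chunks so each model call carries a full batch
    for i in range(0, len(chunks), batch_size):
        out.extend(embed(chunks[i:i + batch_size]))
    return out
```

Because the embedding call sits behind one interface, swapping bge-large-zh-v1.5 for another model only changes configuration, not the pipeline.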
Retrieval engine
The Searcher class (rag/nlp/search.py) implements a multi‑recall pipeline:
- Vector search (semantic KNN)
- BM25 keyword search
- Hybrid search (vector + keyword)
- Graph‑based knowledge‑graph search
Results from each recall are fused, deduplicated, and optionally reranked with a dedicated model. Performance metrics (total time, recall time, fusion time, rerank time) are recorded in the SearchResult object.
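A common way to fuse multiple recall lists is reciprocal‑rank fusion (RRF), sketched below; RAGFlow's exact fusion formula may differ, but the shape of the operation is the same: each list votes by rank, and duplicate hits accumulate score.

```python
# Sketch of reciprocal-rank fusion (RRF) over multiple recall lists.
# Illustrative of the fusion step, not RAGFlow's exact formula.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1 / (k + rank); duplicates accumulate,
            # so chunks found by several recalls rise to the top
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse([
    ["c2", "c7", "c1"],   # vector recall
    ["c7", "c2", "c9"],   # BM25 recall
])
```

Deduplication falls out for free (each chunk id appears once in `scores`), and the fused list is what an optional reranker model then reorders.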
Agent workflow
The Agent engine (agent/canvas.py) parses a DSL describing a directed graph of components. Execution follows the ReAct loop (Reason → Act → Observe):
1. Identify the start node.
2. Execute each component, handling LLM tool calls.
3. Record execution traces, tool usage, and token consumption.
4. Return a structured AgentExecutionResult with intermediate results and performance data.
Built‑in tools include search, Python code execution, web search, API calls, file reading, email, and database queries.
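A tool registry of this kind can be sketched as a name‑to‑function map that the engine dispatches LLM tool calls against. The decorator and names below are illustrative, not RAGFlow's actual API.

```python
# Hypothetical sketch of a tool registry; names are illustrative,
# not RAGFlow's actual tool API.

TOOL_REGISTRY = {}

def tool(name: str):
    # Decorator that registers a function under a tool name
    def register(fn):
        TOOL_REGISTRY[name] = fn
        return fn
    return register

@tool("calculator")
def calculator(expression: str) -> float:
    # Restricted eval for arithmetic only (no builtins exposed)
    return eval(expression, {"__builtins__": {}}, {})

def handle_tool_call(name: str, arguments: dict):
    # The agent engine looks up the tool the LLM asked for and runs it
    return TOOL_REGISTRY[name](**arguments)
```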
DeepDoc – Layout‑aware document understanding
DeepDoc parses PDFs, Word, Excel, and HTML while preserving structural information:
```
# Directory layout of DeepDoc parsers
deepdoc/
├── parser/
│   ├── pdf_parser.py          # layout analysis + OCR
│   ├── docx_parser.py         # structured extraction
│   ├── excel_parser.py        # table handling
│   └── html_parser.py         # web page structure
└── vision/
    └── layout_recognizer.py   # page layout detection
```

It extracts titles, tables, images, lists, and paragraphs, enabling downstream modules to work with semantically rich chunks.
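Routing a file to the right parser comes down to a dispatch on document type. The sketch below mirrors the directory layout above; the mapping and function are placeholders, not DeepDoc's actual interface.

```python
from pathlib import Path

# Sketch of dispatching a file to a parser by extension, mirroring the
# deepdoc/parser layout; the mapping here is a placeholder, not DeepDoc's API.

PARSER_BY_EXT = {
    ".pdf": "pdf_parser",      # layout analysis + OCR
    ".docx": "docx_parser",    # structured extraction
    ".xlsx": "excel_parser",   # table handling
    ".html": "html_parser",    # web page structure
}

def pick_parser(filename: str) -> str:
    ext = Path(filename).suffix.lower()
    if ext not in PARSER_BY_EXT:
        raise ValueError(f"unsupported document type: {ext}")
    return PARSER_BY_EXT[ext]
```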
End‑to‑end document processing pipeline
```python
# rag/svr/task_executor.py – core pipeline
async def build_chunks(task, progress_callback):
    # 1. Choose parser based on document type
    chunker = FACTORY[task["parser_id"].lower()]
    # 2. Retrieve binary from storage
    binary = await get_storage_binary(...)
    # 3. Parse with DeepDoc (async, semaphore-controlled)
    cks = await trio.to_thread.run_sync(
        lambda: chunker.chunk(..., binary=binary, ...)
    )
    # 4. Embed chunks in batch
    vectors = await embed_chunks_batch(cks, model_name="bge-large-zh-v1.5")
    # 5. Store in Elasticsearch for retrieval
    await store_chunks_to_es(cks, vectors, task["kb_id"])
    # 6. Update metadata and report progress
    await update_document_status(...)
    return len(cks)
```

Project structure
```
ragflow/
├── web/                       # Front-end (React + TypeScript)
├── api/                       # Backend API services
│   ├── apps/                  # Business modules (KB, dialogue, …)
│   ├── db/                    # Database models and services
│   └── ragflow_server.py      # Main service entry
├── deepdoc/                   # Deep document understanding
│   ├── parser/                # PDF, Word, Excel, HTML parsers
│   └── vision/                # Visual layout module
├── rag/                       # Core RAG engine
│   ├── nlp/                   # NLP utilities (tokenization, search)
│   ├── flow/                  # Data processing pipeline
│   └── llm/                   # LLM integration
├── agent/                     # Intelligent agent framework
│   ├── component/             # Agent components
│   ├── tools/                 # Tool implementations
│   └── canvas.py              # Workflow execution engine
├── docker/                    # Docker deployment configuration
├── conf/                      # System configuration files
└── mcp/                       # MCP protocol service (model calls)
```

Retrieval implementation details
Key methods in Searcher illustrate the full RAG flow:
```python
class Searcher:
    async def search(self, query: str, top_k: int = 10, chat_history: List = None) -> SearchResult:
        # 1. Intelligent query processing (multi-turn)
        processed_query = await self._intelligent_query_processing(query, chat_history)
        # 2. Parallel multi-recall
        vector_results = await self._vector_search(processed_query, top_k * 3)
        bm25_results = await self._bm25_search(processed_query, top_k * 3)
        hybrid_results = await self._hybrid_search(processed_query, top_k * 3)
        graph_results = await self._graph_search(processed_query, top_k * 2)
        # 3. Fusion, deduplication and ranking
        fused = self._intelligent_fusion(...)
        # 4. Optional reranking with a dedicated model
        if self.reranker:
            fused = await self._rerank(fused, processed_query)
        # 5. Post-processing (snippet extraction, filtering)
        final = await self._post_process_results(fused[:top_k], processed_query)
        return SearchResult(chunks=final, query_info=processed_query, performance_metrics={...})
```

Agent workflow execution
The canvas engine builds a component graph from a DSL and executes it using the ReAct loop:
```python
# agent/canvas.py – core execution
class Canvas(Graph):
    def __init__(self, dsl: str, tenant_id=None, task_id=None):
        self.dsl = json.loads(dsl) if isinstance(dsl, str) else dsl
        self.tenant_id = tenant_id
        self.task_id = task_id
        self._build_component_graph()
        self.execution_context = ExecutionContext()
        self.tool_registry = self._init_tool_registry()

    async def run(self, **kwargs) -> AgentExecutionResult:
        start_node = self._find_start_node()
        self.execution_context.initialize(initial_input=kwargs, tenant_id=self.tenant_id, task_id=self.task_id)
        final_result = await self._execute_node(start_node, {})
        return AgentExecutionResult(success=True, final_output=final_result, ...)

    async def _execute_node(self, node_id: str, inputs: dict) -> dict:
        component = self.get_node(node_id)
        node_inputs = self._prepare_node_inputs(node_id, inputs)
        output = await component.run(node_inputs)
        if component.component_name == "LLM" and "tool_calls" in output:
            output = await self._handle_tool_calls(output, node_id)
        next_nodes = self.get_outgoing_nodes(node_id)
        if next_nodes:
            return await self._execute_node(next_nodes[0], output)
        return output
```

Summary
RAGFlow combines precise document parsing, semantic‑aware chunking, traceable citations, multi‑modal retrieval, and programmable agent workflows to overcome the “forgetful” nature of vanilla LLMs. The open‑source codebase demonstrates concrete implementations of each component, making RAGFlow a practical reference for building production‑grade Retrieval‑Augmented Generation solutions.