RAGFlow Primer Part 1: Introduction and Concept Deep Dive
This article gives a technical overview of RAGFlow, an industrial‑grade Retrieval‑Augmented Generation platform: its architecture, its core components (DeepDoc, intelligent chunking, embedding integration, multi‑stage retrieval, and the agent workflow), and how it addresses the shortcomings of traditional RAG.
Introduction
RAGFlow is an enterprise‑focused Retrieval‑Augmented Generation (RAG) framework that addresses limitations of traditional RAG systems by providing high‑precision document parsing, layout‑aware chunking, traceable citations, and multi‑step reasoning.
Traditional RAG pain points
- Document parsing quality: table structure is lost and images are ignored → Solution: the DeepDoc engine with layout‑aware recognition.
- Chunking strategy: fixed‑length cuts break semantic continuity → Solution: layout‑aware chunking that preserves headings, tables, and paragraphs.
- Lack of explainability: answers cannot be verified → Solution: detailed, clickable citations.
- No support for complex reasoning: only simple Q&A → Solution: an Agent workflow supporting multi‑step reasoning.
RAG concept recap
RAG consists of three steps: (1) Retrieval – fetch relevant documents; (2) Augmentation – provide retrieved evidence to the LLM; (3) Generation – produce an answer grounded in the evidence.
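The three steps can be sketched as a minimal loop. This is an illustrative sketch, not RAGFlow code; `search_index` and `llm` are stand‑in callables for a retriever and a language model.

```python
# Minimal sketch of the retrieve -> augment -> generate loop.
# `search_index` and `llm` are hypothetical stand-ins, not RAGFlow APIs.

def rag_answer(question: str, search_index, llm, top_k: int = 5) -> str:
    # 1. Retrieval: fetch the most relevant chunks for the question
    chunks = search_index(question, top_k=top_k)
    # 2. Augmentation: pack the retrieved evidence into the prompt
    context = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(chunks))
    prompt = (
        "Answer using only the evidence below; cite sources as [n].\n\n"
        f"Evidence:\n{context}\n\nQuestion: {question}"
    )
    # 3. Generation: the LLM grounds its answer in the evidence
    return llm(prompt)
```

Everything RAGFlow adds (layout‑aware parsing, multi‑recall, citations) refines one of these three stages.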
Concrete comparison
Question: “What was our company’s Q3 2024 sales?”
Traditional LLM response:
Sorry, I cannot provide your company's specific sales data because I have no access to real‑time or private information. Please refer to the financial statements.
RAGFlow response:
According to the 2024 Q3 financial report (page 3), sales were 12.45 million CNY, a 15.2 % YoY increase.
- Product A: 4.56 M (36.7 %)
- Product B: 2.34 M (18.8 %)
- Product C: 5.55 M (44.5 %)
[Source: 2024 Q3 financial report.pdf, page 3]
Core concepts and implementation
Knowledge Base
A Knowledge Base (KB) groups documents that share the same processing pipeline (chunking method, embedding model, retrieval strategy). The KB API is implemented in api/apps/kb_app.py and metadata is stored in a relational database.
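The settings a KB groups together can be pictured as a small config object. This is a hypothetical sketch; the field names are illustrative, not RAGFlow's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of the per-KB settings described above;
# field names are illustrative, not RAGFlow's actual schema.
@dataclass
class KnowledgeBaseConfig:
    name: str
    chunk_method: str = "general"          # key into CHUNKER_FACTORY
    embedding_model: str = "bge-large-zh-v1.5"
    similarity_threshold: float = 0.2      # minimum retrieval score
    top_n: int = 6                         # chunks returned per query

# Every document added to this KB inherits the same pipeline settings
kb = KnowledgeBaseConfig(name="finance-reports", chunk_method="paper")
```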
Text chunking
RAGFlow defines multiple chunking strategies in CHUNKER_FACTORY:

```python
CHUNKER_FACTORY = {
    "general": general_chunker,         # generic documents
    "naive": naive_chunker,             # fast plain-text
    "manual": manual_chunker,           # technical manuals
    "paper": paper_chunker,             # academic papers
    "book": book_chunker,               # long books
    "laws": laws_chunker,               # legal texts
    "presentation": ppt_chunker,        # slides
}
```

Unlike fixed‑length chunking, layout‑aware chunking respects semantic and structural boundaries, preserving headings, tables, and paragraphs.
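To make the contrast concrete, here is a minimal sketch (not RAGFlow's implementation) of fixed‑length cutting versus a heading‑aware split:

```python
# Sketch contrasting fixed-length chunking with a heading-aware split.
# `split_by_headings` is an illustrative stand-in for layout-aware chunking.

def fixed_length_chunks(text: str, size: int = 40) -> list[str]:
    # Cuts every `size` characters, even mid-sentence or mid-table
    return [text[i:i + size] for i in range(0, len(text), size)]

def split_by_headings(text: str) -> list[str]:
    # Starts a new chunk at each heading so sections stay intact
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "# Revenue\nQ3 sales rose 15.2%.\n# Costs\nCOGS fell 3%."
```

The fixed‑length version happily slices through a sentence or a table row; the heading‑aware version keeps each section whole, which is the property layout‑aware chunking generalizes to tables, lists, and paragraphs.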
Embedding
Chunks are transformed into high‑dimensional vectors via embedding models (e.g., bge-large-zh-v1.5) using the MCP protocol. The embedding logic resides in rag/utils/mcp_tool_call_conn.py, enabling seamless model swapping.
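Batching is the key operational detail: chunks are embedded in groups rather than one at a time. The sketch below illustrates the batching pattern with a deterministic stand‑in for the model; the real call goes through the MCP layer.

```python
import hashlib

# Sketch of batching chunks through an embedding model. `embed` is a
# deterministic stand-in; in RAGFlow the real call goes through MCP.

def embed(texts: list[str], dim: int = 8) -> list[list[float]]:
    vectors = []
    for t in texts:
        digest = hashlib.sha256(t.encode()).digest()
        # Map bytes into [0, 1) floats to simulate a dense vector
        vectors.append([b / 256 for b in digest[:dim]])
    return vectors

def embed_in_batches(chunks: list[str], batch_size: int = 16) -> list[list[float]]:
    out = []
    # Group chunks so each model call carries a full batch
    for i in range(0, len(chunks), batch_size):
        out.extend(embed(chunks[i:i + batch_size]))
    return out
```

Because the embedding call sits behind one interface, swapping bge-large-zh-v1.5 for another model only changes configuration, not the pipeline.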
Retrieval engine
The Searcher class (rag/nlp/search.py) implements a multi‑recall pipeline:
- Vector search (semantic KNN)
- BM25 keyword search
- Hybrid search (vector + keyword)
- Graph‑based knowledge‑graph search
Results from each recall are fused, deduplicated, and optionally reranked with a dedicated model. Performance metrics (total time, recall time, fusion time, rerank time) are recorded in the SearchResult object.
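A common way to fuse multiple recall lists is reciprocal‑rank fusion (RRF), sketched below; RAGFlow's exact fusion formula may differ, but the shape of the operation is the same: each list votes by rank, and duplicate hits accumulate score.

```python
# Sketch of reciprocal-rank fusion (RRF) over multiple recall lists.
# Illustrative of the fusion step, not RAGFlow's exact formula.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1 / (k + rank); duplicates accumulate,
            # so chunks found by several recalls rise to the top
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse([
    ["c2", "c7", "c1"],   # vector recall
    ["c7", "c2", "c9"],   # BM25 recall
])
```

Deduplication falls out for free (each chunk id appears once in `scores`), and the fused list is what an optional reranker model then reorders.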
Agent workflow
The Agent engine (agent/canvas.py) parses a DSL describing a directed graph of components. Execution follows the ReAct loop (Reason → Act → Observe):
1. Identify the start node.
2. Execute each component, handling LLM tool calls.
3. Record execution traces, tool usage, and token consumption.
4. Return a structured AgentExecutionResult with intermediate results and performance data.
Built‑in tools include search, Python code execution, web search, API calls, file reading, email, and database queries.
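A tool registry of this kind can be sketched as a name‑to‑function map that the engine dispatches LLM tool calls against. The decorator and names below are illustrative, not RAGFlow's actual API.

```python
# Hypothetical sketch of a tool registry; names are illustrative,
# not RAGFlow's actual tool API.

TOOL_REGISTRY = {}

def tool(name: str):
    # Decorator that registers a function under a tool name
    def register(fn):
        TOOL_REGISTRY[name] = fn
        return fn
    return register

@tool("calculator")
def calculator(expression: str) -> float:
    # Restricted eval for arithmetic only (no builtins exposed)
    return eval(expression, {"__builtins__": {}}, {})

def handle_tool_call(name: str, arguments: dict):
    # The agent engine looks up the tool the LLM asked for and runs it
    return TOOL_REGISTRY[name](**arguments)
```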
DeepDoc – Layout‑aware document understanding
DeepDoc parses PDFs, Word, Excel, and HTML while preserving structural information:
```
# Directory layout of DeepDoc parsers
deepdoc/
├── parser/
│   ├── pdf_parser.py          # layout analysis + OCR
│   ├── docx_parser.py         # structured extraction
│   ├── excel_parser.py        # table handling
│   └── html_parser.py         # web page structure
└── vision/
    └── layout_recognizer.py   # page layout detection
```

It extracts titles, tables, images, lists, and paragraphs, enabling downstream modules to work with semantically rich chunks.
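Routing a file to the right parser comes down to a dispatch on document type. The sketch below mirrors the directory layout above; the mapping and function are placeholders, not DeepDoc's actual interface.

```python
from pathlib import Path

# Sketch of dispatching a file to a parser by extension, mirroring the
# deepdoc/parser layout; the mapping here is a placeholder, not DeepDoc's API.

PARSER_BY_EXT = {
    ".pdf": "pdf_parser",      # layout analysis + OCR
    ".docx": "docx_parser",    # structured extraction
    ".xlsx": "excel_parser",   # table handling
    ".html": "html_parser",    # web page structure
}

def pick_parser(filename: str) -> str:
    ext = Path(filename).suffix.lower()
    if ext not in PARSER_BY_EXT:
        raise ValueError(f"unsupported document type: {ext}")
    return PARSER_BY_EXT[ext]
```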
End‑to‑end document processing pipeline
```python
# rag/svr/task_executor.py – core pipeline
async def build_chunks(task, progress_callback):
    # 1. Choose parser based on document type
    chunker = FACTORY[task["parser_id"].lower()]
    # 2. Retrieve binary from storage
    binary = await get_storage_binary(...)
    # 3. Parse with DeepDoc (async, semaphore-controlled)
    cks = await trio.to_thread.run_sync(
        lambda: chunker.chunk(..., binary=binary, ...)
    )
    # 4. Embed chunks in batch
    vectors = await embed_chunks_batch(cks, model_name="bge-large-zh-v1.5")
    # 5. Store in Elasticsearch for retrieval
    await store_chunks_to_es(cks, vectors, task["kb_id"])
    # 6. Update metadata and report progress
    await update_document_status(...)
    return len(cks)
```

Project structure
```
ragflow/
├── web/                       # Front-end (React + TypeScript)
├── api/                       # Backend API services
│   ├── apps/                  # Business modules (KB, dialogue, …)
│   ├── db/                    # Database models and services
│   └── ragflow_server.py      # Main service entry
├── deepdoc/                   # Deep document understanding
│   ├── parser/                # PDF, Word, Excel, HTML parsers
│   └── vision/                # Visual layout module
├── rag/                       # Core RAG engine
│   ├── nlp/                   # NLP utilities (tokenization, search)
│   ├── flow/                  # Data processing pipeline
│   └── llm/                   # LLM integration
├── agent/                     # Intelligent agent framework
│   ├── component/             # Agent components
│   ├── tools/                 # Tool implementations
│   └── canvas.py              # Workflow execution engine
├── docker/                    # Docker deployment configuration
├── conf/                      # System configuration files
└── mcp/                       # MCP protocol service (model calls)
```

Retrieval implementation details
Key methods in Searcher illustrate the full RAG flow:
```python
class Searcher:
    async def search(self, query: str, top_k: int = 10, chat_history: List = None) -> SearchResult:
        # 1. Intelligent query processing (multi-turn)
        processed_query = await self._intelligent_query_processing(query, chat_history)
        # 2. Parallel multi-recall
        vector_results = await self._vector_search(processed_query, top_k * 3)
        bm25_results = await self._bm25_search(processed_query, top_k * 3)
        hybrid_results = await self._hybrid_search(processed_query, top_k * 3)
        graph_results = await self._graph_search(processed_query, top_k * 2)
        # 3. Fusion, deduplication and ranking
        fused = self._intelligent_fusion(...)
        # 4. Optional reranking with a dedicated model
        if self.reranker:
            fused = await self._rerank(fused, processed_query)
        # 5. Post-processing (snippet extraction, filtering)
        final = await self._post_process_results(fused[:top_k], processed_query)
        return SearchResult(chunks=final, query_info=processed_query, performance_metrics={...})
```

Agent workflow execution
The canvas engine builds a component graph from a DSL and executes it using the ReAct loop:
```python
# agent/canvas.py – core execution
class Canvas(Graph):
    def __init__(self, dsl: str, tenant_id=None, task_id=None):
        self.dsl = json.loads(dsl) if isinstance(dsl, str) else dsl
        self.tenant_id = tenant_id
        self.task_id = task_id
        self._build_component_graph()
        self.execution_context = ExecutionContext()
        self.tool_registry = self._init_tool_registry()

    async def run(self, **kwargs) -> AgentExecutionResult:
        start_node = self._find_start_node()
        self.execution_context.initialize(initial_input=kwargs, tenant_id=self.tenant_id, task_id=self.task_id)
        final_result = await self._execute_node(start_node, {})
        return AgentExecutionResult(success=True, final_output=final_result, ...)

    async def _execute_node(self, node_id: str, inputs: dict) -> dict:
        component = self.get_node(node_id)
        node_inputs = self._prepare_node_inputs(node_id, inputs)
        output = await component.run(node_inputs)
        if component.component_name == "LLM" and "tool_calls" in output:
            output = await self._handle_tool_calls(output, node_id)
        next_nodes = self.get_outgoing_nodes(node_id)
        if next_nodes:
            return await self._execute_node(next_nodes[0], output)
        return output
```

Summary
RAGFlow combines precise document parsing, semantic‑aware chunking, traceable citations, multi‑modal retrieval, and programmable agent workflows to overcome the “forgetful” nature of vanilla LLMs. The open‑source codebase demonstrates concrete implementations of each component, making RAGFlow a practical reference for building production‑grade Retrieval‑Augmented Generation solutions.