RAGFlow Deep Dive: Data Parsing and Knowledge Graph Construction

This article examines RAGFlow's end‑to‑end pipeline for turning diverse documents into structured knowledge, detailing the TaskExecutor factory, the DeepDoc layout‑aware parser, chunking strategies, embedding and storage mechanisms, and the GraphRAG‑based knowledge‑graph extraction that together enable high‑precision retrieval and reasoning.

Tech Freedom Circle

Document Parsing Architecture – TaskExecutor

The rag/svr/task_executor.py module defines a TaskExecutor class that implements a factory pattern. Its FACTORY dictionary maps a document-type string to a dedicated parser implementation (naive, paper, book, laws, manual, email, picture, audio, video, and so on). This design provides:

Extensibility: new parsers are added by inserting a new entry into FACTORY.

Specialization: each document type is processed by a purpose‑built parser.

Asynchronous processing: the whole pipeline runs on the Trio framework, enabling high‑concurrency handling of many documents.

class TaskExecutor:
    FACTORY = {
        "naive": naive,
        "paper": paper,
        "book": book,
        "laws": laws,
        "manual": manual,
        "one": one,
        "knowledge_graph": knowledge_graph,
        "email": email,
        "presentation": presentation,
        "picture": picture,
        "audio": audio,
        "video": video,
    }

All processing components reside under the rag/flow/ directory, reflecting a modular pipeline.
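The dispatch step can be sketched as follows. The `dispatch` helper and the shape of the task dict are illustrative assumptions, not RAGFlow's actual API; the point is that a single string lookup selects the parser:

```python
# Minimal sketch of factory dispatch, assuming each registered parser is a
# callable taking (filename, binary). `dispatch` and the task dict shape are
# illustrative, not RAGFlow's actual code.
def dispatch(factory: dict, task: dict):
    parser_type = task.get("parser_id", "naive")  # fall back to the naive parser
    try:
        return factory[parser_type]
    except KeyError:
        raise ValueError(f"no parser registered for {parser_type!r}") from None

# Usage: register a stub parser, then resolve it by its type string.
demo_factory = {"naive": lambda name, blob: [blob.decode()]}
parser = dispatch(demo_factory, {"parser_id": "naive"})
```

Because each parser is just a callable behind a string key, adding a new document type never touches the executor loop itself.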

Parser and DeepDoc Engine

The parser factory in rag/flow/parser/parser.py selects a concrete parser based on the ParserType enum. Simple text files (.txt, .md) use the naive parser, while complex formats (PDF, DOCX) invoke the DeepDoc engine located in deepdoc/.

FACTORY = {
    "general": naive,
    ParserType.NAIVE.value: naive,
    ParserType.PAPER.value: paper,
    ParserType.BOOK.value: book,
    ParserType.PRESENTATION.value: presentation,
    ParserType.MANUAL.value: manual,
    ParserType.LAWS.value: laws,
    ParserType.QA.value: qa,
    ParserType.TABLE.value: table,
    ParserType.RESUME.value: resume,
    ParserType.PICTURE.value: picture,
    ParserType.ONE.value: one,
    ParserType.AUDIO.value: audio,
    ParserType.EMAIL.value: email,
    ParserType.KG.value: naive,
    ParserType.TAG.value: tag,
}

DeepDoc combines computer vision and NLP to produce a unified Document object. Its core capabilities are:

Layout analysis: a deep‑learning model detects text blocks, tables, images, and headings, along with their hierarchical relationships.

Table extraction: tables are extracted with rows, columns, and semantics preserved, and output as Markdown or HTML.

Formula recognition: mathematical expressions in scientific documents are captured.

Image content understanding: OCR extracts text from images and infers its semantic meaning.

Example JSON output from DeepDoc:

{
    "page_num": 1,
    "blocks": [
        {"type": "title", "content": "Section Title", "bbox": [x1, y1, x2, y2], "level": 1},
        {"type": "paragraph", "content": "Body text...", "bbox": [x1, y1, x2, y2]},
        {"type": "table", "content": "<table>...</table>", "bbox": [x1, y1, x2, y2],
         "markdown": "| Column 1 | Column 2 |\n|----------|----------|\n| Value 1 | Value 2 |"}
    ]
}
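Downstream stages can consume this block list directly. A small sketch, following the block schema above (the `sections_from_page` helper is illustrative, not RAGFlow code), groups blocks into title-anchored sections:

```python
# Group a page's blocks into title-anchored sections, following the block
# schema shown above (illustrative helper, not RAGFlow's actual code).
def sections_from_page(page: dict) -> list[dict]:
    sections, current = [], {"title": None, "blocks": []}
    for block in page["blocks"]:
        if block["type"] == "title":
            # A title block starts a new section; flush the previous one.
            if current["blocks"] or current["title"]:
                sections.append(current)
            current = {"title": block["content"], "blocks": []}
        else:
            current["blocks"].append(block["content"])
    sections.append(current)
    return sections

page = {"page_num": 1, "blocks": [
    {"type": "title", "content": "Section Title", "level": 1},
    {"type": "paragraph", "content": "Body text..."},
]}
```

Keeping the title hierarchy attached to each section is what lets later stages chunk along semantic boundaries rather than raw character offsets.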

Chunking Strategies

After parsing, long texts are split into semantically coherent chunks. The implementation lives in rag/flow/chunker/chunker.py and offers two main strategies:

naive: uses RecursiveCharacterTextSplitter from LangChain to split by character count with a configurable overlap.

layout‑aware: leverages DeepDoc's layout metadata to keep whole paragraphs, tables, and headings together, preventing semantic breakage.

def _naive_chunk(text, chunk_size, overlap):
    # Character-count splitting with a configurable overlap between chunks.
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    return splitter.split_text(text)
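For readers without LangChain installed, the overlap behaviour can be approximated with a dependency-free sliding window. This is a simplification: the real splitter also recurses over a hierarchy of separators (paragraphs, sentences, words) before falling back to raw characters:

```python
def sliding_window_chunk(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size windows with `overlap` characters shared between neighbours."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    # Stop once the remaining tail is fully covered by the previous window.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

# "abcdefgh" with chunk_size=4, overlap=2 yields windows starting at 0, 2, 4.
```

The overlap ensures that a sentence cut at a chunk boundary still appears intact in one of its two neighbouring chunks.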

The layout‑aware approach groups blocks that belong to the same visual region (e.g., a full paragraph or an entire table) into a single chunk, which improves downstream retrieval accuracy.
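A hedged sketch of the layout-aware idea, assuming blocks carry the `type` and `content` fields from the DeepDoc output shown earlier (the size accounting and table handling here are illustrative, not RAGFlow's implementation):

```python
def layout_chunk(blocks: list[dict], max_chars: int) -> list[str]:
    """Pack whole blocks into chunks; never split a paragraph or a table."""
    chunks, buf, size = [], [], 0
    for block in blocks:
        content = block["content"]
        # A table is kept atomic in its own chunk regardless of size.
        if block["type"] == "table":
            if buf:
                chunks.append("\n".join(buf))
                buf, size = [], 0
            chunks.append(content)
            continue
        # Flush the buffer when adding this block would exceed the budget.
        if size + len(content) > max_chars and buf:
            chunks.append("\n".join(buf))
            buf, size = [], 0
        buf.append(content)
        size += len(content)
    if buf:
        chunks.append("\n".join(buf))
    return chunks
```

Because boundaries only ever fall between blocks, no chunk starts mid-paragraph or contains half a table.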

Embedding and Storage

Chunks are vectorized by an embedding service accessed through the MCP protocol. The resulting vectors, together with metadata (content, document ID, chunk ID, etc.), are stored in a search engine. Storage adapters live under rag/utils/*_conn.py; for Elasticsearch the relevant class is EsConnection.

from typing import List
from elasticsearch.helpers import bulk

class EsConnection:
    def bulk_insert(self, chunks: List[dict]):
        # Build one bulk action per chunk, keyed by its chunk_id.
        actions = []
        for chunk in chunks:
            actions.append({
                "_index": self.index_name,
                "_id": chunk["chunk_id"],
                "_source": {
                    "content": chunk["content"],
                    "vector": chunk["vector"],
                    "doc_id": chunk["doc_id"],
                    # ... other metadata
                },
            })
        bulk(self.client, actions)

The TaskExecutor batches chunks, sends them to the embedding service, receives vectors, and writes the enriched documents to Elasticsearch.
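That batch-embed-write loop can be sketched as below. `embed_fn` and `store_fn` are stand-ins for the MCP-wrapped embedding service and the EsConnection connector; the function name and batch size are illustrative assumptions:

```python
from typing import Callable

def embed_and_store(chunks: list[dict],
                    embed_fn: Callable[[list[str]], list[list[float]]],
                    store_fn: Callable[[list[dict]], None],
                    batch_size: int = 16) -> int:
    """Batch chunks, attach vectors from the embedding service, write each batch."""
    written = 0
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        # One embedding call per batch keeps round-trips to the service low.
        vectors = embed_fn([c["content"] for c in batch])
        for chunk, vec in zip(batch, vectors):
            chunk["vector"] = vec
        store_fn(batch)  # e.g. EsConnection.bulk_insert
        written += len(batch)
    return written
```

Batching amortizes both the embedding-service round-trip and the Elasticsearch bulk write over many chunks.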

Knowledge Graph Construction

In the graphrag/ directory RAGFlow integrates a GraphRAG pipeline that extracts entity‑relation‑entity triples from the parsed chunks using large‑language‑model (LLM) prompting (see graphrag/light/graph_prompt.py and graphrag/light/graph_extractor.py).
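A minimal sketch of turning a delimited LLM completion into triples. The `(head|relation|tail)` line format here is an assumption chosen for illustration, not RAGFlow's actual prompt contract from graph_prompt.py:

```python
import re

def parse_triples(llm_output: str) -> list[tuple[str, str, str]]:
    """Extract (head, relation, tail) triples from lines like (A|rel|B)."""
    triples = []
    # Three pipe-separated fields inside parentheses, none containing | ( ).
    for match in re.finditer(r"\(([^|()]+)\|([^|()]+)\|([^|()]+)\)", llm_output):
        head, rel, tail = (part.strip() for part in match.groups())
        triples.append((head, rel, tail))
    return triples

reply = "(RAGFlow|uses|DeepDoc)\n(DeepDoc|performs|layout analysis)"
```

A rigid, machine-parseable output format is what makes LLM-based extraction robust enough to feed a graph store.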

The resulting knowledge graph is combined with vector search to provide:

Hybrid retrieval: factual queries can be answered directly from the graph, yielding higher precision than pure vector similarity.

Multi‑hop reasoning: answers that require chaining multiple knowledge points are supported.

Result explanation: graph paths are returned as citations, improving interpretability.
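One common way to combine the two signal sources is score-level fusion; a hedged sketch, where the weight and the per-chunk score dictionaries are illustrative rather than RAGFlow's actual ranking code:

```python
def hybrid_rank(vector_hits: dict[str, float],
                graph_hits: dict[str, float],
                graph_weight: float = 0.5) -> list[str]:
    """Blend vector-similarity scores with graph-derived scores per chunk id."""
    ids = set(vector_hits) | set(graph_hits)
    scored = {
        cid: (1 - graph_weight) * vector_hits.get(cid, 0.0)
             + graph_weight * graph_hits.get(cid, 0.0)
        for cid in ids
    }
    # Highest blended score first.
    return sorted(scored, key=scored.get, reverse=True)
```

A chunk backed by an explicit graph path can outrank one that is merely similar in embedding space, which is the precision gain the bullets above describe.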

Key Technical Outcomes

TaskExecutor implements a factory‑based, asynchronous document‑processing pipeline.

DeepDoc fuses CV and NLP to handle complex layouts, tables, formulas and images.

Layout‑aware chunking preserves semantic boundaries while keeping chunk size manageable.

Embedding vectors and metadata are bulk‑written to Elasticsearch via a concise connector.

GraphRAG adds structured knowledge extraction, enabling hybrid search, multi‑hop reasoning and explainable results.

Written by

Tech Freedom Circle

Crazy Maker Circle (Tech Freedom Architecture Circle): a community of tech enthusiasts, experts, and performance fans. Many senior engineers, architects, and hobbyists here have already achieved tech freedom, and another wave of go‑getters is working hard toward it.
