Boost RAG Answer Accuracy: Detailed Step‑by‑Step GraphRAG Knowledge‑Graph Construction

This article walks through the complete GraphRAG knowledge‑graph building pipeline—text splitting, entity extraction, relation mining, community clustering, and report generation—using a concrete example from the book “The Age of Big Data,” and explains why each step improves retrieval and answer quality.

Fun with Large Models

GraphRAG Knowledge‑Graph Construction Overview

GraphRAG converts raw text into a knowledge graph so that retrieval can span multiple contexts and answer complex questions more accurately than plain text chunks.

1. Text Splitting (Text Unit Creation)

The source document is divided into fixed‑size Text Units of 50 tokens each. For the example book description the three resulting units are:

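Text Unit 1 : “The Age of Big Data is a book co‑authored by Viktor Mayer‑Schönberger and Kenneth Cukier that discusses how to mine valuable information from massive amounts of data.”

Text Unit 2 : “The book takes an in‑depth look at applications of data science and describes the influence of data analysis and prediction across industries.”

Text Unit 3 : “In the book, the authors give many real‑world examples showing how big data changes our lives and can even predict future trends.”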

Each unit is recorded in a Text Block Table (id, human_readable_id, text, n_tokens, document_ids).
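The splitting step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual chunker: it assumes whitespace-separated words stand in for model tokens, and the `TextUnit` class, the `split_into_text_units` function, and the hash-based `id` are hypothetical names chosen to mirror the table columns listed above.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class TextUnit:
    """One row of the Text Block Table (hypothetical class name)."""
    id: str
    human_readable_id: int
    text: str
    n_tokens: int
    document_ids: List[str]


def split_into_text_units(text: str, document_id: str,
                          chunk_size: int = 50) -> List[TextUnit]:
    # Assumption: whitespace tokens approximate model tokens.
    tokens = text.split()
    units = []
    for index, start in enumerate(range(0, len(tokens), chunk_size)):
        chunk = tokens[start:start + chunk_size]
        chunk_text = " ".join(chunk)
        units.append(TextUnit(
            # Assumption: a short content hash serves as the row id.
            id=hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()[:12],
            human_readable_id=index,
            text=chunk_text,
            n_tokens=len(chunk),
            document_ids=[document_id],
        ))
    return units


# A synthetic 120-token document yields units of 50, 50, and 20 tokens.
sample = " ".join(f"w{i}" for i in range(120))
units = split_into_text_units(sample, document_id="doc-0")
```

A real tokenizer would replace `text.split()`; the rest of the row layout stays the same.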

2. Entity Extraction

A large language model processes each Text Unit and extracts named entities (people, books, concepts, events). The extracted entities for the example are:

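e1 – The Age of Big Data (type: book, description: “The Age of Big Data is a book about big data applications in which the authors discuss how data is changing the world.”, text_unit_ids: [0,1])

e2 – Viktor Mayer‑Schönberger (type: person, description: “An expert in the big data field and co‑author of The Age of Big Data.”, text_unit_ids: [0,1])

e3 – Kenneth Cukier (type: person, description: “An expert in the big data field and co‑author of The Age of Big Data.”, text_unit_ids: [0,1])

e4 – data science (type: event, description: “The core topic discussed in The Age of Big Data.”, text_unit_ids: [1])

e5 – data analysis (type: event, description: “A concrete practice within data science whose value is amplified in the big data era.”, text_unit_ids: [1,2])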
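Because the same entity can be found in several Text Units, the per-unit results have to be merged into one row per entity. The sketch below shows one plausible way to do that. The LLM call itself is not shown: the `mentions` list is hand-written sample data standing in for model output, and `merge_mentions` is a hypothetical helper that keeps the first description it sees (a real pipeline might summarize all of them instead).

```python
from typing import Dict, Iterable, List, Tuple

# (name, type, description, text_unit_id) as a hypothetical extractor might return it
Mention = Tuple[str, str, str, int]


def merge_mentions(mentions: Iterable[Mention]) -> List[dict]:
    """Collapse per-unit mentions into one entity row per distinct name."""
    table: Dict[str, dict] = {}
    for name, entity_type, description, text_unit_id in mentions:
        # Assumption: the first description seen for a name is kept.
        row = table.setdefault(name, {
            "title": name,
            "type": entity_type,
            "description": description,
            "text_unit_ids": [],
        })
        if text_unit_id not in row["text_unit_ids"]:
            row["text_unit_ids"].append(text_unit_id)
    # Assign ids e1, e2, ... in order of first appearance.
    return [{"id": f"e{i}", **row}
            for i, row in enumerate(table.values(), start=1)]


mentions = [
    ("The Age of Big Data", "book", "A book about big data applications.", 0),
    ("Viktor Mayer-Schönberger", "person", "Co-author of the book.", 0),
    ("The Age of Big Data", "book", "A book about big data applications.", 1),
    ("data science", "event", "The core topic of the book.", 1),
]
entities = merge_mentions(mentions)
```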

3. Relation Extraction

Pairs of entities are linked with a semantic relation, forming triples. Example triple for the sentence “Viktor Mayer‑Schönberger is the author of The Age of Big Data”:

Entity A: Viktor Mayer‑Schönberger (person)
Entity B: The Age of Big Data (book)
Relation: is the author of

The resulting Relation Table (id, source, target, description, weight, combined_degree, text_unit_ids) contains:

relation_1: source = Viktor Mayer‑Schönberger, target = The Age of Big Data, description = author, weight = 0.9, combined_degree = 1, text_unit_ids = [0,1]

relation_2: source = Kenneth Cukier, target = The Age of Big Data, description = author, weight = 0.7, combined_degree = 1, text_unit_ids = [1]

relation_3: source = The Age of Big Data, target = data science, description = explores, weight = 0.65, combined_degree = 1, text_unit_ids = [2]

relation_4: source = The Age of Big Data, target = data analysis, description = expounds on, weight = 0.65, combined_degree = 1, text_unit_ids = [2]
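A relation row such as relation_1, which lists two Text Units, implies that the same triple was seen more than once and then merged. The following sketch illustrates that merge under stated assumptions: the `Relation` class and `build_relation_table` are hypothetical names, the second weight of 0.8 is made-up sample data, and keeping the maximum weight is only one possible policy (summing is another).

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Relation:
    source: str
    target: str
    description: str
    weight: float
    text_unit_ids: List[int] = field(default_factory=list)


# (source, description, target, weight, text_unit_id)
Triple = Tuple[str, str, str, float, int]


def build_relation_table(triples: Iterable[Triple]) -> List[Relation]:
    """Merge repeated triples into one row, collecting their text unit ids."""
    merged: Dict[Tuple[str, str, str], Relation] = {}
    for source, description, target, weight, text_unit_id in triples:
        key = (source, target, description)
        relation = merged.get(key)
        if relation is None:
            merged[key] = Relation(source, target, description,
                                   weight, [text_unit_id])
            continue
        # Assumption: keep the strongest weight seen for a repeated triple.
        relation.weight = max(relation.weight, weight)
        if text_unit_id not in relation.text_unit_ids:
            relation.text_unit_ids.append(text_unit_id)
    return list(merged.values())


triples = [
    ("Viktor Mayer-Schönberger", "author", "The Age of Big Data", 0.9, 0),
    ("Viktor Mayer-Schönberger", "author", "The Age of Big Data", 0.8, 1),
    ("Kenneth Cukier", "author", "The Age of Big Data", 0.7, 1),
]
relations = build_relation_table(triples)
```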

4. Graph Metrics – Degree and Level

For each entity the degree (number of incident edges) and level (hierarchical depth) are computed:

The Age of Big Data: degree = 4, level = 1 (core entity)

Viktor Mayer‑Schönberger, Kenneth Cukier, data science, data analysis: degree = 1, level = 2 (peripheral)

These metrics drive a simple community clustering: high‑degree/low‑level nodes form **Community 1** (core), the rest form **Community 2** (outer).
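The numbers above can be reproduced with a few lines of graph code. This sketch follows the article's simplified rule rather than any library's clustering algorithm, and it makes two assumptions that are not stated in the text: the highest-degree node is taken as the root, and level is defined as breadth-first distance from that root plus one. The function names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple

edges: List[Tuple[str, str]] = [
    ("Viktor Mayer-Schönberger", "The Age of Big Data"),
    ("Kenneth Cukier", "The Age of Big Data"),
    ("The Age of Big Data", "data science"),
    ("The Age of Big Data", "data analysis"),
]


def compute_degree(edges: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count the edges incident to each node."""
    degree: Dict[str, int] = defaultdict(int)
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return dict(degree)


def compute_level(edges: List[Tuple[str, str]], root: str) -> Dict[str, int]:
    """Assumption: level = breadth-first distance from the root, plus one."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    level = {root: 1}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in neighbors[node]:
            if neighbor not in level:
                level[neighbor] = level[node] + 1
                queue.append(neighbor)
    return level


degree = compute_degree(edges)
root = max(degree, key=degree.get)  # assumption: highest-degree node is the core
level = compute_level(edges, root)
# Level-1 nodes form Community 1 (core); everything else forms Community 2 (outer).
communities = {name: 1 if level[name] == 1 else 2 for name in degree}
```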

5. Community Report Generation

Two reports are produced based on the community assignment.

Core Community Report (community 1)

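title: The Impact of The Age of Big Data

summary: Centers on the book The Age of Big Data and discusses the applications of data science and data analysis and their impact on industry.

full_content: The book analyzes application scenarios for big data through several real‑world cases, with a focus on how to predict future trends.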

rank: 1 (central)

Peripheral Community Report (community 2)

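title: The Experts and Theory Behind The Age of Big Data

summary: Introduces the authors Viktor Mayer‑Schönberger and Kenneth Cukier and the disciplines “data science” and “data analysis,” and explains how they support the core book.

full_content: The authors' influence in the fields of data science and data analysis, and how these disciplines support the book's core arguments.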

rank: 2 (outer)

Each report includes fields such as community, level, title, summary, full_content, rank, timestamps, and optional JSON payloads.
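One way to picture a report record is as a small data class that serializes to JSON. The sketch below is illustrative only: `CommunityReport`, `reports_to_json`, and the `created_at` field name are hypothetical, the `level` values are assumed to match the entity levels from step 4, and the text that an LLM would normally write is filled in by hand with shortened versions of the two reports above.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class CommunityReport:
    community: int
    level: int
    title: str
    summary: str
    full_content: str
    rank: int
    # Assumption: a single ISO-8601 creation timestamp per report.
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())


def reports_to_json(reports: List[CommunityReport]) -> str:
    """Serialize reports ordered by rank, most central first."""
    ordered = sorted(reports, key=lambda report: report.rank)
    return json.dumps([asdict(report) for report in ordered],
                      ensure_ascii=False, indent=2)


reports = [
    CommunityReport(
        community=2, level=2, rank=2,
        title="The Experts and Theory Behind The Age of Big Data",
        summary="Introduces the authors and the supporting disciplines.",
        full_content="How the authors and disciplines support the book.",
    ),
    CommunityReport(
        community=1, level=1, rank=1,
        title="The Impact of The Age of Big Data",
        summary="Centers on the book and its industry impact.",
        full_content="Real-world cases and predicting future trends.",
    ),
]
payload = reports_to_json(reports)
```

Passing `ensure_ascii=False` keeps non-ASCII text, such as the original Chinese titles, readable in the output.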

6. Final Knowledge‑Graph Assembly

All tables—Text Block, Entity, Relation, Entity‑Relation, and Community Report—are merged into a single graph. The graph can be queried for multi‑hop reasoning, providing richer context than raw text chunks.
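To make the multi-hop claim concrete, here is a minimal sketch of a two-hop lookup over the relation rows from step 3. It assumes relations may be traversed in both directions; `build_adjacency`, `multi_hop`, and the "inverse of" label are hypothetical names, and a production system would query a graph store rather than a Python dictionary.

```python
from collections import deque
from typing import Dict, List, Tuple

relations = [
    ("Viktor Mayer-Schönberger", "author", "The Age of Big Data"),
    ("Kenneth Cukier", "author", "The Age of Big Data"),
    ("The Age of Big Data", "explores", "data science"),
    ("The Age of Big Data", "expounds on", "data analysis"),
]

Step = Tuple[str, str]  # (relation description, node reached)


def build_adjacency(relations) -> Dict[str, List[Step]]:
    adjacency: Dict[str, List[Step]] = {}
    for source, description, target in relations:
        adjacency.setdefault(source, []).append((description, target))
        # Assumption: every relation can also be followed backwards.
        adjacency.setdefault(target, []).append(
            (f"inverse of '{description}'", source))
    return adjacency


def multi_hop(adjacency: Dict[str, List[Step]], start: str,
              max_hops: int) -> Dict[str, List[Step]]:
    """Breadth-first search; returns the shortest path to each reachable node."""
    paths: Dict[str, List[Step]] = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if len(paths[node]) == max_hops:
            continue  # do not expand beyond the hop limit
        for description, neighbor in adjacency.get(node, []):
            if neighbor not in paths:
                paths[neighbor] = paths[node] + [(description, neighbor)]
                queue.append(neighbor)
    del paths[start]
    return paths


adjacency = build_adjacency(relations)
reachable = multi_hop(adjacency, "Kenneth Cukier", max_hops=2)
```

Starting from one author, a single hop reaches only the book, while two hops also reach the co-author and both topics. A single text chunk that mentions only the author could not supply that context.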

[Figure: illustrative workflow diagram]

The example demonstrates why mastering each construction step—text splitting, entity extraction, relation mining, degree/level analysis, community clustering, and report generation—yields higher answer accuracy and more powerful retrieval in GraphRAG systems.

Written by

Fun with Large Models

Master's graduate from Beijing Institute of Technology, published four top‑journal papers, previously worked as a developer at ByteDance and Alibaba. Currently researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, believing that AI large models will become as essential as PCs in the future. Let's start experimenting now!
