How to Build and Index Microsoft GraphRAG with Neo4j: A Step‑by‑Step Guide
This article explains the fundamentals of Microsoft GraphRAG and walks through its indexing pipeline: text chunking, entity and relationship extraction, description generation, and community detection. It then shows how to set up the graphrag library, generate adaptive prompts, build the index, and import the resulting graph into Neo4j for visualization and analysis.
Overview
Microsoft GraphRAG is an open‑source framework that extends classic Retrieval‑Augmented Generation (RAG) by first converting raw documents into a knowledge graph and then summarizing the graph back into natural language. This enables query‑focused summarization (QFS) tasks that require high‑level semantic understanding.
Core Principles
The Microsoft paper "From Local to Global: A Graph RAG Approach to Query‑Focused Summarization" identifies the limitation of classic RAG in answering questions such as "What topics do these datasets express?" and proposes a two‑stage pipeline:
Build a knowledge graph from multiple source documents.
Detect communities in the graph (e.g., using the Leiden algorithm) and generate a natural‑language summary for each community. Queries are answered by retrieving community‑level information and aggregating it.
Indexing Stage
The indexing phase consists of five steps:
Text chunking: Split raw documents into smaller blocks, similar to classic RAG.
Entity & relationship extraction: Use a large language model (LLM) to identify entities and relationships within each chunk.
Generate entity/relationship descriptions: Produce concise descriptive texts for each entity and relationship; these are stored as description properties on graph nodes.
Community detection: Apply a community-detection algorithm (Leiden via the Graspologic library) to group related entities.
Community summarization: Use an LLM to create a natural-language report for each community, which serves as the answer source for QFS queries.
Descriptions can be embedded (e.g., description_embedding) to improve vector‑based retrieval.
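Step 1 is the only part of the pipeline that needs no LLM, so it is easy to illustrate. The sketch below is a toy sliding-window chunker: GraphRAG chunks by tokens with configurable size and overlap, whereas this version splits on whitespace, and the size/overlap values are illustrative defaults, not the library's.

```python
def chunk_text(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of whitespace tokens.

    Overlap keeps entities that straddle a chunk boundary visible
    to the extraction step in at least one chunk.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks
```

A 700-token document with size 300 and overlap 50 yields three chunks of 300, 300, and 200 tokens.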
Setting Up the graphrag Library
All commands assume a Python virtual environment with the graphrag package installed (pip install graphrag).
```shell
python -m graphrag.index --init --root ./msgraphrag
```
After initialization the msgraphrag directory contains:
input: place raw .txt or .csv documents here.
.env and settings.yaml: configure LLM access (OpenAI or Azure OpenAI).
prompts: four LLM prompt templates — entity_extraction, summarize_descriptions, community_report, and the optional claim_extraction.
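A minimal settings.yaml might look like the fragment below. The keys follow the graphrag configuration schema, but treat this as a sketch: check the file generated by the init command for the exact names and defaults of your graphrag version.

```yaml
# settings.yaml — illustrative fragment only
llm:
  api_key: ${GRAPHRAG_API_KEY}   # resolved from .env
  type: openai_chat              # or azure_openai_chat
  model: gpt-4o-mini             # lightweight model for testing
chunks:
  size: 300
  overlap: 100
```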
Adaptive prompts can be generated with:
```shell
python -m graphrag.prompt_tune --language Chinese
```
Creating the Index
```shell
python -m graphrag.index --root ./msgraphrag
```
The command writes Parquet files to the output folder. These files are later loaded into memory and vector stores for retrieval.
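The Parquet artifacts can be inspected directly with Pandas before any import. The artifact name create_final_entities below matches the default workflow output in recent graphrag releases, but names can vary by version, so verify against your own output folder (which is nested under a timestamped run directory).

```python
from pathlib import Path

import pandas as pd

# Adjust to your run's timestamped artifacts folder.
OUTPUT_DIR = Path("./msgraphrag/output")

def load_artifact(name: str) -> pd.DataFrame:
    """Load one indexing artifact, e.g. 'create_final_entities'."""
    return pd.read_parquet(OUTPUT_DIR / f"{name}.parquet")

def entity_overview(entities: pd.DataFrame) -> pd.Series:
    """Count extracted entities per type, most common first."""
    return entities["type"].value_counts()

# Guarded so the sketch is importable without having run the indexer.
if OUTPUT_DIR.exists():
    print(entity_overview(load_artifact("create_final_entities")))
```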
Importing into Neo4j
Parquet files can be read with Pandas and imported into Neo4j via Cypher. A community‑provided Jupyter notebook automates this process.
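The core pattern the notebook relies on is a batched UNWIND: send row dictionaries to Neo4j in chunks so each transaction stays small. The Cypher below is a simplified version for entities only (the label and property names mirror the reference notebook's schema; entities_df in the usage comment is a hypothetical DataFrame loaded from the Parquet output).

```python
import pandas as pd

# Merge entity rows in bulk; adjust labels/properties to your schema.
IMPORT_ENTITIES = """
UNWIND $rows AS row
MERGE (e:__Entity__ {name: row.name})
SET e.description = row.description
"""

def batched(df: pd.DataFrame, size: int = 1000):
    """Yield a DataFrame as lists of row dicts, `size` rows at a time."""
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size].to_dict("records")

# Usage with the official driver (pip install neo4j):
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687",
#                               auth=("neo4j", "password"))
# with driver.session() as session:
#     for rows in batched(entities_df):
#         session.run(IMPORT_ENTITIES, rows=rows)
```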
After import the Neo4j graph contains node types Entity, Community, Chunk, Document and relationship types RELATED, PART_OF, HAS_ENTITY, IN_COMMUNITY. In the demo dataset 690 nodes and 1,793 relationships were created.
Example Cypher query to list the top‑degree entities:
```cypher
MATCH (n:__Entity__)
RETURN n.name AS name,
       COUNT { (n)-[:RELATED]-() } AS degree
ORDER BY degree DESC
LIMIT 10
```
Note that Cypher does not allow count() over a pattern; the COUNT {} subquery (Neo4j 5+) counts each node's RELATED connections.
Technical Implementation Details
GraphRAG uses the DataShaper library to define workflows of "verbs" (processing actions). Core workflow definitions reside in index/workflows/v1; individual verb implementations live in index/verbs. Data is exchanged between indexing steps as Pandas DataFrames. Graph construction and analysis use networkx, and the graph is serialized to GraphML. Community detection uses the Leiden algorithm from the graspologic library.
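The networkx-to-GraphML round-trip at the heart of the graph stage can be sketched with toy data (the real pipeline attaches far richer attributes, and writes to a file rather than an in-memory buffer):

```python
import io

import networkx as nx

# Build a tiny entity graph: nodes and edges carry a `description`
# property, mirroring how the indexer stores generated descriptions.
G = nx.Graph()
G.add_node("Neo4j", description="Graph database")
G.add_node("GraphRAG", description="Indexing framework")
G.add_edge("GraphRAG", "Neo4j", description="exports its graph to")

# Serialize to GraphML and read it back; attributes survive the trip.
buf = io.BytesIO()
nx.write_graphml(G, buf)
buf.seek(0)
restored = nx.read_graphml(buf)
print(restored.number_of_nodes(), restored.number_of_edges())
```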
Performance Considerations
Generating descriptions for every entity and relationship can require many LLM calls, increasing latency and cost. For testing, a lightweight model such as gpt‑4o‑mini is recommended.
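A back-of-envelope estimate makes the cost concern concrete. Taking the demo dataset's 690 entities and 1,793 relationships, and assuming hypothetical chunk and community counts (200 and 50; neither is given in the source), a naive one-call-per-item pipeline would be:

```python
def estimate_llm_calls(n_chunks: int, n_entities: int,
                       n_relationships: int, n_communities: int) -> int:
    """Rough lower bound on LLM calls for one indexing run:
    one extraction call per chunk, one description summary per
    entity and per relationship, one report per community.
    Real pipelines batch and deduplicate, so treat this as a ceiling
    on the naive approach rather than an exact figure."""
    return n_chunks + n_entities + n_relationships + n_communities

print(estimate_llm_calls(200, 690, 1793, 50))  # → 2733
```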
Reference Notebook
GitHub notebook for Neo4j import: https://github.com/tomasonjo/blogs/blob/master/msft_graphrag/ms_graphrag_import.ipynb
AI Large Model Application Practice
Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.