How to Build and Index Microsoft GraphRAG with Neo4j: A Step‑by‑Step Guide
This article explains the fundamentals of Microsoft GraphRAG and walks through its indexing pipeline: text chunking, entity and relationship extraction, description generation, and community detection. It then shows how to set up the graphrag library, generate adaptive prompts, build the index, and import the resulting graph into Neo4j for visualization and analysis.
Overview
Microsoft GraphRAG is an open‑source framework that extends classic Retrieval‑Augmented Generation (RAG) by first converting raw documents into a knowledge graph and then summarizing the graph back into natural language. This enables query‑focused summarization (QFS) tasks that require high‑level semantic understanding.
Core Principles
The Microsoft paper "From Local to Global: A Graph RAG Approach to Query‑Focused Summarization" identifies the limitation of classic RAG in answering questions such as "What topics do these datasets express?" and proposes a two‑stage pipeline:
Build a knowledge graph from multiple source documents.
Detect communities in the graph (e.g., using the Leiden algorithm) and generate a natural‑language summary for each community. Queries are answered by retrieving community‑level information and aggregating it.
Indexing Stage
The indexing phase consists of five steps:
Text chunking: Split raw documents into smaller blocks, similar to classic RAG.
Entity & relationship extraction: Use a large language model (LLM) to identify entities and relationships within each chunk.
Generate entity/relationship descriptions: Produce concise descriptive texts for each entity and relationship; these are stored as description properties on graph nodes.
Community detection: Apply a community-detection algorithm (Leiden via the Graspologic library) to group related entities.
Community summarization: Use an LLM to create a natural-language report for each community, which serves as the answer source for QFS queries.
Descriptions can be embedded (e.g., description_embedding) to improve vector‑based retrieval.
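Step 1 is the only part of the pipeline that needs no LLM, so it is easy to illustrate. The sketch below is a toy sliding-window chunker: GraphRAG chunks by tokens with configurable size and overlap, whereas this version splits on whitespace, and the size/overlap values are illustrative defaults, not the library's.

```python
def chunk_text(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of whitespace tokens.

    Overlap keeps entities that straddle a chunk boundary visible
    to the extraction step in at least one chunk.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks
```

A 700-token document with size 300 and overlap 50 yields three chunks of 300, 300, and 200 tokens.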
Setting Up the graphrag Library
All commands assume a Python virtual environment with the graphrag package installed (pip install graphrag).
```shell
python -m graphrag.index --init --root ./msgraphrag
```
After initialization the msgraphrag directory contains:
input: place raw .txt or .csv documents here.
.env and settings.yaml: configure LLM access (OpenAI or Azure OpenAI).
prompts: four LLM prompt templates — entity_extraction, summarize_descriptions, community_report, and the optional claim_extraction.
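A minimal settings.yaml might look like the fragment below. The keys follow the graphrag configuration schema, but treat this as a sketch: check the file generated by the init command for the exact names and defaults of your graphrag version.

```yaml
# settings.yaml — illustrative fragment only
llm:
  api_key: ${GRAPHRAG_API_KEY}   # resolved from .env
  type: openai_chat              # or azure_openai_chat
  model: gpt-4o-mini             # lightweight model for testing
chunks:
  size: 300
  overlap: 100
```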
Adaptive prompts can be generated with:
```shell
python -m graphrag.prompt_tune --language Chinese
```
Creating the Index
```shell
python -m graphrag.index --root ./msgraphrag
```
The command writes Parquet files to the output folder. These files are later loaded into memory and vector stores for retrieval.
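The Parquet artifacts can be inspected directly with Pandas before any import. The artifact name create_final_entities below matches the default workflow output in recent graphrag releases, but names can vary by version, so verify against your own output folder (which is nested under a timestamped run directory).

```python
from pathlib import Path

import pandas as pd

# Adjust to your run's timestamped artifacts folder.
OUTPUT_DIR = Path("./msgraphrag/output")

def load_artifact(name: str) -> pd.DataFrame:
    """Load one indexing artifact, e.g. 'create_final_entities'."""
    return pd.read_parquet(OUTPUT_DIR / f"{name}.parquet")

def entity_overview(entities: pd.DataFrame) -> pd.Series:
    """Count extracted entities per type, most common first."""
    return entities["type"].value_counts()

# Guarded so the sketch is importable without having run the indexer.
if OUTPUT_DIR.exists():
    print(entity_overview(load_artifact("create_final_entities")))
```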
Importing into Neo4j
Parquet files can be read with Pandas and imported into Neo4j via Cypher. A community‑provided Jupyter notebook automates this process.
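The core pattern the notebook relies on is a batched UNWIND: send row dictionaries to Neo4j in chunks so each transaction stays small. The Cypher below is a simplified version for entities only (the label and property names mirror the reference notebook's schema; entities_df in the usage comment is a hypothetical DataFrame loaded from the Parquet output).

```python
import pandas as pd

# Merge entity rows in bulk; adjust labels/properties to your schema.
IMPORT_ENTITIES = """
UNWIND $rows AS row
MERGE (e:__Entity__ {name: row.name})
SET e.description = row.description
"""

def batched(df: pd.DataFrame, size: int = 1000):
    """Yield a DataFrame as lists of row dicts, `size` rows at a time."""
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size].to_dict("records")

# Usage with the official driver (pip install neo4j):
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687",
#                               auth=("neo4j", "password"))
# with driver.session() as session:
#     for rows in batched(entities_df):
#         session.run(IMPORT_ENTITIES, rows=rows)
```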
After import the Neo4j graph contains node types Entity, Community, Chunk, Document and relationship types RELATED, PART_OF, HAS_ENTITY, IN_COMMUNITY. In the demo dataset 690 nodes and 1,793 relationships were created.
Example Cypher query to list the top‑degree entities:
```cypher
MATCH (n:__Entity__)
RETURN n.name AS name,
       COUNT { (n)-[:RELATED]-() } AS degree
ORDER BY degree DESC
LIMIT 10
```
Note that Cypher does not allow count() over a pattern; the COUNT {} subquery (Neo4j 5+) counts each node's RELATED connections.
Technical Implementation Details
GraphRAG uses the DataShaper library to define workflows of "verbs" (processing actions). Core workflow definitions reside in index/workflows/v1; individual verb implementations live in index/verbs. Data is exchanged between indexing steps as Pandas DataFrames. Graph construction and analysis use networkx, and the graph is serialized to GraphML. Community detection uses the Leiden algorithm from the graspologic library.
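The networkx-to-GraphML round-trip at the heart of the graph stage can be sketched with toy data (the real pipeline attaches far richer attributes, and writes to a file rather than an in-memory buffer):

```python
import io

import networkx as nx

# Build a tiny entity graph: nodes and edges carry a `description`
# property, mirroring how the indexer stores generated descriptions.
G = nx.Graph()
G.add_node("Neo4j", description="Graph database")
G.add_node("GraphRAG", description="Indexing framework")
G.add_edge("GraphRAG", "Neo4j", description="exports its graph to")

# Serialize to GraphML and read it back; attributes survive the trip.
buf = io.BytesIO()
nx.write_graphml(G, buf)
buf.seek(0)
restored = nx.read_graphml(buf)
print(restored.number_of_nodes(), restored.number_of_edges())
```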
Performance Considerations
Generating descriptions for every entity and relationship can require many LLM calls, increasing latency and cost. For testing, a lightweight model such as gpt‑4o‑mini is recommended.
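A back-of-envelope estimate makes the cost concern concrete. Taking the demo dataset's 690 entities and 1,793 relationships, and assuming hypothetical chunk and community counts (200 and 50; neither is given in the source), a naive one-call-per-item pipeline would be:

```python
def estimate_llm_calls(n_chunks: int, n_entities: int,
                       n_relationships: int, n_communities: int) -> int:
    """Rough lower bound on LLM calls for one indexing run:
    one extraction call per chunk, one description summary per
    entity and per relationship, one report per community.
    Real pipelines batch and deduplicate, so treat this as a ceiling
    on the naive approach rather than an exact figure."""
    return n_chunks + n_entities + n_relationships + n_communities

print(estimate_llm_calls(200, 690, 1793, 50))  # → 2733
```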
Reference Notebook
GitHub notebook for Neo4j import: https://github.com/tomasonjo/blogs/blob/master/msft_graphrag/ms_graphrag_import.ipynb
AI Large Model Application Practice
Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.