How Microsoft’s PIKE‑RAG Builds Knowledge‑Driven AI Across Four Stages
The article explains Microsoft’s open‑source PIKE‑RAG system, detailing its four progressive stages—from knowledge‑base construction to creative multi‑agent reasoning—while describing the underlying modules, chunking strategies, multi‑granularity retrieval, and code snippets that enable specialized domain understanding and inference.
This piece continues the discussion from the previous article on Microsoft’s open‑source PIKE‑RAG, outlining a hierarchical strategy: a foundational knowledge base (L0) plus four progressively more capable question‑handling stages (L1–L4) for building a Retrieval‑Augmented Generation system that steadily deepens domain knowledge understanding and reasoning.
L0: Knowledge Base Construction
The foundation focuses on creating a comprehensive, reliable knowledge base by converting domain documents into machine‑readable formats and organizing them into a heterogeneous graph that supports advanced reasoning and retrieval.
1. Document Parsing
Multiple data sources are parsed using tools like LangChain, OCR APIs, and table extraction utilities. Complex tables and figures are retained as multimodal elements and described with visual‑language models (VLMs) to preserve document integrity and improve search effectiveness.
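The real parsers live in the PIKE‑RAG repository; as a minimal sketch of the routing idea, a dispatcher can map each file type to a format‑specific handler. The handler names and return shape here are hypothetical stubs standing in for LangChain loaders, OCR, and table extraction:

```python
from pathlib import Path

# Hypothetical parser stubs: the real system delegates to LangChain loaders,
# OCR APIs, and table-extraction utilities; each stub just tags the source type.
def parse_text(path):
    return {"source": str(path), "kind": "text"}

def parse_pdf(path):
    return {"source": str(path), "kind": "pdf"}    # would invoke OCR / layout parsing

def parse_table(path):
    return {"source": str(path), "kind": "table"}  # would invoke table extraction

PARSERS = {".txt": parse_text, ".md": parse_text, ".pdf": parse_pdf, ".csv": parse_table}

def parse_document(path):
    """Route a file to a format-specific parser based on its extension."""
    handler = PARSERS.get(Path(path).suffix.lower())
    if handler is None:
        raise ValueError(f"unsupported format: {path}")
    return handler(path)
```

Keeping the dispatch table separate from the handlers makes it easy to register a new format (say, a VLM‑backed image describer) without touching existing parsers.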
2. Knowledge Organization
The knowledge base adopts a multi‑layer heterogeneous graph comprising an Information Resource Layer, a Corpus Layer, and a Distilled Knowledge Layer, each offering different granularity and abstraction.
Information Resource Layer: records raw data sources as nodes and edges, enabling cross‑validation and reasoning.
Corpus Layer: splits documents into sections and blocks while preserving original hierarchy; tables and figures are summarized by large language models (LLMs) and added as nodes.
Distilled Knowledge Layer: extracts entities and relations to form knowledge graphs, atomic knowledge, and tabular knowledge for deep inference.
Knowledge graph: LLMs extract entities and relations to form "node–edge–node" triples that build the graph.
Atomic knowledge: text is split into atomic statements, which are combined with node relations to generate atomic knowledge.
Tabular knowledge: entity pairs with specified types and relations are extracted and combined to build tabular knowledge.
L1: Fact‑Based Question Core
Building on L0, L1 adds knowledge retrieval and organization to improve retrieval‑augmented generation. The main challenges are semantic alignment and accurate chunking of specialized terminology.
1. Enhanced Chunking
Documents are split into smaller blocks using fixed‑size, semantic, or hybrid chunking. Proper chunking serves two purposes: (1) creates vectorized units for retrieval; (2) provides a basis for downstream knowledge extraction and summarization. Incorrect chunking can lose context, especially in legal or regulatory texts.
In the first pass, a forward summary is generated for each initial chunk and supplied as context to the chunks that follow.
Each chunk then receives its own independent summary; the process repeats until the entire document is covered, with chunk sizes adjusted dynamically based on content.
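The forward‑summary loop can be sketched as follows. This is a toy version under stated assumptions: the pre‑split is a plain fixed‑size cut rather than a semantic one, and `summarize` is a placeholder for the LLM summarization call:

```python
def split_rough(text, size=200):
    """Fixed-size pre-split; the real system adjusts boundaries semantically."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarize(chunk, context):
    """Stand-in for an LLM summarization call (hypothetical)."""
    return chunk[:40]  # placeholder: leading text as a crude summary

def chunk_with_forward_summaries(text):
    """Carry each chunk's summary forward as context for the next chunk."""
    chunks, context = [], ""
    for piece in split_rough(text):
        summary = summarize(piece, context)  # summary conditioned on prior context
        chunks.append({"text": piece, "summary": summary, "context": context})
        context = summary  # forward summary feeds the next chunk
    return chunks
```

The key property is the carried `context`: even with naive splitting, each chunk's summary is produced with awareness of what came before, which is what protects cross‑boundary references in legal or regulatory text.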
2. Automatic Tagging
In specialized domains, the corpus uses technical language while user queries are everyday phrasing. An automatic tagging module extracts comprehensive domain‑specific tags or builds tag‑mapping rules using LLMs, narrowing the gap between queries and documents and improving retrieval accuracy.
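One simple way to picture the tagging module is a query‑side expansion over an LLM‑built tag map. The map entries below are invented for illustration; in PIKE‑RAG the tags and mapping rules come from LLM extraction over the corpus:

```python
# Hypothetical tag map: everyday phrasing -> domain-specific terminology.
# In PIKE-RAG an LLM proposes these tags and mapping rules from the corpus.
TAG_MAP = {
    "heart attack": ["myocardial infarction", "MI"],
    "high blood pressure": ["hypertension"],
}

def expand_query(query):
    """Augment an everyday-language query with matching domain tags."""
    terms = [query]
    for phrase, tags in TAG_MAP.items():
        if phrase in query.lower():
            terms.extend(tags)
    return terms
```

Retrieval then runs over the expanded term list, so a query phrased in lay language can still hit chunks written in technical vocabulary.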
3. Multi‑Granularity Retrieval
L1 supports cross‑heterogeneous‑graph retrieval at multiple layers, allowing queries to target the whole document or specific blocks. Similarity scores are computed between the query and nodes, with information propagated and aggregated across layers to balance breadth and depth.
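A minimal sketch of blending coarse (document‑level) and fine (chunk‑level) scores follows. The token‑overlap similarity and the 50/50 blend weight are placeholders; a real deployment would use embedding cosine similarity and tuned aggregation across the graph layers:

```python
def overlap_score(query, text):
    """Toy similarity via token overlap; real systems use embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve(query, docs, top_k=2):
    """Score whole documents and their chunks, then blend the two granularities."""
    scored = []
    for doc in docs:
        doc_score = overlap_score(query, doc["summary"])                  # coarse
        chunk_scores = [overlap_score(query, c) for c in doc["chunks"]]   # fine
        best_chunk = max(chunk_scores) if chunk_scores else 0.0
        scored.append((0.5 * doc_score + 0.5 * best_chunk, doc["id"]))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:top_k]]
```

The document‑level score keeps retrieval from fixating on a single lucky chunk, while the chunk‑level score preserves the precision needed for fact lookup; propagating scores across layers is what balances breadth against depth.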
L2: Chain‑Reasoning Question Core
L2 focuses on efficient multi‑source retrieval and complex reasoning by introducing a knowledge extraction module and a task‑decomposition coordination module.
Knowledge atomization: LLMs generate question tags for each chunk, forming a hierarchical knowledge base that supports fine‑grained queries.
Knowledge‑aware task decomposition
Knowledge‑aware task decomposer training
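The interplay of atomization and decomposition can be sketched as follows. Everything here is a hypothetical stand‑in: the atomic‑question index would be populated by LLM‑generated question tags per chunk, and `decompose` would be the trained decomposer rather than a fixed list:

```python
# Toy atomic-question index: each chunk is tagged with questions it can answer.
# In PIKE-RAG these question tags are generated per chunk by an LLM.
ATOMIC_INDEX = {
    "When was drug X approved?": "chunk_01",
    "Who manufactures drug X?": "chunk_02",
    "What condition does drug X treat?": "chunk_03",
}

def decompose(question):
    """Stand-in for the trained decomposer; returns a fixed chain for illustration."""
    return ["Who manufactures drug X?", "When was drug X approved?"]

def answer_stepwise(question):
    """Resolve each sub-question against the atomic index, in chain order."""
    steps = []
    for sub in decompose(question):
        steps.append((sub, ATOMIC_INDEX.get(sub, "no source found")))
    return steps
```

Because sub‑questions are matched against atomic knowledge the base actually contains, the chain of reasoning stays grounded in retrievable evidence instead of decomposing blindly.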
L3: Predictive Question Core
L3 aims to boost predictive capabilities. Structured and summarization sub‑modules transform raw knowledge into clear formats (e.g., drug name and approval date in FDA scenarios), enabling the system to forecast outcomes such as future drug approvals.
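As a toy illustration of why structuring matters for prediction, suppose the summarization sub‑module has already distilled raw FDA text into (drug, approval year) records; the records and the linear‑trend rule below are invented for the sketch:

```python
from collections import Counter

# Hypothetical structured records distilled from raw text: (drug name, approval year).
RECORDS = [("drug_a", 2021), ("drug_b", 2021), ("drug_c", 2022),
           ("drug_d", 2022), ("drug_e", 2022)]

def approvals_per_year(records):
    """Summarize structured knowledge into a clear year -> count mapping."""
    return dict(sorted(Counter(year for _, year in records).items()))

def naive_forecast(records):
    """Toy prediction: project next year's count from the last two years' trend."""
    counts = list(approvals_per_year(records).values())
    if len(counts) < 2:
        return counts[-1] if counts else 0
    return counts[-1] + (counts[-1] - counts[-2])
```

The point is not the forecasting rule, which here is deliberately naive, but that forecasting only becomes possible once raw knowledge has been reshaped into a clean, queryable format.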
L4: Creative Question Core
L4 introduces a multi‑agent mechanism that enables diverse perspectives and creative reasoning. Multiple specialized agents collaboratively analyze and synthesize knowledge, producing comprehensive solutions for open‑ended problems.
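A bare‑bones shape for the multi‑agent mechanism might look like the following. The three agent roles and the string‑join synthesis are invented placeholders; in a real system each agent would wrap its own LLM prompt and the synthesis would itself be an LLM step:

```python
# Hypothetical agent roles; each would wrap its own LLM prompt in a real system.
def optimist(problem):
    return f"opportunity in: {problem}"

def skeptic(problem):
    return f"risks in: {problem}"

def planner(problem):
    return f"steps to address: {problem}"

AGENTS = [optimist, skeptic, planner]

def solve_creatively(problem):
    """Collect each agent's perspective, then synthesize a combined answer."""
    views = [agent(problem) for agent in AGENTS]
    return {"problem": problem, "perspectives": views,
            "synthesis": " | ".join(views)}  # stand-in for an LLM synthesis step
```

Running deliberately different roles over the same knowledge is what produces the diversity of viewpoints that open‑ended questions need; the synthesis step then reconciles them into one solution.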
The article is a translation and summary of the paper “PIKE‑RAG: sPecIalized KnowledgE and Rationale Augmented Generation”. For more details, see the open‑source repository and the paper linked below.
GitHub link: https://github.com/microsoft/PIKE-RAG
Paper link: https://arxiv.org/abs/2501.11551
Ma Wei Says
Follow me! I discuss software architecture and development, AIGC, and AI Agents, and sometimes share insights on life as an IT professional.