How Microsoft’s PIKE‑RAG Builds Knowledge‑Driven AI Across Four Stages
The article explains Microsoft’s open‑source PIKE‑RAG system, detailing its four progressive stages—from knowledge‑base construction to creative multi‑agent reasoning—while describing the underlying modules, chunking strategies, multi‑granularity retrieval, and code snippets that enable specialized domain understanding and inference.
This piece continues the discussion from the previous article on Microsoft’s open‑source PIKE‑RAG, outlining a hierarchical strategy: a foundational knowledge base (L0) plus four progressively more capable question‑handling stages (L1–L4) for building a Retrieval‑Augmented Generation system that steadily deepens domain knowledge understanding and reasoning.
L0: Knowledge Base Construction
The foundation focuses on creating a comprehensive, reliable knowledge base by converting domain documents into machine‑readable formats and organizing them into a heterogeneous graph that supports advanced reasoning and retrieval.
1. Document Parsing
Multiple data sources are parsed using tools like LangChain, OCR APIs, and table extraction utilities. Complex tables and figures are retained as multimodal elements and described with visual‑language models (VLMs) to preserve document integrity and improve search effectiveness.
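The real parsers live in the PIKE‑RAG repository; as a minimal sketch of the routing idea, a dispatcher can map each file type to a format‑specific handler. The handler names and return shape here are hypothetical stubs standing in for LangChain loaders, OCR, and table extraction:

```python
from pathlib import Path

# Hypothetical parser stubs: the real system delegates to LangChain loaders,
# OCR APIs, and table-extraction utilities; each stub just tags the source type.
def parse_text(path):
    return {"source": str(path), "kind": "text"}

def parse_pdf(path):
    return {"source": str(path), "kind": "pdf"}    # would invoke OCR / layout parsing

def parse_table(path):
    return {"source": str(path), "kind": "table"}  # would invoke table extraction

PARSERS = {".txt": parse_text, ".md": parse_text, ".pdf": parse_pdf, ".csv": parse_table}

def parse_document(path):
    """Route a file to a format-specific parser based on its extension."""
    handler = PARSERS.get(Path(path).suffix.lower())
    if handler is None:
        raise ValueError(f"unsupported format: {path}")
    return handler(path)
```

Keeping the dispatch table separate from the handlers makes it easy to register a new format (say, a VLM‑backed image describer) without touching existing parsers.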
2. Knowledge Organization
The knowledge base adopts a multi‑layer heterogeneous graph comprising an Information Resource Layer, a Corpus Layer, and a Distilled Knowledge Layer, each offering different granularity and abstraction.
Information Resource Layer: records raw data sources as nodes and edges, enabling cross‑validation and reasoning.
Corpus Layer: splits documents into sections and blocks while preserving original hierarchy; tables and figures are summarized by large language models (LLMs) and added as nodes.
Distilled Knowledge Layer: extracts entities and relations to form knowledge graphs, atomic knowledge, and tabular knowledge for deep inference.
Knowledge graph: LLMs extract entities and relations to form "node–edge–node" triples that build the graph.
Atomic knowledge: text is split into atomic statements, which are combined with node relations to generate atomic knowledge.
Tabular knowledge: entity pairs with specified types and relations are extracted and combined to build tabular knowledge.
L1: Fact‑Based Question Core
Building on L0, L1 adds knowledge retrieval and organization to improve retrieval‑augmented generation. The main challenges are semantic alignment and accurate chunking of specialized terminology.
1. Enhanced Chunking
Documents are split into smaller blocks using fixed‑size, semantic, or hybrid chunking. Proper chunking serves two purposes: (1) creates vectorized units for retrieval; (2) provides a basis for downstream knowledge extraction and summarization. Incorrect chunking can lose context, especially in legal or regulatory texts.
In the first pass, a forward summary is generated for each initial chunk and supplied as context to the chunks that follow.
Each chunk then receives its own independent summary; the process repeats until the entire document is covered, with chunk sizes adjusted dynamically based on content.
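The forward‑summary loop can be sketched as follows. This is a toy version under stated assumptions: the pre‑split is a plain fixed‑size cut rather than a semantic one, and `summarize` is a placeholder for the LLM summarization call:

```python
def split_rough(text, size=200):
    """Fixed-size pre-split; the real system adjusts boundaries semantically."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarize(chunk, context):
    """Stand-in for an LLM summarization call (hypothetical)."""
    return chunk[:40]  # placeholder: leading text as a crude summary

def chunk_with_forward_summaries(text):
    """Carry each chunk's summary forward as context for the next chunk."""
    chunks, context = [], ""
    for piece in split_rough(text):
        summary = summarize(piece, context)  # summary conditioned on prior context
        chunks.append({"text": piece, "summary": summary, "context": context})
        context = summary  # forward summary feeds the next chunk
    return chunks
```

The key property is the carried `context`: even with naive splitting, each chunk's summary is produced with awareness of what came before, which is what protects cross‑boundary references in legal or regulatory text.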
2. Automatic Tagging
In specialized domains, the corpus uses technical language while user queries are everyday phrasing. An automatic tagging module extracts comprehensive domain‑specific tags or builds tag‑mapping rules using LLMs, narrowing the gap between queries and documents and improving retrieval accuracy.
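One simple way to picture the tagging module is a query‑side expansion over an LLM‑built tag map. The map entries below are invented for illustration; in PIKE‑RAG the tags and mapping rules come from LLM extraction over the corpus:

```python
# Hypothetical tag map: everyday phrasing -> domain-specific terminology.
# In PIKE-RAG an LLM proposes these tags and mapping rules from the corpus.
TAG_MAP = {
    "heart attack": ["myocardial infarction", "MI"],
    "high blood pressure": ["hypertension"],
}

def expand_query(query):
    """Augment an everyday-language query with matching domain tags."""
    terms = [query]
    for phrase, tags in TAG_MAP.items():
        if phrase in query.lower():
            terms.extend(tags)
    return terms
```

Retrieval then runs over the expanded term list, so a query phrased in lay language can still hit chunks written in technical vocabulary.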
3. Multi‑Granularity Retrieval
L1 supports cross‑heterogeneous‑graph retrieval at multiple layers, allowing queries to target the whole document or specific blocks. Similarity scores are computed between the query and nodes, with information propagated and aggregated across layers to balance breadth and depth.
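A minimal sketch of blending coarse (document‑level) and fine (chunk‑level) scores follows. The token‑overlap similarity and the 50/50 blend weight are placeholders; a real deployment would use embedding cosine similarity and tuned aggregation across the graph layers:

```python
def overlap_score(query, text):
    """Toy similarity via token overlap; real systems use embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve(query, docs, top_k=2):
    """Score whole documents and their chunks, then blend the two granularities."""
    scored = []
    for doc in docs:
        doc_score = overlap_score(query, doc["summary"])                  # coarse
        chunk_scores = [overlap_score(query, c) for c in doc["chunks"]]   # fine
        best_chunk = max(chunk_scores) if chunk_scores else 0.0
        scored.append((0.5 * doc_score + 0.5 * best_chunk, doc["id"]))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:top_k]]
```

The document‑level score keeps retrieval from fixating on a single lucky chunk, while the chunk‑level score preserves the precision needed for fact lookup; propagating scores across layers is what balances breadth against depth.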
L2: Chain‑Reasoning Question Core
L2 focuses on efficient multi‑source retrieval and complex reasoning by introducing a knowledge extraction module and a task‑decomposition coordination module.
Knowledge atomization: LLMs generate question tags for each chunk, forming a hierarchical knowledge base that supports fine‑grained queries.
Knowledge‑aware task decomposition
Knowledge‑aware task decomposer training
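The interplay of atomization and decomposition can be sketched as follows. Everything here is a hypothetical stand‑in: the atomic‑question index would be populated by LLM‑generated question tags per chunk, and `decompose` would be the trained decomposer rather than a fixed list:

```python
# Toy atomic-question index: each chunk is tagged with questions it can answer.
# In PIKE-RAG these question tags are generated per chunk by an LLM.
ATOMIC_INDEX = {
    "When was drug X approved?": "chunk_01",
    "Who manufactures drug X?": "chunk_02",
    "What condition does drug X treat?": "chunk_03",
}

def decompose(question):
    """Stand-in for the trained decomposer; returns a fixed chain for illustration."""
    return ["Who manufactures drug X?", "When was drug X approved?"]

def answer_stepwise(question):
    """Resolve each sub-question against the atomic index, in chain order."""
    steps = []
    for sub in decompose(question):
        steps.append((sub, ATOMIC_INDEX.get(sub, "no source found")))
    return steps
```

Because sub‑questions are matched against atomic knowledge the base actually contains, the chain of reasoning stays grounded in retrievable evidence instead of decomposing blindly.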
L3: Predictive Question Core
L3 aims to boost predictive capabilities. Structured and summarization sub‑modules transform raw knowledge into clear formats (e.g., drug name and approval date in FDA scenarios), enabling the system to forecast outcomes such as future drug approvals.
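As a toy illustration of why structuring matters for prediction, suppose the summarization sub‑module has already distilled raw FDA text into (drug, approval year) records; the records and the linear‑trend rule below are invented for the sketch:

```python
from collections import Counter

# Hypothetical structured records distilled from raw text: (drug name, approval year).
RECORDS = [("drug_a", 2021), ("drug_b", 2021), ("drug_c", 2022),
           ("drug_d", 2022), ("drug_e", 2022)]

def approvals_per_year(records):
    """Summarize structured knowledge into a clear year -> count mapping."""
    return dict(sorted(Counter(year for _, year in records).items()))

def naive_forecast(records):
    """Toy prediction: project next year's count from the last two years' trend."""
    counts = list(approvals_per_year(records).values())
    if len(counts) < 2:
        return counts[-1] if counts else 0
    return counts[-1] + (counts[-1] - counts[-2])
```

The point is not the forecasting rule, which here is deliberately naive, but that forecasting only becomes possible once raw knowledge has been reshaped into a clean, queryable format.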
L4: Creative Question Core
L4 introduces a multi‑agent mechanism that enables diverse perspectives and creative reasoning. Multiple specialized agents collaboratively analyze and synthesize knowledge, producing comprehensive solutions for open‑ended problems.
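A bare‑bones shape for the multi‑agent mechanism might look like the following. The three agent roles and the string‑join synthesis are invented placeholders; in a real system each agent would wrap its own LLM prompt and the synthesis would itself be an LLM step:

```python
# Hypothetical agent roles; each would wrap its own LLM prompt in a real system.
def optimist(problem):
    return f"opportunity in: {problem}"

def skeptic(problem):
    return f"risks in: {problem}"

def planner(problem):
    return f"steps to address: {problem}"

AGENTS = [optimist, skeptic, planner]

def solve_creatively(problem):
    """Collect each agent's perspective, then synthesize a combined answer."""
    views = [agent(problem) for agent in AGENTS]
    return {"problem": problem, "perspectives": views,
            "synthesis": " | ".join(views)}  # stand-in for an LLM synthesis step
```

Running deliberately different roles over the same knowledge is what produces the diversity of viewpoints that open‑ended questions need; the synthesis step then reconciles them into one solution.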
The article is a translation and summary of the paper “PIKE‑RAG: sPecIalized KnowledgE and Rationale Augmented Generation”. For more details, see the open‑source repository and the paper linked below.
GitHub link: https://github.com/microsoft/PIKE-RAG
Paper link: https://arxiv.org/abs/2501.11551
Ma Wei Says
Follow me! I discuss software architecture and development, AIGC, and AI Agents, and sometimes share insights on life as an IT professional.