Stop Fragmenting Docs: How Tree‑Based PageIndex Improves RAG Accuracy and Efficiency
The article explains why breaking documents into countless semantic fragments harms retrieval‑augmented generation, introduces PageIndex’s tree‑structured, inference‑driven approach as a superior alternative, and provides detailed setup, usage, and integration instructions for both local and production environments.
When building Retrieval‑Augmented Generation (RAG) applications, converting documents into a mass of "semantic fragments"—splitting, embedding, and relying on ANN search—often degrades quality and inflates costs, especially with large corpora where token usage can skyrocket while the model hallucinates near‑correct answers.
PageIndex offers a different solution: representing each document as a hierarchical tree, essentially an LLM‑optimized table of contents. The model traverses the tree step‑by‑step (e.g., "we are in 'Risk Factors', then 'Liquidity', then 'Contract Default'…") and extracts context only from the branches it truly needs, eliminating embedding drift, irrelevant chunk matches, and token waste.
Mafin 2.5, a reasoning‑based RAG system powered by PageIndex, achieved 98.7% accuracy on the FinanceBench benchmark, far surpassing traditional vector‑based RAG systems.
How PageIndex Works
(1) Tree Generation (Indexing): PDFs are parsed into a hierarchical node structure (chapters/sub‑chapters) with metadata such as title, node_id, page_index, and text.
(2) Inference‑Based Retrieval: The LLM decides which node(s) to open next (tree search) and can optionally return a reasoning trace plus a list of node_ids to use.
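The node structure described above can be sketched as nested dictionaries. This is an illustrative example only: the field names (title, node_id, page_index, text) come from the article, but the exact schema PageIndex emits, including the name of the children key (here assumed to be "nodes"), may differ.

```python
# Illustrative tree for a financial filing; schema is an assumption.
tree = {
    "title": "Annual Report 2023", "node_id": "0000", "page_index": 1,
    "text": "", "nodes": [
        {"title": "Risk Factors", "node_id": "0001", "page_index": 12,
         "text": "Overview of material risks...", "nodes": [
            {"title": "Liquidity", "node_id": "0002", "page_index": 15,
             "text": "Liquidity risk discussion...", "nodes": []},
        ]},
    ],
}

def find_node(node, node_id):
    """Depth-first lookup of a node by its node_id."""
    if node["node_id"] == node_id:
        return node
    for child in node.get("nodes", []):
        found = find_node(child, node_id)
        if found:
            return found
    return None

print(find_node(tree, "0002")["title"])  # Liquidity
```

Because each node carries its page range and title, the model can reason about where it is in the document ("Risk Factors > Liquidity") instead of matching free-floating chunks.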
Benefits include precise retrieval without brute‑force embedding, clearer user‑facing explanations, lower latency and token consumption, and structure‑aware faithfulness, which is crucial for complex financial filings where a passage's location conveys meaning.
Repository Layout
run_pageindex.py – entry script to index a PDF locally.
pageindex/ – core library.
cookbook/ – example notebooks and demos.
tutorials/ – guided examples.
tests/ – sample PDFs and expected outputs for regression testing.
Getting Started
Create a virtual environment and install dependencies:
python3 -m venv venv
source venv/bin/activate
pip install pageindex

Add a .env file at the repository root containing your OpenAI key:

OPENAI_API_KEY=your_key_here

Run the indexing script:

python run_pageindex.py --pdf_path "path/to/your.pdf"

Optional arguments let you choose the model and language, e.g.:

python run_pageindex.py --pdf_path "path/to/your.pdf" --model "gpt-4o" --language "en"

Generate a tree from a markdown file with the --md_path flag:

python run_pageindex.py --md_path "path/to/your.md"

Index + Retrieval Workflow
(1) Index / Tree Generation – Input: PDF (currently only PDF). Output: hierarchical node tree.
(2) Retrieval (Tree Search) – Provide a query plus the tree; the model returns a JSON containing thinking and node_list.
(3) Answer Synthesis – Collect the selected node texts (or page‑level content) and feed them into the final answer prompt.
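The three steps above can be sketched end to end. The retrieval JSON shape (thinking plus node_list) follows the article; the tree contents and the collect_texts helper are illustrative assumptions, not the library's API.

```python
# Step (1) output: a hierarchical node tree (abbreviated, illustrative).
tree = {
    "title": "10-K Filing", "node_id": "0000", "text": "", "nodes": [
        {"title": "Risk Factors", "node_id": "0001",
         "text": "Risk factor overview.", "nodes": [
            {"title": "Liquidity", "node_id": "0002",
             "text": "Liquidity risk details.", "nodes": []},
        ]},
    ],
}

# Step (2) output: the model's reasoning trace and selected node ids.
retrieval = {
    "thinking": "The question concerns liquidity, so open Risk Factors > Liquidity.",
    "node_list": ["0002"],
}

# Step (3): gather the selected node texts into context for the answer prompt.
def collect_texts(node, wanted, out):
    if node["node_id"] in wanted:
        out.append(f"[{node['title']}]\n{node['text']}")
    for child in node.get("nodes", []):
        collect_texts(child, wanted, out)

chunks = []
collect_texts(tree, set(retrieval["node_list"]), chunks)
context = "\n\n".join(chunks)
print(context)
```

Only the branches the model explicitly selected reach the answer prompt, which is where the latency and token savings come from.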
Production Integration (SDK + HTTP API)
Python SDK (hosted service):
from pageindex import PageIndex
pi_client = PageIndex(api_key="YOUR_API_KEY")
doc_id = pi_client.upload_document("path/to/report.pdf")
# Check status
pi_client.get_document(doc_id)
# Retrieve tree
tree = pi_client.get_tree(doc_id)
# OCR output
ocr = pi_client.get_ocr(doc_id, format="page|node|raw")

REST Endpoints (hosted service):
POST https://api.pageindex.ai/doc/ – upload PDF, returns doc_id.
GET https://api.pageindex.ai/doc/{doc_id}/?type=tree – retrieve processing status and tree (optionally with summary).
GET https://api.pageindex.ai/doc/{doc_id}/?type=ocr&format=page|node|raw – fetch OCR results.
POST https://api.pageindex.ai/chat/completions – send messages with optional doc_id and stream parameters.
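A minimal sketch of calling the chat endpoint from the standard library. The payload shape (a messages array plus optional doc_id and stream) follows the endpoint description above; the auth header name and any other fields are assumptions, so check the official API docs before use.

```python
import json
import urllib.request

def build_chat_payload(question, doc_id=None, stream=False):
    """Assemble the request body: messages plus optional doc_id and stream."""
    payload = {"messages": [{"role": "user", "content": question}],
               "stream": stream}
    if doc_id is not None:
        payload["doc_id"] = doc_id
    return payload

payload = build_chat_payload("What are the liquidity risks?", doc_id="doc_123")

req = urllib.request.Request(
    "https://api.pageindex.ai/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json",
             "api_key": "YOUR_API_KEY"},  # header name is an assumption
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually send the request
print(json.dumps(payload, indent=2))
```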
Multi‑Document Search Modes
PageIndex primarily targets single‑document, inference‑based RAG, but also supports three multi‑document workflows:
Metadata search – when documents are distinguishable by metadata.
Semantic search – when topics differ across documents.
Description search – lightweight approach for a small set of documents.
Typical pipeline: select candidate documents (metadata/semantic/description), run tree search on the chosen set, then synthesize the answer.
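The candidate-selection step can be sketched for the description-search mode. Word-overlap scoring here is a toy stand-in for the model-driven selection PageIndex would perform; the document names and descriptions are invented for illustration.

```python
# Per-document descriptions (illustrative).
docs = {
    "10k_2023": "Annual report with risk factors and financial statements.",
    "esg_2023": "Sustainability and ESG disclosures.",
    "proxy_2023": "Proxy statement covering executive compensation.",
}

def select_candidates(query, descriptions, top_k=2):
    """Rank documents by word overlap between query and description."""
    q_words = set(query.lower().split())
    ranked = sorted(
        descriptions,
        key=lambda d: len(q_words & set(descriptions[d].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

candidates = select_candidates(
    "What risk factors affect the financial statements?", docs)
print(candidates)
# Tree search then runs only on the candidate documents,
# followed by answer synthesis over the selected nodes.
```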
Further Resources
Cookbooks – hands‑on examples and advanced use cases.
Tutorials – practical guides covering Document Search and Tree Search.
Blog – technical articles, research insights, and product updates.
MCP setup & API docs – integration details and configuration options.
