How BookRAG Redefines Long-Document Retrieval with Hierarchical Indexing

BookRAG introduces a hierarchical, structure‑aware indexing method that combines tree‑based document representation with graph‑based entity linking and an agent‑driven retrieval pipeline, achieving up to 71.2% recall improvement on multimodal long‑document benchmarks while cutting token usage and latency dramatically.

PaperAgent
https://arxiv.org/pdf/2512.03413
https://github.com/sam234990/BookRAG

Problem Statement

Traditional Retrieval‑Augmented Generation (RAG) applied to PDFs suffers from two major limitations:

PDFs are treated as plain text, discarding hierarchical semantics such as chapters, sections, figures, and nested tables.

The retrieval pipeline is static and cannot simultaneously satisfy definition‑style queries and cross‑chapter reasoning, leading to over‑ or under‑retrieval.

Figure 1: Comparison of three technical routes

Figure 1: Pure‑text RAG (a) and layout‑segmented RAG (b) cannot preserve both structural dependencies and cross‑modal relations, whereas BookRAG (c) natively perceives hierarchy.

State‑of‑the‑Art Results

Across three multimodal long‑document benchmarks, BookRAG surpasses previous SOTA methods, with up to a 71.2% increase in retrieval recall on M3DocVQA.

Performance comparison

Technical Solution: BookIndex × Agent‑Based Retrieval

BookIndex – Splitting a Book into a Tree and a Graph

Two‑stage BookIndex construction process

First parse the layout into a tree, then extract entities to build a graph, and finally bind them bidirectionally with GT‑Link.

Tree T : Preserves document hierarchy (chapter → section → paragraph → figure/table). Built using layout parsing followed by LLM‑based hierarchical correction.

Graph G : Captures entity relations and supports multi‑hop reasoning. Constructed with gradient‑based entity disambiguation (Algorithm 1).

GT‑Link M : Maps entities back to tree nodes, enabling dual “structure‑semantic” localization. Maintained incrementally with merge‑on‑update semantics.
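The three components above can be sketched as plain data structures. This is a minimal, hypothetical rendering (the class and field names are assumptions, not the paper's code); the key point is that GT‑Link is a map from entities to tree nodes, merged on update:

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    """A node of tree T, preserving document hierarchy."""
    node_id: str
    kind: str  # e.g. "chapter" | "section" | "paragraph" | "figure" | "table"
    text: str = ""
    children: list["TreeNode"] = field(default_factory=list)

@dataclass
class Entity:
    """A node of graph G; relations map a relation label to entity names."""
    name: str
    relations: dict[str, set[str]] = field(default_factory=dict)

@dataclass
class BookIndex:
    root: TreeNode                # tree T: document hierarchy
    graph: dict[str, Entity]      # graph G: entity relations
    gt_link: dict[str, set[str]]  # GT-Link M: entity name -> tree node ids

    def link(self, entity_name: str, node_id: str) -> None:
        # Merge-on-update: relinking an entity unions its node ids
        # instead of overwriting them.
        self.gt_link.setdefault(entity_name, set()).add(node_id)
```

With this shape, a query can hop from an entity in G to every place it appears in T (and back), which is the dual "structure‑semantic" localization described above.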

Gradient‑Based Entity Disambiguation

For each entity, retrieve candidate matches by vector similarity and sort them by rerank score; a sharp drop (a "cliff") in the score curve automatically separates same‑class from different‑class entities.

Complexity is O(n) for full‑graph disambiguation, avoiding the traditional O(n²) pairwise comparison.
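A minimal sketch of the cliff heuristic, assuming the detail that the split point is the steepest single drop in the sorted score curve (Algorithm 1 in the paper is the authoritative version); one linear pass over the scores, hence O(n) per entity rather than O(n²) pairwise comparisons:

```python
def split_at_cliff(scores: list[float]) -> int:
    """Return i such that scores[:i] are treated as same-class candidates.

    `scores` must be sorted in descending order. Runs in O(n): compute the
    drop between each consecutive pair and cut at the largest one.
    """
    if len(scores) < 2:
        return len(scores)
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    cliff = max(range(len(drops)), key=drops.__getitem__)
    return cliff + 1
```

For example, `split_at_cliff([0.95, 0.93, 0.91, 0.40, 0.38])` returns 3: the top three candidates sit above the cliff and are merged as one entity.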

Agent‑Based Retrieval – Human‑like Book Flipping

Overall retrieval workflow

Planning → scent/filter foraging → answer synthesis, with Pareto‑frontier (Skyline) filtering that retains only non‑dominated multi‑dimensional solutions.

The system is inspired by Information Foraging Theory (IFT). Queries are classified into three types and processed by dynamically assembled operator chains:

Single‑hop (e.g., "What is the definition of Information Scent?"): Extract → Select_by_Entity → Reason → Reduce

Multi‑hop (e.g., "Difference between Transformer and RNN long‑range dependencies?"): Decompose → (Single‑hop)×n → Map → Reduce

Global (e.g., "How many figures are in the first 10 pages?"): Filter_Range → Filter_Modal → Map → Reduce
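The dynamic assembly of operator chains can be sketched as functions threaded through a shared state. The operator names follow the paper; their bodies here are stubs and the dispatch logic is an assumption for illustration:

```python
from typing import Callable

Operator = Callable[[dict], dict]

def extract(state: dict) -> dict:
    # Stub: pull capitalized tokens from the query as candidate entities.
    state["entities"] = [w for w in state["query"].split() if w.istitle()]
    return state

def select_by_entity(state: dict) -> dict:
    # In BookRAG this step would follow GT-Link from entities to tree nodes.
    state["blocks"] = [f"block_for:{e}" for e in state["entities"]]
    return state

def reason(state: dict) -> dict:
    state["evidence"] = state["blocks"]
    return state

def reduce_answer(state: dict) -> dict:
    state["answer"] = " + ".join(state["evidence"])
    return state

# Chains are assembled per query type and executed in order.
CHAINS: dict[str, list[Operator]] = {
    "single_hop": [extract, select_by_entity, reason, reduce_answer],
}

def run(query: str, query_type: str) -> str:
    state: dict = {"query": query}
    for op in CHAINS[query_type]:
        state = op(state)
    return state["answer"]
```

The multi‑hop and global chains would follow the same pattern, with Decompose fanning a query out into several single‑hop sub‑chains before a final Map → Reduce.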

Multi‑Dimensional Skyline Filtering

Considers both graph‑node importance and textual semantic relevance.

Retains blocks that are optimal in at least one dimension; a candidate set of roughly 10 blocks is sufficient to reach top‑line performance (see Table 6 in the paper).
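Skyline filtering is standard Pareto‑frontier selection; here is a minimal sketch assuming each candidate block carries exactly the two scores named above. A block survives iff no other block dominates it (at least as good on both dimensions, strictly better on one):

```python
def skyline(blocks: list[tuple[str, float, float]]) -> list[str]:
    """Return ids of non-dominated blocks.

    Each block is (block_id, graph_importance, semantic_relevance).
    Naive O(n^2) check, which is fine for candidate sets of ~10 blocks.
    """
    survivors = []
    for bid, gi, sr in blocks:
        dominated = any(
            (g2 >= gi and s2 >= sr) and (g2 > gi or s2 > sr)
            for _, g2, s2 in blocks
        )
        if not dominated:
            survivors.append(bid)
    return survivors
```

For example, a block that is mediocre on both dimensions is dropped even if no single dimension ranks it last, because some other block beats it on both at once.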

BookRAG’s operator library defines four categories—Formulator, Selector, Reasoner, Synthesizer—and provides a step‑by‑step execution trace for a single‑hop query on the MMLongBench dataset, visualizing planning and incremental operator execution.

Efficiency and Scalability

Token consumption: < 5 M tokens vs. 53 M for the strongest baseline (DocETL), roughly a 10× saving.

Query latency: about half that of the strongest baseline, a 2× speedup.

GPU memory: runs on 8 × A5000 24 GB GPUs, the same configuration as the baseline, with no extra memory overhead.

Cost‑effective long‑document reasoning

Figure 5: BookRAG maintains multimodal capability while reducing computational cost to a practical range.

Key Takeaway

BookRAG first builds a hierarchical “catalog” of the document, then uses an agent to navigate the catalog, jump across chapters, and synthesize answers. This combination of structure‑aware indexing and dynamic foraging retrieval yields high accuracy, high recall, and low computational cost for long‑document question answering.

Tags: LLM, Retrieval-Augmented Generation, multimodal retrieval, Hierarchical Indexing, Agent Retrieval, Long Document QA
Written by PaperAgent
Daily updates, analyzing cutting-edge AI research papers