How BookRAG Redefines Long-Document Retrieval with Hierarchical Indexing
BookRAG introduces a hierarchical, structure‑aware indexing method that combines a tree‑based document representation with graph‑based entity linking and an agent‑driven retrieval pipeline. On multimodal long‑document benchmarks it improves retrieval recall by up to 71.2% while dramatically cutting token usage and latency.
https://arxiv.org/pdf/2512.03413
https://github.com/sam234990/BookRAG

Problem Statement
Traditional Retrieval‑Augmented Generation (RAG) applied to PDFs suffers from two major limitations:
PDFs are treated as plain text, discarding hierarchical semantics such as chapters, sections, figures, and nested tables.
The retrieval pipeline is static and cannot simultaneously satisfy definition‑style queries and cross‑chapter reasoning, leading to over‑ or under‑retrieval.
Figure 1: Pure‑text RAG (a) and layout‑segmented RAG (b) cannot preserve both structural dependencies and cross‑modal relations, whereas BookRAG (c) natively perceives hierarchy.
State‑of‑the‑Art Results
Across three multimodal long‑document benchmarks, BookRAG surpasses previous SOTA methods, improving retrieval recall by up to 71.2% on M3DocVQA.
Technical Solution: BookIndex × Agent‑Based Retrieval
BookIndex – Splitting a Book into a Tree and a Graph
First parse the layout into a tree, then extract entities to build a graph, and finally bind them bidirectionally with GT‑Link.
Tree T : Preserves document hierarchy (chapter → section → paragraph → figure/table). Built using layout parsing followed by LLM‑based hierarchical correction.
Graph G : Captures entity relations and supports multi‑hop reasoning. Constructed with gradient‑based entity disambiguation (Algorithm 1).
GT‑Link M : Maps entities back to tree nodes, enabling dual “structure‑semantic” localization. Maintained incrementally with merge‑on‑update semantics.
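The three structures above can be sketched as plain Python data types. This is an illustrative sketch, not the paper's actual API: the names (`TreeNode`, `BookIndex`, `bind`, `add_relation`) are hypothetical, and the graph/GT‑Link are reduced to adjacency dictionaries.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    node_id: str
    kind: str  # "chapter" | "section" | "paragraph" | "figure" | "table"
    text: str = ""
    children: list["TreeNode"] = field(default_factory=list)

@dataclass
class BookIndex:
    root: TreeNode
    # Graph G: entity -> related entities (undirected, for multi-hop reasoning)
    graph: dict[str, set[str]] = field(default_factory=dict)
    # GT-Link M: entity -> tree node ids (structure-semantic localization)
    gt_link: dict[str, set[str]] = field(default_factory=dict)

    def add_relation(self, a: str, b: str) -> None:
        self.graph.setdefault(a, set()).add(b)
        self.graph.setdefault(b, set()).add(a)

    def bind(self, entity: str, node_id: str) -> None:
        # Merge-on-update: new bindings are unioned into the existing set
        # rather than rebuilding the mapping from scratch.
        self.gt_link.setdefault(entity, set()).add(node_id)

# Usage: locate every tree node that mentions an entity.
index = BookIndex(root=TreeNode("root", "chapter"))
index.bind("Information Scent", "sec-2.1")
index.bind("Information Scent", "fig-3")
print(sorted(index.gt_link["Information Scent"]))  # ['fig-3', 'sec-2.1']
```

The dual mapping lets retrieval jump from a matched entity straight to its position in the hierarchy, and from a tree node back to the entities it mentions.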
Gradient‑Based Entity Disambiguation
Retrieve vectors for all entities, then examine the rerank score curve to detect a sharp “cliff”. This automatically separates same‑class from different‑class entities.
Complexity is O(n) for full‑graph disambiguation, avoiding the traditional O(n²) pairwise comparison.
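The cliff detection can be sketched in a few lines. This is a minimal illustration of the idea, assuming rerank scores have already been computed and sorted in descending order; the function name and the toy scores are ours, not the paper's.

```python
def split_by_cliff(scores: list[float]) -> int:
    """Return the index where the rerank-score curve drops most sharply.

    Scores are assumed sorted in descending order; entities before the
    cliff are treated as same-class, the rest as different-class.
    A single O(n) pass replaces O(n^2) pairwise comparison.
    """
    if len(scores) < 2:
        return len(scores)
    # Discrete "gradient": drop between each pair of adjacent scores.
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    return drops.index(max(drops)) + 1

scores = [0.97, 0.95, 0.93, 0.41, 0.38, 0.35]  # toy rerank scores
k = split_by_cliff(scores)
print(scores[:k])  # [0.97, 0.95, 0.93] -> same-class entities
```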
Agent‑Based Retrieval – Human‑like Book Flipping
Planning → scent/filter foraging → answer synthesis, with Pareto‑frontier (Skyline) filtering that retains only non‑dominated multi‑dimensional solutions.
The system is inspired by Information Foraging Theory (IFT). Queries are classified into three types and processed by dynamically assembled operator chains:
Single‑hop (e.g., “What is the definition of Information Scent?”): Extract → Select_by_Entity → Reason → Reduce
Multi‑hop (e.g., “Difference between Transformer and RNN long‑range dependencies?”): Decompose → (Single‑hop)×n → Map → Reduce
Global (e.g., “How many figures are in the first 10 pages?”): Filter_Range → Filter_Modal → Map → Reduce
Multi‑Dimensional Skyline Filtering
Considers both graph‑node importance and textual semantic relevance.
Retains blocks that are optimal in at least one dimension; a candidate set of roughly 10 blocks is sufficient to reach top‑line performance (see Table 6 in the paper).
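A minimal two‑dimensional sketch of the Skyline (Pareto‑frontier) filter, assuming graph importance and semantic relevance scores are already computed per block. The function name and the toy scores are illustrative; the paper's filter may consider more dimensions.

```python
def skyline(candidates: list[tuple[str, float, float]]) -> list[str]:
    """Keep candidate blocks not dominated in (importance, relevance).

    A block is dominated if some other block is >= on both dimensions
    and strictly > on at least one. Non-dominated blocks form the
    Pareto frontier (Skyline).
    """
    kept = []
    for name, imp, rel in candidates:
        dominated = any(
            (i2 >= imp and r2 >= rel) and (i2 > imp or r2 > rel)
            for n2, i2, r2 in candidates
            if n2 != name
        )
        if not dominated:
            kept.append(name)
    return kept

blocks = [("b1", 0.9, 0.2), ("b2", 0.5, 0.8), ("b3", 0.4, 0.4), ("b4", 0.9, 0.8)]
print(skyline(blocks))  # b4 dominates every other block -> ['b4']
```

Because a block survives by being best on any one dimension, highly central graph nodes and highly relevant text blocks are both retained without hand‑tuning a weighted score.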
BookRAG’s operator library defines four categories: Formulator, Selector, Reasoner, and Synthesizer. The paper walks through a step‑by‑step execution trace for a single‑hop query on the MMLongBench dataset, visualizing planning and incremental operator execution.
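The routing from query type to operator chain can be sketched as a toy dispatcher. The operator names follow the chains listed above, but the keyword‑based classifier is purely illustrative; the actual system uses an agent planner to classify queries and assemble chains dynamically.

```python
# Operator chains for the three query types described in the article.
CHAINS = {
    "single_hop": ["Extract", "Select_by_Entity", "Reason", "Reduce"],
    "multi_hop":  ["Decompose", "(Single-hop) x n", "Map", "Reduce"],
    "global":     ["Filter_Range", "Filter_Modal", "Map", "Reduce"],
}

def classify(query: str) -> str:
    # Toy heuristic stand-in for the agent's LLM-based planner.
    q = query.lower()
    if any(w in q for w in ("how many", "count", "total", "all figures")):
        return "global"
    if any(w in q for w in ("difference", "compare", "versus")):
        return "multi_hop"
    return "single_hop"

def plan(query: str) -> list[str]:
    return CHAINS[classify(query)]

print(plan("What is the definition of Information Scent?"))
# ['Extract', 'Select_by_Entity', 'Reason', 'Reduce']
```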
Efficiency and Scalability
Token consumption: < 5 M tokens vs. 53 M for the strongest baseline (DocETL) → 10× savings.
Query latency: roughly half that of the strongest baseline → 2× speedup.
GPU memory : Runs on 8 × A5000 24 GB GPUs, same configuration as baseline, showing no extra memory overhead.
Figure 5: BookRAG maintains multimodal capability while reducing computational cost to a practical range.
Key Takeaway
BookRAG first builds a hierarchical “catalog” of the document, then uses an agent to navigate the catalog, jump across chapters, and synthesize answers. This combination of structure‑aware indexing and dynamic foraging retrieval yields high accuracy, high recall, and low computational cost for long‑document question answering.