Doc‑V*: Reading Only 5 Pages Beats RAG on 80‑Page Docs – 10 Key Insights

Doc‑V* introduces a dynamic, thumbnail‑driven approach that lets a model decide which pages to read. It achieves a 49.7% improvement over RAG variants on multi‑page document QA benchmarks without larger models or longer context windows, demonstrating that strategic evidence acquisition outperforms naïve full‑document reading.

Machine Heart

In multi‑page document understanding, the prevailing assumption is that a model must "see" as many pages as possible, yet humans rarely read a long report page‑by‑page; they skim the table of contents, locate relevant sections, and then read in depth. The article questions why existing models cannot adopt a similar strategy.

The authors identify two dominant paradigms: (1) static input, where all pages are fed to the model at once, which drives up computational cost and causes earlier information to be forgotten as document length grows; and (2) retrieval‑only methods, which select a subset of pages before inference and cannot recover when a crucial page is missed. Neither supports dynamic, on‑the‑fly information acquisition.
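For reference, the two baseline paradigms reduce schematically to something like the sketch below; `vlm_answer` and `score_page_relevance` are placeholder names for a vision‑language model call and a page retriever, not any specific system described in the paper.

```python
def static_input_answer(question, page_images):
    # Paradigm (1): feed every page into the context at once.
    # Cost grows with document length, and distant evidence is easily forgotten.
    return vlm_answer(question, page_images)


def retrieval_only_answer(question, page_images, top_k=5):
    # Paradigm (2): select top-k pages once, before inference.
    # If a crucial page is missed here, the model has no way to recover later.
    ranked = sorted(range(len(page_images)),
                    key=lambda i: score_page_relevance(question, page_images[i]),
                    reverse=True)
    return vlm_answer(question, [page_images[i] for i in ranked[:top_k]])
```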

Doc‑V* proposes a new paradigm that shifts from "static reading" to "active exploration". First, a global thumbnail overview compresses every page into a low‑resolution thumbnail arranged in a grid, giving the model a cheap global view of document structure. Two interactive operations are then defined: retrieval_page, a coarse‑grained semantic search over the whole document that returns the top‑k most relevant yet‑unseen pages and supports multi‑round query refinement; and fetch_page, a deterministic fetch of high‑resolution images for explicitly requested page indices, enabling precise evidence location based on structural cues from the thumbnails.
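A minimal sketch of how the thumbnail overview and the two operations could be implemented follows. The paper's actual interfaces are not given in this summary, so the function signatures, the embedding representation, the similarity metric, and the grid dimensions here are assumptions.

```python
import numpy as np
from dataclasses import dataclass, field
from PIL import Image


def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def thumbnail_overview(page_images: list, cols: int = 8, thumb_size=(96, 128)) -> Image.Image:
    """Global thumbnail overview: arrange a low-resolution thumbnail of every page
    into one grid image, giving the model a cheap view of document structure."""
    rows = (len(page_images) + cols - 1) // cols
    grid = Image.new("RGB", (cols * thumb_size[0], rows * thumb_size[1]), "white")
    for i, page in enumerate(page_images):
        thumb = page.copy()
        thumb.thumbnail(thumb_size)
        grid.paste(thumb, ((i % cols) * thumb_size[0], (i // cols) * thumb_size[1]))
    return grid


@dataclass
class DocumentState:
    """One document's pages plus a record of which pages the model has already seen."""
    page_images: list                     # high-resolution page images
    page_embeddings: list                 # one embedding per page (assumed precomputed)
    seen_pages: set = field(default_factory=set)


def retrieval_page(state: DocumentState, query_embedding, top_k: int = 3) -> list:
    """Coarse-grained semantic search over the whole document: return the indices of
    the top-k most relevant pages not yet seen (supports repeated, refined queries)."""
    scored = sorted(
        ((cosine_similarity(query_embedding, emb), i)
         for i, emb in enumerate(state.page_embeddings)
         if i not in state.seen_pages),
        reverse=True,
    )
    picked = [i for _, i in scored[:top_k]]
    state.seen_pages.update(picked)
    return picked


def fetch_page(state: DocumentState, page_indices: list) -> list:
    """Deterministic fetch: return high-resolution images for explicitly requested pages."""
    state.seen_pages.update(page_indices)
    return [state.page_images[i] for i in page_indices]
```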

These operations complement each other: retrieval casts a wide net for potential evidence, while fetch homes in on specific pages (e.g., tables, figures, or explicit page‑number queries). Training uses a two‑stage SFT + GRPO strategy so the model learns when to invoke each operation and how to integrate the accumulated evidence.
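At inference time, this interplay could look like the loop below. It is only a sketch of the active‑exploration idea: `vlm_generate` and `embed_query` are placeholders for the backbone VLM (Qwen2.5‑VL 7B in the paper) and a query encoder, and the action format is an assumption rather than the paper's actual protocol.

```python
def answer_question(question: str, state: DocumentState, max_rounds: int = 5) -> str:
    """Active exploration: start from the cheap thumbnail grid, then pull in
    high-resolution pages only when the model decides it needs them."""
    evidence = [thumbnail_overview(state.page_images)]   # cheap global view first
    for _ in range(max_rounds):
        # vlm_generate (placeholder) returns either a tool request or a final answer.
        step = vlm_generate(question=question, images=evidence)
        if step["action"] == "retrieval_page":
            # Wide net: semantic search for candidate evidence pages.
            pages = retrieval_page(state, embed_query(step["query"]),
                                   top_k=step.get("top_k", 3))
            evidence += [state.page_images[i] for i in pages]
        elif step["action"] == "fetch_page":
            # Precise strike: structural cues from thumbnails ("the table on page 12").
            evidence += fetch_page(state, step["page_indices"])
        else:
            return step["answer"]   # the model judges its evidence sufficient
    # Fall back to answering from whatever evidence has been gathered so far.
    return vlm_generate(question=question, images=evidence, force_answer=True)["answer"]
```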

Experiments with Qwen2.5‑VL 7B as the backbone show that Doc‑V* outperforms RAG variants by 49.7% on several multi‑page QA benchmarks, without requiring larger models or longer context windows. An analysis of page count versus performance reveals that RAG methods improve at first as more pages are added but then degrade from information overload, whereas Doc‑V* remains stable because it incorporates new pages only when needed.

Further results across datasets (SlideVQA, LongDocURL, MMLongBench‑Doc) confirm that static retrieval faces a delicate trade‑off between coverage and noise, while Doc‑V*’s dynamic evidence acquisition consistently mitigates this tension. The authors argue that the key to effective long‑document QA is "strategy‑driven information retrieval" rather than indiscriminate content stacking.

Overall, Doc‑V* demonstrates that enabling a model to decide where and when to look—mirroring human reading behavior—yields more efficient and reliable reasoning on long documents, offering a promising direction for future document‑understanding research.
