Doc‑V*: Reading Only 5 Pages Beats RAG on 80‑Page Docs – 10 Key Insights
Doc‑V* introduces a dynamic, thumbnail‑driven approach that lets a model decide which pages to read, achieving a 49.7% improvement over RAG variants on multi‑page document QA benchmarks without larger models or longer context windows, and demonstrates how strategic evidence acquisition outperforms naïve full‑document reading.
In multi‑page document understanding, the prevailing assumption is that a model must "see" as many pages as possible, yet humans rarely read a long report page‑by‑page; they skim the table of contents, locate relevant sections, and then read in depth. The article questions why existing models cannot adopt a similar strategy.
The authors identify two dominant paradigms: (1) static input, where all pages are fed to the model at once, which incurs high computational cost and loses information as document length grows; and (2) retrieval‑only methods, which select a subset of pages before inference and fail when crucial pages are missed, because the model cannot recover them later. Both lack dynamic, on‑the‑fly information acquisition.
Doc‑V* proposes a new paradigm that shifts from "static reading" to "active exploration". First, a global thumbnail overview compresses every page into a low‑resolution thumbnail arranged in a grid, giving the model a cheap global view of the document's structure. Two interactive operations are then defined:
- <retrieval_page>: coarse‑grained semantic search over the whole document that returns the top‑k most relevant, not‑yet‑seen pages, supporting multi‑round query refinement.
- <fetch_page>: deterministic fetch of high‑resolution images for explicitly requested page indices, enabling precise evidence location based on structural cues from the thumbnails.
These operations complement each other: retrieval casts a wide net for potential evidence, while fetch homes in on specific pages (e.g., tables, figures, or explicit page‑number queries). Training uses a two‑stage SFT + GRPO strategy so the model learns when to invoke each operation and how to integrate the accumulated evidence.
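To make the interplay of the two operations concrete, here is a minimal Python sketch of an exploration loop of this shape. All names (`Document`, `retrieval_page`, `fetch_page`, `explore`) are illustrative, not the paper's actual API: keyword overlap stands in for the real embedding‑based retriever, truncated strings stand in for page images, and the control flow is hard‑coded where Doc‑V*'s trained policy would decide.

```python
from dataclasses import dataclass


@dataclass
class Document:
    pages: list[str]  # full-resolution page contents (stand-in for page images)

    def thumbnails(self) -> list[str]:
        # Cheap global view: truncation stands in for low-resolution thumbnails.
        return [p[:20] for p in self.pages]


def retrieval_page(doc: Document, query: str, seen: set[int], k: int = 2) -> list[int]:
    # Coarse-grained semantic search over not-yet-seen pages.
    # Keyword overlap stands in for a real embedding-based retriever.
    scores = {
        i: sum(word in page.lower() for word in query.lower().split())
        for i, page in enumerate(doc.pages)
        if i not in seen
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def fetch_page(doc: Document, indices: list[int]) -> dict[int, str]:
    # Deterministic fetch of explicitly requested page indices.
    return {i: doc.pages[i] for i in indices if 0 <= i < len(doc.pages)}


def explore(doc: Document, query: str, budget: int = 5) -> dict[int, str]:
    seen: set[int] = set()
    evidence: dict[int, str] = {}
    # Round 1: retrieval casts a wide net for potential evidence.
    for i in retrieval_page(doc, query, seen):
        evidence[i] = doc.pages[i]
        seen.add(i)
    # Later rounds: the policy may fetch a specific page hinted at by the
    # thumbnails (e.g. a table continuing on the next page), within budget.
    if seen and len(seen) < budget:
        hinted = max(seen) + 1  # hard-coded stand-in for a policy decision
        if hinted < len(doc.pages):
            evidence.update(fetch_page(doc, [hinted]))
            seen.add(hinted)
    return evidence
```

In the real system the backbone model itself, trained with SFT + GRPO, emits these operation calls and decides when to stop; the sketch only shows how retrieval and fetch compose into a page‑budgeted evidence set.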
Experiments using Qwen2.5‑VL 7B as the backbone show that Doc‑V* outperforms RAG variants by 49.7% on several multi‑page QA benchmarks, without requiring larger models or longer context windows. An analysis of page count versus performance reveals that RAG methods improve initially as more pages are added but then degrade from information overload, whereas Doc‑V* remains stable because it incorporates new pages only when needed.
Further results across datasets (SlideVQA, LongDocURL, MMLongBench‑Doc) confirm that static retrieval suffers from a delicate balance between coverage and noise, while Doc‑V*’s dynamic evidence acquisition consistently mitigates this trade‑off. The authors argue that the key to effective long‑document QA is "strategy‑driven information retrieval" rather than indiscriminate content stacking.
Overall, Doc‑V* demonstrates that enabling a model to decide where and when to look—mirroring human reading behavior—yields more efficient and reliable reasoning on long documents, offering a promising direction for future document‑understanding research.