Zooming Without Zooming: One‑Pass Fine‑Grained Vision for Multimodal LLMs

A new Region‑to‑Image Distillation (R2I) approach lets multimodal large language models perceive tiny visual details in a single forward pass, eliminating costly tool calls while achieving state‑of‑the‑art accuracy on the ZoomBench fine‑grained benchmark.

AntTech

The Nearsightedness of AI Vision

Current multimodal large language models (MLLMs) excel at understanding broad scenes but struggle with fine-grained perception: they often miss tiny text, overlook subtle color or texture differences, and miscount dense clusters of small objects, because the critical evidence occupies only a tiny fraction of the image.

Turning Zoom from Inference Tool to Training Objective

To avoid the heavy latency of repeated tool calls in the "Thinking-with-Images" paradigm, the authors propose Region-to-Image Distillation (R2I). The method first zooms in, cropping small patches to synthesize high-quality question-answer data, and then zooms out, mapping that data back to the full-image view and training the model with reinforcement learning. At inference time, the model perceives fine details directly, with no zoom operations.

R2I Workflow

1. Locate tiny regions: Use an object detector to find patches covering less than 10% of the image that contain key visual evidence.

2. Generate high-quality data: Prompt the model on the cropped high-resolution patches to produce perception-related questions.

3. Consensus hallucination filtering: Apply multi-model voting and keep only high-confidence annotations to suppress hallucinations.

4. Eliminate referential ambiguity: Overlay bounding boxes on the original image and add spatial constraints (e.g., "only look at the red box") to the questions.

5. Difficulty filtering: Use rejection sampling to discard overly easy samples, ensuring efficient training.

6. Reinforcement-learning training: Train on the full-image view with the synthetic Q&A pairs so the model learns to locate and recognize key evidence without any tool calls.

In effect, the model wears a virtual magnifying glass during data synthesis but learns to see clearly with a single forward pass at test time.
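A minimal sketch of the synthesis loop (covering steps 1, 3, and 4 above) is below. It assumes candidate regions and draft Q&A pairs arrive from upstream detection and generation (steps 1-2), and abstracts each voting model as a callable; every name here is illustrative, not the authors' released API.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class R2ISample:
    question: str   # spatially constrained question over the full image
    answer: str     # consensus answer obtained on the high-res patch
    bbox: tuple     # (x0, y0, x1, y1) key-evidence region, in pixels

def synthesize(image_size, candidates, voters,
               max_area_ratio=0.10, min_agreement=1.0):
    """Filter and rewrite candidate Q&A pairs into training samples.

    image_size: (width, height) of the full image.
    candidates: iterable of (bbox, question, answer) triples drafted on
        the cropped high-resolution patch (output of steps 1-2).
    voters: list of callables (bbox, question) -> answer, each wrapping
        one MLLM prompted on the high-res crop (hypothetical helpers).
    """
    W, H = image_size
    samples = []
    for bbox, question, answer in candidates:
        x0, y0, x1, y1 = bbox
        # Step 1 filter: keep only tiny regions (<10% of image area).
        if (x1 - x0) * (y1 - y0) > max_area_ratio * W * H:
            continue
        # Step 3: consensus hallucination filtering -- independent
        # models re-answer on the patch; keep high-agreement samples.
        votes = Counter(v(bbox, question) for v in voters)
        top_answer, count = votes.most_common(1)[0]
        if top_answer != answer or count / len(voters) < min_agreement:
            continue
        # Step 4: eliminate referential ambiguity -- the full image
        # carries an overlaid box, and the question is pinned to it.
        constrained = f"Answer using only the region in the red box: {question}"
        samples.append(R2ISample(constrained, answer, bbox))
    return samples
```

Steps 5 and 6 (rejection-sampling difficulty filtering and reinforcement learning on the full-image view) would consume the output of this function.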

ZoomBench: A New Fine‑Grained Perception Benchmark

ZoomBench contains 845 high‑quality samples covering six perception dimensions (fine‑grained counting, OCR, color attributes, structural attributes, material attributes, and object recognition). Each sample is AI‑generated and then manually verified by three authors.

The benchmark adopts a dual‑view evaluation mode: every sample provides both the full‑image (global view) and a cropped key‑region (local view). Accuracy on the local view represents the theoretical upper bound, while accuracy on the global view measures real‑world fine‑grained perception. The gap between them is called the "Zooming Gap" and quantifies how often models miss critical evidence.
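Concretely, the Zooming Gap reduces to the difference between two accuracies. The record format in the sketch below is an assumption made for illustration, not ZoomBench's actual schema.

```python
def zooming_gap(results):
    """results: per-sample dicts with boolean 'correct_local' (cropped
    key-region view) and 'correct_global' (full-image view) flags."""
    results = list(results)
    n = len(results)
    acc_local = sum(r["correct_local"] for r in results) / n    # upper bound
    acc_global = sum(r["correct_global"] for r in results) / n  # real accuracy
    return acc_local - acc_global  # how often key evidence is missed

# Toy example: perfect on crops, wrong on every fourth full image.
demo = [{"correct_local": True, "correct_global": i % 4 != 0}
        for i in range(845)]
print(f"Zooming Gap: {zooming_gap(demo):+.1%}")  # about +25.1%
```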

ZoomBench also supplies automatic bounding‑box annotations for interpretability analysis.

Performance Comparison

Starting from Qwen-VL base models, the authors fine-tuned three ZwZ variants (4B, 7B, and 8B parameters) with R2I-generated data. Their average scores on comprehensive perception tasks are 76.31 (4B), 72.25 (7B), and 77.64 (8B), with the 8B model approaching the closed-source Gemini-3-Flash.

Importantly, ZwZ‑8B achieves roughly a ten‑fold speedup over traditional "Thinking‑with‑Images" agents because it requires only a single forward pass.
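The speedup follows from simple latency accounting: an agent pays for a tool call plus a fresh forward pass on every zoom round, while ZwZ pays for one pass. The numbers below are made up purely to show the structure, not measured values.

```python
def agent_latency(forward_ms, tool_ms, rounds):
    # "Thinking-with-Images": an initial pass, then each zoom round
    # adds a tool call plus another forward pass over the new crop.
    return forward_ms + rounds * (tool_ms + forward_ms)

def zwz_latency(forward_ms):
    return forward_ms  # one forward pass, no tool calls

base = 500  # hypothetical per-pass latency in milliseconds
print(agent_latency(base, tool_ms=100, rounds=8))  # 5300 ms
print(zwz_latency(base))                           # 500 ms, roughly 10x faster
```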

Beyond benchmarks, ZwZ shows strong generalization on real‑world tasks such as AIGC detection and GUI agents.

Deep Insights: When to Use Tools and When Not To

The paper distinguishes two tool categories based on information gain:

Information-gain tools (e.g., web search) introduce new, unpredictable information that the model cannot derive from its inputs, so they must still be invoked at inference.

Non‑information‑gain tools (e.g., zoom, rotate, denoise) merely reformat existing information and can be internalized into the model during training.

Zooming is a classic non‑information‑gain operation; R2I trains the model to perform "mental zoom" so that no tool call is needed at inference.
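This taxonomy can be read as a simple inference-time routing policy. The tool names and the policy below are illustrative, not from the paper's implementation.

```python
# Tools that fetch information the model cannot derive from its inputs.
INFO_GAIN_TOOLS = {"web_search", "database_lookup"}
# Tools that only reformat pixels the model already has.
NO_GAIN_TOOLS = {"zoom", "rotate", "denoise"}

def should_call_tool(tool: str) -> bool:
    if tool in INFO_GAIN_TOOLS:
        return True    # new, unpredictable information: must invoke
    if tool in NO_GAIN_TOOLS:
        return False   # internalized during training (e.g., mental zoom)
    return True        # unknown tool: allow the call to stay safe

assert should_call_tool("web_search") and not should_call_tool("zoom")
```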

Future work should aim to internalize all non‑information‑gain tools while keeping the ability to invoke true information‑gain tools when beneficial.

Conclusion

The study introduces an efficient data-synthesis pipeline and the Region-to-Image Distillation paradigm, converting the zoom tool from a costly inference-time operation into a training objective. This enables multimodal LLMs to achieve fine-grained visual perception in a single forward pass, offering a practical path toward fast and accurate multimodal understanding.

Code: https://github.com/inclusionAI/Zooming-without-Zooming
Models: https://huggingface.co/collections/inclusionAI/zooming-without-zooming
Paper: https://arxiv.org/abs/2602.11858
Tags: multimodal AI, large language models, model efficiency, fine-grained perception, region-to-image distillation, ZoomBench