Zooming Without Zooming: One‑Pass Fine‑Grained Vision for Multimodal LLMs

A new Region‑to‑Image Distillation (R2I) approach lets multimodal large language models perceive tiny visual details in a single forward pass, eliminating costly tool calls while achieving state‑of‑the‑art accuracy on the ZoomBench fine‑grained benchmark.

ZoomBench, fine-grained perception, large language models

