AntTech
Mar 4, 2026 · Artificial Intelligence

Zooming Without Zooming: One‑Pass Fine‑Grained Vision for Multimodal LLMs

A new Region‑to‑Image Distillation (R2I) approach lets multimodal large language models perceive tiny visual details in a single forward pass, eliminating costly tool calls while achieving state‑of‑the‑art accuracy on the ZoomBench fine‑grained benchmark.

Model Efficiency · ZoomBench · fine-grained perception
11 min read
Data Party THU
Nov 5, 2025 · Artificial Intelligence

How VLM‑FO1 Turns Vision‑Language Models into Precise Perception Machines

VLM‑FO1 introduces a generate‑plus‑reference paradigm that replaces coordinate generation with region‑token referencing. Its plug‑in modules, including a proposal generator, a hybrid fine‑grained encoder, and a region‑language connector, give any pretrained vision‑language model accurate, fine‑grained perception while preserving its original capabilities.

AI research · Plug-and-Play · VLM
15 min read