How VLM‑FO1 Turns Vision‑Language Models into Precise Perception Machines
VLM‑FO1 introduces a generate‑plus‑reference paradigm that replaces coordinate generation with region‑token referencing. Plug‑in modules (a proposal generator, a hybrid fine‑grained encoder, and a region‑language connector) give any pretrained vision‑language model accurate, fine‑grained perception while preserving its original capabilities.
Research Background
General vision‑language models (VLMs) excel at multimodal understanding but often "see" without being able to pinpoint details: they can answer what is in an image yet fail to say precisely where. The root cause is architectural: autoregressive generation is natural for language output but ill‑suited to predicting continuous coordinates, so fine‑grained localization remains a non‑native task for a language‑oriented model.
Even state‑of‑the‑art VLMs such as Qwen2.5‑VL‑72B score below 40 mAP on COCO detection, far behind traditional detectors that routinely exceed 60.
Prior Attempts
Coordinate Quantization: maps continuous coordinates to discrete bin labels, turning regression into classification at the cost of noticeable precision loss.
Detection‑Head Attachment: bolts a dedicated detection head onto the VLM to boost accuracy, but breaks end‑to‑end consistency.
Retraining from Scratch: jointly retrains visual and language branches, which is costly and hard to reproduce.
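To make the precision loss of coordinate quantization concrete, here is a minimal sketch. The bin count and round‑trip scheme are illustrative assumptions, not the scheme used by any particular model; the point is that recovering a coordinate from its bin label can be off by up to half a bin width.

```python
def quantize(coord: float, num_bins: int = 1000) -> int:
    """Map a normalized coordinate in [0, 1] to a discrete bin label."""
    return min(int(coord * num_bins), num_bins - 1)

def dequantize(bin_id: int, num_bins: int = 1000) -> float:
    """Recover the bin-center coordinate from a discrete label."""
    return (bin_id + 0.5) / num_bins

# Round-tripping a coordinate loses up to half a bin width of precision:
# with 1000 bins, the error is bounded by 0.5 / 1000 = 5e-4.
x = 0.37342
x_hat = dequantize(quantize(x))
error = abs(x - x_hat)
assert error <= 0.5 / 1000
```

With a 1000‑pixel‑wide image this bound corresponds to roughly half a pixel per coordinate, and errors accumulate across the four values of a box, which is one source of the precision gap noted above.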
All of these approaches either sacrifice accuracy or compromise the model's generality.
Methodology: VLM‑FO1 Overview
The Om AI Lab team proposes VLM‑FO1, a "generate + reference" hybrid paradigm that leaves the VLM backbone untouched and injects new visual perception capability through modular plug‑in components.
Key Components
Object Proposal Network (OPN): an optional, plug‑in region proposal module that supplies candidate foreground regions.
Hybrid Fine‑grained Region Encoder (HFRE): a dual‑encoder design consisting of:
Primary Encoder: reuses the original VLM visual backbone to produce semantic‑level features.
Auxiliary Encoder: processes high‑resolution patches to capture texture, edge, and fine‑grained details.
Region–Language Connector (RLC): projects the concatenated visual features (with positional encoding) into the language embedding space, yielding <region0>, <region1>, … tokens that can be directly referenced during language generation.
Token‑Based Referencing: replaces explicit coordinate generation with direct calls to region tokens, eliminating cumulative errors and enabling stable multi‑object reasoning.
The overall architecture (see Image 2) shows the original VLM (blue dashed box) surrounded by the plug‑in modules, forming a seamless “plug‑and‑play” enhancement.
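The data flow through HFRE and the RLC can be sketched in a few lines of NumPy. Everything here is an illustrative assumption: the dimensions, the random stand‑in weights, and the use of a normalized box as the positional signal are placeholders for the paper's learned components, shown only to make the fuse‑then‑project idea concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not the paper's actual sizes).
D_SEM, D_FINE, D_LLM = 1024, 512, 4096

# Stand-in for the RLC's learned projection into the LLM embedding space.
W_rlc = rng.normal(scale=0.02, size=(D_SEM + D_FINE + 4, D_LLM))

def encode_region(semantic_feat, fine_feat, box):
    """Fuse dual-encoder features with box position, project to LLM space."""
    # box = (x1, y1, x2, y2), normalized to [0, 1], used as positional encoding.
    fused = np.concatenate([semantic_feat, fine_feat, np.asarray(box)])
    return fused @ W_rlc  # one <region_i> token embedding

# One region token per proposal; at generation time the LLM *references*
# <region_i> tokens instead of emitting raw coordinates.
proposals = [(0.1, 0.2, 0.5, 0.6), (0.4, 0.4, 0.9, 0.8)]
region_tokens = [
    encode_region(rng.normal(size=D_SEM), rng.normal(size=D_FINE), box)
    for box in proposals
]
assert all(tok.shape == (D_LLM,) for tok in region_tokens)
```

Because each region is a token in the language model's own embedding space, multi‑object answers become ordinary token references rather than chains of coordinate predictions, which is what eliminates the cumulative error mentioned above.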
Training Strategy
VLM‑FO1 adopts a two‑stage optimization that keeps the backbone frozen:
Stage 1 – Region‑Language Alignment: freeze the VLM, train HFRE and the mapping layer so that region tokens align with the language model’s embedding space.
Stage 2 – Perception SFT: unfreeze the auxiliary encoder and language layers, fine‑tune on detection, OCR, counting, and RefExp data while mixing general multimodal corpora to prevent forgetting.
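The freeze/unfreeze schedule above can be summarized as a small table of trainable flags. The module names and the exact stage split are assumptions for illustration; the one property the source states directly, and that the sketch enforces, is that the VLM backbone stays frozen throughout.

```python
# Hypothetical module registry; True means the module's weights are updated.
MODULES = ["vlm_backbone", "primary_encoder", "aux_encoder",
           "hfre_fusion", "rlc_projection", "llm_layers"]

def trainable_modules(stage: int) -> dict:
    """Return which modules are unfrozen in each training stage (assumed split)."""
    if stage == 1:
        # Stage 1: align region tokens with the LLM embedding space.
        unfrozen = {"hfre_fusion", "rlc_projection"}
    elif stage == 2:
        # Stage 2: perception SFT; also unfreeze aux encoder and LLM layers.
        unfrozen = {"hfre_fusion", "rlc_projection", "aux_encoder", "llm_layers"}
    else:
        raise ValueError("VLM-FO1 uses a two-stage schedule")
    return {m: (m in unfrozen) for m in MODULES}

# The VLM visual backbone is never unfrozen in either stage.
assert not trainable_modules(1)["vlm_backbone"]
assert not trainable_modules(2)["vlm_backbone"]
```

Keeping the backbone frozen is what lets the method add perception without risking the reasoning abilities the pretrained VLM already has.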
This “minimal‑change” approach achieves substantial fine‑grained perception gains without degrading the original VLM’s reasoning abilities.
Experimental Results
Across multiple fine‑grained benchmarks, VLM‑FO1 demonstrates consistent improvements:
COCO Detection: 44.4 mAP, surpassing generic VLMs and approaching specialized detectors.
COCOText OCR: 59% accuracy, 13 percentage points higher than the next best model.
HumanRef Referring Expression: 82.6 DF1, with scores above 80% across the RefCOCO series.
PixMo‑Count Counting: 86% accuracy, out‑performing larger multimodal models.
OpenCompass General Evaluation: score remains virtually unchanged (64.6 vs. 64.5), confirming that the enhancement does not hurt general multimodal abilities.
Qualitative visualizations (Images 1, 3‑6) illustrate precise bounding boxes, clear OCR results on dense text, and stable referring expression grounding even under occlusion.
Ablation Studies
Removing any sub‑module of HFRE (high‑resolution branch or fusion layer) leads to a marked performance drop, confirming the necessity of the dual‑encoder and multi‑scale fusion design for fine‑grained perception.
Conclusion
VLM‑FO1 delivers more than higher metrics; it establishes a reusable, plug‑and‑play multimodal enhancement paradigm that lets any pretrained VLM “see clearly and point accurately” without altering its core architecture. This represents a paradigm shift from pure generative output to a hybrid generate‑plus‑reference framework, enabling both high‑level reasoning and fine‑grained visual perception.
Key Takeaways
Transforming coordinate generation into region token referencing resolves the inherent mismatch between language‑centric VLMs and spatial tasks.
The modular design (OPN + HFRE + RLC) offers a low‑cost path for labs to upgrade existing VLMs.
Two‑stage training preserves original language abilities while endowing robust fine‑grained perception.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.