How VLM‑FO1 Turns Vision‑Language Models into Precise Perception Machines

VLM-FO1 introduces a generate-plus-reference paradigm that replaces coordinate generation with region-token referencing, adding plug-in modules (a proposal generator, a hybrid fine-grained region encoder, and a region-language connector) that give any pretrained vision-language model accurate, fine-grained perception while preserving its original capabilities.

Research Background

General vision-language models (VLMs) excel at multimodal understanding but often "see" without being able to pinpoint details: they can answer "what is in the image" yet struggle to say precisely "where". This stems from their generative architecture, which is natural for language output but ill-suited to predicting continuous coordinates.

VLMs understand images yet struggle with fine-grained localization, because generating continuous coordinates is not a natural task for a language-oriented model.

Even state‑of‑the‑art VLMs such as Qwen2.5‑VL‑72B achieve less than 40% mAP on COCO detection, far behind traditional detectors that easily surpass 60%.

Prior Attempts

Coordinate Quantization: maps continuous coordinates to discrete labels, turning regression into classification but incurring a noticeable loss of precision (see the sketch after this list).

Attached Detection Head: bolts a dedicated detection head onto the VLM to boost localization performance, but breaks end-to-end consistency.

Retraining from Scratch: jointly retrains visual and language branches, which is costly and hard to reproduce.

All of these approaches either sacrifice accuracy or compromise the model's generality.
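
To make the quantization idea concrete, here is a minimal sketch in the spirit of Pix2Seq-style coordinate tokenization (an illustration, not the method of any specific system): normalized box coordinates are binned into a fixed location vocabulary, and the round-trip error shows where the precision loss comes from. The bin count of 1000 is an assumed value.

```python
# Minimal sketch of coordinate quantization: continuous box coordinates are
# binned into a fixed vocabulary of location tokens, so regression becomes
# classification. The bin count and round-trip error are illustrative only.

def quantize_box(box, num_bins=1000):
    """Map normalized [0, 1] coordinates to discrete bin indices (location tokens)."""
    return [min(int(c * num_bins), num_bins - 1) for c in box]

def dequantize_box(tokens, num_bins=1000):
    """Map bin indices back to coordinates (bin centers)."""
    return [(t + 0.5) / num_bins for t in tokens]

box = [0.1234, 0.5678, 0.9012, 0.3456]          # normalized x1, y1, x2, y2
tokens = quantize_box(box)                       # e.g. [123, 567, 901, 345]
recovered = dequantize_box(tokens)
error = [abs(a - b) for a, b in zip(box, recovered)]
print(tokens, recovered, error)                  # the residual error is the precision loss
```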

Methodology: VLM‑FO1 Overview

The Om AI Lab team proposes VLM‑FO1, a "generate + reference" hybrid paradigm that leaves the VLM backbone untouched and injects new visual perception capability through modular plug‑in components.

Key Components

Object Proposal Network (OPN): an optional, plug-in proposal generator that supplies candidate foreground regions.

Hybrid Fine‑grained Region Encoder (HFRE): a dual‑encoder design consisting of:

Primary Encoder: reuses the original VLM visual backbone to produce semantic‑level features.

Auxiliary Encoder: processes high-resolution patches to capture texture, edges, and other fine-grained details.

Region–Language Connector (RLC): projects the concatenated visual features (with positional encoding) into the language embedding space, yielding <region0>, <region1>, … tokens that can be directly referenced during language generation.

Token-Based Referencing: replaces explicit coordinate generation with direct references to region tokens, eliminating cumulative decoding errors and enabling stable multi-object reasoning.
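
For intuition on the output side, the hypothetical sketch below shows what token-based referencing looks like in practice: the model's answer cites region tokens, and each token resolves back to the exact proposal box it came from, so no coordinates are ever decoded from generated text. The proposal boxes and the answer string are invented for illustration.

```python
import re

# Hypothetical sketch of the "generate + reference" output format: the model's
# text cites region tokens, which are looked up in the proposal list that
# produced them. Boxes and the output string below are made up.

proposals = {                       # proposal index -> candidate box (x1, y1, x2, y2)
    0: (34, 50, 210, 340),
    1: (250, 60, 470, 330),
    2: (120, 400, 300, 600),
}

output = "The person holding the umbrella is <region1>, and the dog is <region2>."

for idx in (int(m) for m in re.findall(r"<region(\d+)>", output)):
    # each reference resolves to an exact proposal box; no coordinates are parsed from text
    print(f"<region{idx}> -> box {proposals[idx]}")
```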

The overall architecture (see Image 2) shows the original VLM (blue dashed box) surrounded by the plug‑in modules, forming a seamless “plug‑and‑play” enhancement.
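
The sketch below illustrates how the fused region features might be projected into the language embedding space, assuming per-region features have already been pooled (e.g., via RoIAlign) from the primary and auxiliary encoders. All dimensions, layer shapes, and module names are assumptions for illustration, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the HFRE + region-language connector path. Per-region
# features are assumed to be already pooled from the two encoders; all sizes
# and names are illustrative.

D_SEM, D_DET, D_POS, D_LM = 1024, 768, 64, 4096   # assumed feature dimensions

class RegionLanguageConnector(nn.Module):
    def __init__(self):
        super().__init__()
        # two-layer MLP projecting the fused region features into the LM embedding space
        self.proj = nn.Sequential(
            nn.Linear(D_SEM + D_DET + D_POS, D_LM),
            nn.GELU(),
            nn.Linear(D_LM, D_LM),
        )

    def forward(self, sem_feats, det_feats, pos_enc):
        fused = torch.cat([sem_feats, det_feats, pos_enc], dim=-1)  # [N, D_SEM+D_DET+D_POS]
        return self.proj(fused)                                     # [N, D_LM] region token embeddings

N = 3                                    # number of proposal regions
sem_feats = torch.randn(N, D_SEM)        # semantic features from the VLM's own vision backbone
det_feats = torch.randn(N, D_DET)        # fine-grained features from the auxiliary high-res encoder
pos_enc = torch.randn(N, D_POS)          # positional encoding of each box
region_embeds = RegionLanguageConnector()(sem_feats, det_feats, pos_enc)
# region_embeds[k] is placed at the <regionk> placeholder in the LM input sequence
print(region_embeds.shape)               # torch.Size([3, 4096])
```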

Training Strategy

VLM-FO1 adopts a two-stage optimization that keeps the original visual backbone frozen:

Stage 1 – Region-Language Alignment: freeze the VLM and train the HFRE and the region-language connector so that region tokens align with the language model's embedding space.

Stage 2 – Perception SFT: unfreeze the auxiliary encoder and the language layers, then fine-tune on detection, OCR, counting, and referring-expression (RefExp) data while mixing in general multimodal corpora to prevent forgetting.
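
The schedule amounts to a simple parameter-freezing switch. The sketch below assumes a model object with submodules named after the components described above; the attribute names are purely illustrative.

```python
# Minimal sketch of the two-stage parameter schedule. The submodule names
# (hfre, region_language_connector, auxiliary_encoder, language_model) are
# assumptions for illustration, not the actual implementation's attributes.

def set_trainable(module, flag):
    """Enable or disable gradients for every parameter of a submodule."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage):
    set_trainable(model, False)                          # start from a fully frozen model
    if stage == 1:
        # Stage 1: only the region encoder and the region-language connector learn
        set_trainable(model.hfre, True)
        set_trainable(model.region_language_connector, True)
    elif stage == 2:
        # Stage 2: additionally unfreeze the auxiliary encoder and the language layers
        set_trainable(model.hfre, True)
        set_trainable(model.region_language_connector, True)
        set_trainable(model.auxiliary_encoder, True)
        set_trainable(model.language_model, True)
    # the original VLM vision backbone (primary encoder) stays frozen in both stages
```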

This “minimal‑change” approach achieves substantial fine‑grained perception gains without degrading the original VLM’s reasoning abilities.

Experimental Results

Across multiple fine‑grained benchmarks, VLM‑FO1 demonstrates consistent improvements:

COCO Detection: 44.4 mAP, surpassing generic VLMs and approaching specialized detectors.

COCOText OCR: 59% accuracy, 13 percentage points higher than the next best model.

HumanRef Referring Expression: 82.6% DF1, with accuracy exceeding 80% across the RefCOCO series.

PixMo-Count Counting: 86% accuracy, outperforming larger multimodal models.

OpenCompass General Evaluation: score remains virtually unchanged (64.6 vs. 64.5), confirming that the enhancement does not hurt general multimodal abilities.

Qualitative visualizations (Images 1, 3‑6) illustrate precise bounding boxes, clear OCR results on dense text, and stable referring expression grounding even under occlusion.

Ablation Studies

Removing any sub‑module of HFRE (high‑resolution branch or fusion layer) leads to a marked performance drop, confirming the necessity of the dual‑encoder and multi‑scale fusion design for fine‑grained perception.

Conclusion

VLM‑FO1 delivers more than higher metrics; it establishes a reusable, plug‑and‑play multimodal enhancement paradigm that lets any pretrained VLM “see clearly and point accurately” without altering its core architecture. This represents a paradigm shift from pure generative output to a hybrid generate‑plus‑reference framework, enabling both high‑level reasoning and fine‑grained visual perception.

Key Takeaways

Transforming coordinate generation into region token referencing resolves the inherent mismatch between language‑centric VLMs and spatial tasks.

The modular design (OPN + HFRE + RLC) offers a low‑cost path for labs to upgrade existing VLMs.

Two-stage training preserves the original language abilities while adding robust fine-grained perception.
