How Visual‑RFT Extends Reinforcement Fine‑Tuning to Multimodal Models

Visual‑RFT introduces a reinforcement‑fine‑tuning paradigm for large multimodal models, extending rule‑based reward strategies from text‑only LLMs to visual‑language tasks such as detection and grounding, and demonstrates strong few‑shot performance gains over traditional supervised fine‑tuning across multiple benchmarks.

AI Frontier Lectures

Background

Visual‑RFT (Visual Reinforcement Fine‑Tuning) extends the rule‑based reinforcement learning paradigm used in OpenAI’s o1 and DeepSeek‑R1 to large visual‑language models (LVLMs). By defining verifiable rewards for visual tasks such as fine‑grained classification and object detection, the method enables LVLMs to benefit from reinforcement fine‑tuning with only a few annotated examples.

Method

The framework augments a base LVLM (e.g., Qwen2‑VL 2B/7B) with a two‑stage generation process:

The model first produces a think step that generates intermediate reasoning.

It then outputs the final answer.
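The two‑stage output can be checked mechanically before any reward is computed. The following is a minimal sketch that assumes the reasoning and the answer are wrapped in `<think>` and `<answer>` tags, as in DeepSeek‑R1‑style prompts; the tag names and the helper `split_response` are illustrative assumptions, not taken from the text above.

```python
import re
from typing import Optional, Tuple

# Assumed layout: <think>...</think> followed by <answer>...</answer>.
# The tag names are an assumption made for this sketch.
_RESPONSE_PATTERN = re.compile(
    r"^\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*$",
    re.DOTALL,
)


def split_response(text: str) -> Optional[Tuple[str, str]]:
    """Return (reasoning, answer) if the text follows the two-stage layout, else None."""
    match = _RESPONSE_PATTERN.match(text)
    if match is None:
        return None
    return match.group("think").strip(), match.group("answer").strip()


sample = (
    "<think>The character is yellow with a lightning-shaped tail.</think>"
    "<answer>Pikachu</answer>"
)
print(split_response(sample))
```

A response that skips the reasoning stage fails the match and yields `None`, which a training loop could treat as a malformed sample.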

During training, reinforcement learning (e.g., the GRPO algorithm) updates model parameters based on two verified reward types:

IoU‑based reward: for detection and grounding tasks, the reward is proportional to the Intersection‑over‑Union between predicted and ground‑truth bounding boxes.

Classification reward: for fine‑grained classification, a positive reward is given when the predicted class label matches the annotation.
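Both reward types are simple enough to write as rule‑based functions. The sketch below assumes boxes in `(x1, y1, x2, y2)` corner format, a proportionality constant of 1 for the IoU reward, and a case‑insensitive exact match for class labels; these choices and the function names are illustrative assumptions rather than the released implementation.

```python
from typing import Sequence

Box = Sequence[float]  # assumed format: (x1, y1, x2, y2) with x1 <= x2, y1 <= y2


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_reward(pred_box: Box, gt_box: Box) -> float:
    """Reward proportional to IoU (constant of 1 assumed here)."""
    return iou(pred_box, gt_box)


def classification_reward(pred_label: str, gt_label: str) -> float:
    """1.0 when the labels match (case-insensitive match is an assumption), else 0.0."""
    return 1.0 if pred_label.strip().lower() == gt_label.strip().lower() else 0.0


print(iou_reward((0, 0, 2, 2), (1, 1, 3, 3)))
print(classification_reward("Pikachu", "pikachu"))
```

Because both functions depend only on the prediction and the annotation, the reward can be verified without a learned reward model.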

Both rewards are computed on a small set of labeled samples (10–1,000 examples) and used to fine‑tune the LVLM. The think step forces the model to reason before answering, which improves accuracy and interpretability.
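To make the role of GRPO more concrete, the sketch below shows only its group‑relative advantage step: several responses are sampled for the same input, each is scored by a verifiable reward, and the rewards are normalised within the group. The use of the population standard deviation, the `eps` term, and the function name are assumptions made for this illustration; the policy update itself (clipped probability ratio, KL penalty) is omitted.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and eps are assumptions of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical IoU rewards for four sampled responses to one image/query pair.
group_rewards = [0.10, 0.55, 0.80, 0.35]
print(group_relative_advantages(group_rewards))
```

Responses that score above the group average receive positive advantages and are reinforced, while below‑average ones are discouraged. If every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.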

Experiments

Visual‑RFT was evaluated on several visual perception benchmarks, including:

Open‑vocabulary object detection

Few‑shot detection

Fine‑grained classification

Reasoning grounding (localising objects after a reasoning step)

Across all settings, Visual‑RFT consistently outperformed standard Supervised Fine‑Tuning (SFT). Notable observations:

Significant performance gains in open‑vocabulary and few‑shot detection with only a handful of training examples.

Improved grounding accuracy, where the model correctly identifies and localises objects after the think phase.

Figures illustrate performance curves, qualitative examples, and the reward‑based training pipeline.

Visual‑RFT reasoning example with Pokémon
Performance comparison – Visual‑RFT vs. SFT
Framework diagram – IoU and classification rewards with GRPO

Results

The experiments demonstrate that Visual‑RFT achieves strong few‑shot learning ability and better generalisation than SFT, even on niche domains such as cartoon characters collected from the web. By requiring far fewer annotated samples while delivering higher accuracy, Visual‑RFT establishes a new fine‑tuning paradigm for multimodal models.

Resources

Paper: https://arxiv.org/abs/2503.01785

Code repository: https://github.com/Liuziyu77/Visual-RFT
