How Visual‑RFT Extends Reinforcement Fine‑Tuning to Multimodal Models
Visual‑RFT brings reinforcement fine‑tuning to large multimodal models by extending the rule‑based reward strategies used for text‑only LLMs to visual‑language tasks such as detection and grounding. In few‑shot settings it shows strong gains over traditional supervised fine‑tuning across multiple benchmarks.
Background
Visual‑RFT (Visual Reinforcement Fine‑Tuning) extends the rule‑based reinforcement learning paradigm used in OpenAI’s o1 and DeepSeek‑R1 to large visual‑language models (LVLMs). By defining verifiable rewards for visual tasks such as fine‑grained classification and object detection, the method enables LVLMs to benefit from reinforcement fine‑tuning with only a few annotated examples.
Method
The framework augments a base LVLM (e.g., Qwen2‑VL 2B/7B) with a two‑stage generation process:
The model first produces a think step containing its intermediate reasoning.
It then outputs the final answer.
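As a rough illustration of the two‑stage output, a response can be split into its reasoning and its final answer before any reward is computed. This is a minimal sketch, not the authors' code: the `<think>` and `<answer>` tag names and the `split_response` helper are assumptions made for illustration.

```python
import re

# Assumed output template: <think>...</think> followed by <answer>...</answer>.
# The tag names are an illustrative assumption, not taken from the article.
_TEMPLATE = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$",
    re.DOTALL,
)


def split_response(text: str):
    """Return (reasoning, answer), or None if the text does not follow the template."""
    match = _TEMPLATE.match(text)
    if match is None:
        return None
    reasoning, answer = match.groups()
    return reasoning.strip(), answer.strip()


if __name__ == "__main__":
    sample = "<think>The bird has a red crest.</think><answer>cardinal</answer>"
    print(split_response(sample))
```

Returning `None` for malformed responses lets the caller decide how to score them, for example by assigning no reward.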
During training, a reinforcement learning algorithm (e.g., GRPO) updates the model parameters based on two verifiable reward types:
IoU‑based reward: for detection and grounding tasks, the reward is proportional to the Intersection‑over‑Union between predicted and ground‑truth bounding boxes.
Classification reward: for fine‑grained classification, a positive reward is given when the predicted class label matches the annotation.
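The two reward types can be sketched as plain functions. The version below is a simplified illustration under stated assumptions, not the reference implementation: boxes are assumed to be `(x1, y1, x2, y2)` corner coordinates, the proportionality constant for the IoU reward is taken to be 1, and label matching is assumed to be case‑insensitive. The function names are hypothetical.

```python
from typing import Sequence

Box = Sequence[float]  # assumed layout: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2


def iou(pred: Box, gt: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def iou_reward(pred: Box, gt: Box) -> float:
    """Reward proportional to IoU (constant of 1 assumed here)."""
    return iou(pred, gt)


def classification_reward(pred_label: str, gt_label: str) -> float:
    """1.0 when the predicted label matches the annotation, else 0.0.

    Case-insensitive, whitespace-trimmed matching is an assumption of this sketch.
    """
    return 1.0 if pred_label.strip().lower() == gt_label.strip().lower() else 0.0


if __name__ == "__main__":
    print(iou_reward((0, 0, 2, 2), (1, 1, 3, 3)))
    print(classification_reward("Cardinal", "cardinal"))
```

Both functions depend only on the model output and the annotation, which is what makes the rewards verifiable without a learned reward model.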
Both rewards are computed on a small set of labeled samples (10–1,000 examples) and used to fine‑tune the LVLM. The think step forces the model to reason before answering, which improves accuracy and interpretability.
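To give a sense of how such rewards could drive a GRPO‑style update, the sketch below turns the rewards of several responses sampled for the same prompt into group‑relative advantages by normalizing against the group mean and standard deviation. It covers only this normalization step, not the full policy update; the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions of this example.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that score above the group average receive a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four responses sampled for one image/question pair.
    print(group_relative_advantages([0.9, 0.2, 0.6, 0.3]))
```

When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.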
Experiments
Visual‑RFT was evaluated on several visual perception benchmarks, including:
Open‑vocabulary object detection
Few‑shot detection
Fine‑grained classification
Reasoning grounding (localising objects after a reasoning step)
Across all settings, Visual‑RFT consistently outperformed standard Supervised Fine‑Tuning (SFT). Notable observations:
Significant performance gains in open‑vocabulary and few‑shot detection with only a handful of training examples.
Improved grounding accuracy, where the model correctly identifies and localises objects after the think phase.
The paper's figures illustrate the performance curves, qualitative examples, and the reward‑based training pipeline.
Results
The experiments demonstrate that Visual‑RFT achieves strong few‑shot learning ability and better generalisation than SFT, even on niche domains such as cartoon characters collected from the web. By requiring far fewer annotated samples while delivering higher accuracy, Visual‑RFT establishes a new fine‑tuning paradigm for multimodal models.
Resources
Paper: https://arxiv.org/abs/2503.01785
Code repository: https://github.com/Liuziyu77/Visual-RFT
