Boosting Visual Reasoning in VLMs with Reinforcement Learning

This article analyzes how reinforcement learning, which transformed LLM reasoning in DeepSeek, can be applied to vision-language models (VLMs) to overcome the limitations of traditional chain-of-thought prompting and supervised fine-tuning, presenting concrete reward designs, training pipelines, and a critical assessment of their strengths and weaknesses.

Before DeepSeek, large language models (LLMs) struggled with complex reasoning tasks. DeepSeek's use of reinforcement learning (RL) dramatically improved textual reasoning, raising the question of whether similar gains can be achieved for vision-language models (VLMs).

What is visual reasoning? It is the ability to answer complex image-related questions by reasoning explicitly about visual content. The desired process mirrors a four-stage chain of thought (CoT): summary, caption, reasoning, and conclusion.

The article outlines two non‑RL approaches for inducing CoT in VLMs:

Direct prompting (CoT): Convert a single prompt into a sequence of sub-prompts that force the model to follow the four-step CoT format. This method is simple and requires no training, but it does not change the model's intrinsic abilities.
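As a rough illustration, here is what stage-by-stage prompting could look like in Python. The stage wording is an assumption rather than taken from the article, and `query_vlm` is a hypothetical stand-in for whatever inference API the model exposes:

```python
# Minimal sketch of four-stage CoT prompting for a VLM.
# `query_vlm` is hypothetical; substitute your model's real API.

STAGES = [
    "SUMMARY: Briefly restate what the question is asking.",
    "CAPTION: Describe the parts of the image relevant to the question.",
    "REASONING: Reason step by step from the caption toward an answer.",
    "CONCLUSION: State the final answer in one sentence.",
]

def query_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical inference call; replace with your model's API."""
    raise NotImplementedError

def staged_cot(image_path: str, question: str) -> str:
    """Run the four CoT stages in order, feeding each stage's output
    back into the context so later stages build on earlier ones."""
    context = f"Question: {question}\n"
    for stage in STAGES:
        context += f"\n{stage}\n"
        context += query_vlm(image_path, context)
    return context
```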

CoT-SFT: Supervised fine-tuning (SFT) on a dataset of image-question-answer triples where each answer follows the CoT format. This can embed the reasoning pattern into the model, yet it suffers from limited generalization and reliance on the quality of the synthetic data.
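For concreteness, a single CoT-SFT training sample might look like the following; the field names, file path, and stage tags are illustrative assumptions, since exact formats vary across datasets:

```python
# One illustrative image-question-answer triple in CoT format.
# Tag names and paths are assumptions for illustration only.
sample = {
    "image": "images/park_0001.jpg",
    "question": "What animal is running across the grass?",
    "answer": (
        "<SUMMARY>The question asks which animal appears in the scene.</SUMMARY>"
        "<CAPTION>A grassy park with a four-legged animal in mid-stride.</CAPTION>"
        "<REASONING>The size, fur, and gait match a dog rather than a cat "
        "or a fox.</REASONING>"
        "<CONCLUSION>A dog.</CONCLUSION>"
    ),
}
```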

Recent work (LLaVA‑o1, arXiv:2411.10440) applied CoT‑SFT, achieving performance above most comparable models but still lagging behind GPT‑4o and inheriting several drawbacks:

Dependence on GPT-4o-generated data places a ceiling on attainable performance.

SFT tends to memorize training samples rather than generalize.

Hard‑coded reasoning steps may be suboptimal for some questions.

RL‑based visual reasoning

The proposed RL pipeline defines a reward function with two components (a code sketch follows the list):

Correctness reward: +1 when the model's answer matches the ground-truth label (e.g., "dog").

Format reward: an additional reward when the model outputs its reasoning inside <think>…</think> tags and the final answer inside <answer>…</answer> tags, encouraging explicit step-by-step thinking.
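A minimal sketch of the two reward terms, assuming exact string matching for correctness and a 0.5 format bonus (the article specifies neither the matching rule nor the bonus weight):

```python
import re

def correctness_reward(completion: str, ground_truth: str) -> float:
    """+1 when the text inside <answer>...</answer> matches the label."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m and m.group(1).strip().lower() == ground_truth.strip().lower():
        return 1.0
    return 0.0

def format_reward(completion: str) -> float:
    """Bonus when the output is exactly <think>...</think><answer>...</answer>;
    the 0.5 weight is an assumption."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 0.5 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    return correctness_reward(completion, ground_truth) + format_reward(completion)
```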

Training samples consist of an image, a descriptive title, and a question whose answer is the title. The RL agent optimizes the model's parameters to maximize the combined reward, allowing the model to discover its own reasoning strategies rather than following a hard-coded script.
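The article does not name the RL algorithm, but DeepSeek-style pipelines typically use GRPO, which samples a group of completions per prompt and normalizes each completion's reward within its group. A minimal sketch under that assumption:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each completion's reward by the
    mean and standard deviation of its sampling group, so completions
    that beat their siblings receive a positive advantage."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero std
    return [(r - mean) / std for r in rewards]

# Usage with the reward sketch above, for one image-question pair
# with ground-truth answer "dog" and several sampled completions:
# rewards = [total_reward(c, "dog") for c in completions]
# advantages = group_advantages(rewards)  # weights the policy-gradient update
```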

Advantages of the RL approach include:

No reliance on noisy AI-generated data; the reward provides a clean, task-specific training signal.

Better generalization because the model learns to maximize a well‑defined objective rather than memorizing examples.

Flexibility to adapt reasoning strategies to diverse problems, similar to human learning.

In contrast, SFT‑based methods are limited by data quality and lack of adaptability. The article concludes that applying RL to VLMs holds significant research potential for improving both generalization and efficiency of visual reasoning.

Tags: LLM, chain of thought, reinforcement learning, vision-language models, RL training
Written by AI Algorithm Path

A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.