Boosting Visual Reasoning in VLMs with Reinforcement Learning

This article analyzes how reinforcement learning, which transformed LLM reasoning in DeepSeek, can be applied to vision-language models (VLMs) to overcome the limitations of traditional chain-of-thought prompting and supervised fine-tuning, presenting concrete reward designs, training pipelines, and a critical assessment of their strengths and weaknesses.

Before DeepSeek, large language models (LLMs) struggled with complex reasoning tasks. DeepSeek's use of reinforcement learning (RL) dramatically improved textual reasoning, raising the question of whether similar gains can be achieved for vision-language models (VLMs).

What is visual reasoning? It is the ability to answer complex image-related questions by reasoning explicitly about visual content. The desired process mirrors a four-stage chain of thought (CoT): summary, caption, reasoning, and conclusion.

The article outlines two non‑RL approaches for inducing CoT in VLMs:

Direct prompting (CoT): Convert a single prompt into a sequence of sub-prompts that force the model to follow the four-step CoT format. This method is simple and requires no training, but it does not change the model's intrinsic abilities.
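As a rough illustration, here is what stage-by-stage prompting could look like in Python. The stage wording is an assumption rather than taken from the article, and `query_vlm` is a hypothetical stand-in for whatever inference API the model exposes:

```python
# Minimal sketch of four-stage CoT prompting for a VLM.
# `query_vlm` is hypothetical; substitute your model's real API.

STAGES = [
    "SUMMARY: Briefly restate what the question is asking.",
    "CAPTION: Describe the parts of the image relevant to the question.",
    "REASONING: Reason step by step from the caption toward an answer.",
    "CONCLUSION: State the final answer in one sentence.",
]

def query_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical inference call; replace with your model's API."""
    raise NotImplementedError

def staged_cot(image_path: str, question: str) -> str:
    """Run the four CoT stages in order, feeding each stage's output
    back into the context so later stages build on earlier ones."""
    context = f"Question: {question}\n"
    for stage in STAGES:
        context += f"\n{stage}\n"
        context += query_vlm(image_path, context)
    return context
```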

CoT-SFT: Supervised fine-tuning (SFT) on a dataset of image-question-answer triples where each answer follows the CoT format. This can embed the reasoning pattern into the model, yet it suffers from limited generalization and reliance on the quality of the synthetic data.
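For concreteness, a single CoT-SFT training sample might look like the following; the field names, file path, and stage tags are illustrative assumptions, since exact formats vary across datasets:

```python
# One illustrative image-question-answer triple in CoT format.
# Tag names and paths are assumptions for illustration only.
sample = {
    "image": "images/park_0001.jpg",
    "question": "What animal is running across the grass?",
    "answer": (
        "<SUMMARY>The question asks which animal appears in the scene.</SUMMARY>"
        "<CAPTION>A grassy park with a four-legged animal in mid-stride.</CAPTION>"
        "<REASONING>The size, fur, and gait match a dog rather than a cat "
        "or a fox.</REASONING>"
        "<CONCLUSION>A dog.</CONCLUSION>"
    ),
}
```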

Recent work (LLaVA‑o1, arXiv:2411.10440) applied CoT‑SFT, achieving performance above most comparable models but still lagging behind GPT‑4o and inheriting several drawbacks:

Dependence on GPT-4o-generated data places a ceiling on attainable performance.

SFT tends to memorize training samples rather than generalize.

Hard‑coded reasoning steps may be suboptimal for some questions.

RL‑based visual reasoning

The proposed RL pipeline defines a reward function with two components (a code sketch follows the list):

Correctness reward: +1 when the model's answer matches the ground-truth label (e.g., "dog").

Format reward: an additional reward when the model outputs its reasoning inside <think>…</think> tags and the final answer inside <answer>…</answer> tags, encouraging explicit step-by-step thinking.
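A minimal sketch of the two reward terms, assuming exact string matching for correctness and a 0.5 format bonus (the article specifies neither the matching rule nor the bonus weight):

```python
import re

def correctness_reward(completion: str, ground_truth: str) -> float:
    """+1 when the text inside <answer>...</answer> matches the label."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m and m.group(1).strip().lower() == ground_truth.strip().lower():
        return 1.0
    return 0.0

def format_reward(completion: str) -> float:
    """Bonus when the output is exactly <think>...</think><answer>...</answer>;
    the 0.5 weight is an assumption."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 0.5 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    return correctness_reward(completion, ground_truth) + format_reward(completion)
```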

Training samples consist of an image, a descriptive title, and a question whose answer is the title. The RL agent optimizes the model's parameters to maximize the combined reward, allowing the model to discover its own reasoning strategies rather than following a hard-coded script.
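The article does not name the RL algorithm, but DeepSeek-style pipelines typically use GRPO, which samples a group of completions per prompt and normalizes each completion's reward within its group. A minimal sketch under that assumption:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each completion's reward by the
    mean and standard deviation of its sampling group, so completions
    that beat their siblings receive a positive advantage."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero std
    return [(r - mean) / std for r in rewards]

# Usage with the reward sketch above, for one image-question pair
# with ground-truth answer "dog" and several sampled completions:
# rewards = [total_reward(c, "dog") for c in completions]
# advantages = group_advantages(rewards)  # weights the policy-gradient update
```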

Advantages of the RL approach include:

No reliance on noisy AI-generated data; the reward provides a clean, task-specific training signal.

Better generalization because the model learns to maximize a well‑defined objective rather than memorizing examples.

Flexibility to adapt reasoning strategies to diverse problems, similar to human learning.

In contrast, SFT‑based methods are limited by data quality and lack of adaptability. The article concludes that applying RL to VLMs holds significant research potential for improving both generalization and efficiency of visual reasoning.

Tags: LLM, chain of thought, reinforcement learning, vision-language models, RL training
Written by AI Algorithm Path

A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.