Can a Single LLM Both See and Reason? Exploring Visual Reasoning Models (VRM)

This article examines the limitations of current vision‑language and reasoning models, proposes a visual reasoning model (VRM) that can process images and perform deep logical inference, and discusses architecture, training methods, reinforcement‑learning reward designs, and practical challenges.


1 An LLM that can both see and reason?

DeepSeek‑R1 is a large reasoning model (LRM) that handles only text, while GPT‑4o is a vision‑language model (VLM) that accepts both image and text inputs. This article proposes a visual reasoning model (VRM) that combines the two capabilities.

[Figure: DeepSeek‑R1 vs GPT‑4o]

2 Problems with existing models

Current VLMs excel at multimodal perception but lack strong reasoning, whereas LRMs can reason deeply but cannot process visual data. A model that both understands images and performs deep reasoning is needed.

Physics problem example

A student asks a physics question and attaches an image. To answer it, the model must do two things:

Understand the image content.

Perform deep reasoning (analyze the problem, evaluate candidate answers, and consider alternatives).

[Figure: Student question with image]

3 VLM architecture

Typical VLMs such as LLaVA use a CLIP visual encoder to turn an image into visual features, followed by a trainable linear projection that maps those features into the LLM's embedding space. The resulting visual hidden state Hv is concatenated with the text token embeddings and fed into a Transformer‑based LLM (e.g., Vicuna).
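In the notation of the LLaVA paper, with Xv the input image and g the CLIP visual encoder, the projection step can be written as:

```latex
Z_v = g(X_v), \qquad H_v = W \cdot Z_v
```

where W is the trainable projection matrix and Hv are the visual tokens handed to the LLM alongside the text embeddings.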

[Figure: LLM token prediction illustration]
[Figure: VLM image‑text joint prediction]

4 How a VLM processes image input

The core idea is to transform raw image data into a format the LLM can consume. CLIP encodes the image into visual features; a linear layer projects them to the same dimensionality as the text embeddings, after which the resulting Hv and the text embeddings are concatenated and processed by the Transformer.
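A minimal PyTorch sketch of this pipeline is shown below. The dimensions (576 visual tokens, a 1024‑dim CLIP feature, a 4096‑dim LLM hidden size) are illustrative assumptions, and the random tensors stand in for real encoder and embedding outputs.

```python
# Minimal sketch of a LLaVA-style image pipeline (illustrative shapes only).
import torch
import torch.nn as nn

d_clip, d_llm = 1024, 4096            # assumed CLIP feature dim and LLM hidden dim
num_patches, num_text_tokens = 576, 32

# 1. CLIP visual encoder output: one feature vector per image patch (encoder frozen).
Zv = torch.randn(1, num_patches, d_clip)

# 2. Trainable linear projection W maps visual features into the LLM embedding space.
W = nn.Linear(d_clip, d_llm, bias=False)
Hv = W(Zv)                            # shape: (1, 576, 4096)

# 3. Text token embeddings from the LLM's own embedding table (faked here).
Ht = torch.randn(1, num_text_tokens, d_llm)

# 4. Concatenate visual and text tokens and feed the sequence to the Transformer LLM.
llm_input = torch.cat([Hv, Ht], dim=1)
print(llm_input.shape)                # torch.Size([1, 608, 4096])
```

The only new trainable piece here is the linear projection; everything downstream treats the projected visual tokens exactly like ordinary text tokens.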

[Figure: Image encoding pipeline]

5 Training a VLM

LLaVA adopts end‑to‑end fine‑tuning. During training the CLIP encoder is usually frozen, while the linear projection W and the LLM parameters ϕ are updated.

End‑to‑end fine‑tuning optimizes the whole pipeline jointly, back‑propagating through the LLM and the projection layer and updating all trainable parameters at once.
[Figure: End‑to‑end fine‑tuning diagram]
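Below is a minimal sketch of this training setup under assumed shapes; the three Linear layers are stand‑ins for the real CLIP encoder, projection W, and LLM, and only W and ϕ receive gradient updates.

```python
# Minimal sketch of the fine-tuning setup: CLIP is frozen, W and phi are updated.
import torch
import torch.nn as nn

clip_encoder = nn.Linear(768, 1024)    # stand-in for the frozen CLIP visual encoder
projection_w = nn.Linear(1024, 4096)   # trainable projection W
llm_phi = nn.Linear(4096, 32000)       # stand-in for the LLM parameters phi

# Freeze the visual encoder; only W and phi receive gradient updates.
for p in clip_encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    list(projection_w.parameters()) + list(llm_phi.parameters()), lr=2e-5
)

# One illustrative training step with a next-token (cross-entropy) loss.
image_feats = torch.randn(4, 768)                 # fake batch of image inputs
targets = torch.randint(0, 32000, (4,))           # fake next-token targets
logits = llm_phi(projection_w(clip_encoder(image_feats)))
loss = nn.functional.cross_entropy(logits, targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```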

6 Can reinforcement learning train a VLM?

Reinforcement learning (RL) has already improved LLM behavior and reasoning (e.g., RLHF for ChatGPT/GPT‑4, and rule‑based reward RL for DeepSeek‑R1). This article explores applying RL to VLMs, designing reward functions that encourage both correct answers and structured output.

6.1 Task definition: image classification

The model should output the correct class label given an image.

[Figure: Image classification example]

Reward design

Correctness reward: +1 when the predicted label matches the ground truth (e.g., "dog").

Format reward: an extra reward when the model produces a structured response, wrapping its reasoning in <think>…</think> followed by the final answer in <answer>…</answer> (see the sketch below).
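A minimal sketch of such a reward function follows; the reward weights and tag format are assumptions for illustration, not the exact scheme of any published model.

```python
# Minimal sketch of the two rewards above; weights and tag format are assumptions.
import re

def compute_reward(model_output: str, ground_truth: str) -> float:
    reward = 0.0

    # Format reward: a <think>...</think> block followed by an <answer>...</answer> block.
    match = re.search(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>",
                      model_output, re.DOTALL)
    if match:
        reward += 0.5                              # assumed format-reward weight

    # Correctness reward: +1 when the predicted label matches the ground truth.
    predicted = (match.group(2) if match else model_output).strip().lower()
    if predicted == ground_truth.strip().lower():
        reward += 1.0

    return reward

# A structured, correct response earns both rewards.
output = "<think>Fur, a snout, floppy ears: this is a dog.</think><answer>dog</answer>"
print(compute_reward(output, "dog"))               # 1.5
```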

[Figure: Reward design illustration]

7 Practical applications

Current VLMs still underperform on math and science questions. For example, GPT‑4o gave an incorrect answer to a physics problem that required visual reasoning. A stronger VRM could potentially solve such tasks correctly.

[Figure: GPT‑4o wrong answer example]
[Figure: Desired correct answer]

Building a visual reasoning model that unifies perception and deep logical inference is a promising direction, though challenges remain in designing effective RL rewards and training VLMs robustly.

Tags: Artificial Intelligence, deep learning, LLM, reinforcement learning, Visual Reasoning, Vision Language Model
Written by JavaEdge

First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.
