Can Merging Text‑Only and Grounded Visual Reasoning Unlock Better Vision‑Language Models?
The paper introduces Mixture-of-Visual-Thoughts (MoVT), a context-adaptive reasoning paradigm that integrates pure-text and visually-grounded inference modes within a single model. To realize it, the authors present the two-stage AdaVaR training framework, built around a novel AdaGRPO reinforcement-learning algorithm that teaches the model to select the more suitable mode for each vision-language task. The approach achieves consistent gains across eight benchmarks and surpasses strong baselines, including GPT-4o.
Background
Current large vision‑language models (LVLMs) employ either a pure‑text reasoning style, mirroring large language models (LLMs), or a visually‑grounded approach that aligns reasoning steps with image regions. Each mode excels in different domains—text‑only reasoning is strong on abstract, mathematical problems, while grounded reasoning better handles visual search and object‑centric tasks—but existing work focuses on a single mode and cannot exploit their complementary strengths.
Mixture‑of‑Visual‑Thoughts (MoVT)
MoVT proposes a context-adaptive reasoning paradigm that unifies both modes inside one model and lets the model choose the most suitable mode for each query (a minimal sketch of the prefix mechanism follows this list). The key ideas are:
Introduce special <text> and <ground> prefix tokens at the start of the generation to signal the desired reasoning mode.
Use supervised fine‑tuning (SFT) on data for each mode to teach the model both reasoning styles.
Apply a reinforcement‑learning stage (AdaVaR) that encourages the model to select the better mode for a given problem.
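To make the prefix mechanism concrete, here is a minimal sketch of prefix-guided decoding with a Hugging Face causal LM. The model name (a small text-only stand-in), the prompt format, and the generate_with_mode helper are illustrative assumptions rather than the paper's code; the actual system builds on Qwen2.5-VL, which additionally consumes image inputs through its processor, and the mode tokens only become meaningful after fine-tuning.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # stand-in; the paper builds on Qwen2.5-VL
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the two mode tokens and grow the embedding table to match.
# Their embeddings are untrained here; SFT/RL would give them meaning.
tokenizer.add_special_tokens({"additional_special_tokens": ["<text>", "<ground>"]})
model.resize_token_embeddings(len(tokenizer))

def generate_with_mode(question, mode=None):
    """Seed generation with a mode prefix, or leave mode=None to let the
    (fine-tuned) model emit its own prefix -- the adaptive case."""
    prompt = question if mode is None else question + mode
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:])

# During SFT, targets would likewise begin with the gold mode token, e.g.
# "<ground>" + grounded reasoning trace + final answer.
print(generate_with_mode("How many red mugs are on the shelf?", mode="<ground>"))
print(generate_with_mode("What is the area of the shaded triangle?"))
```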
AdaVaR Learning Framework
The AdaVaR framework consists of two stages:
Prefix-guided SFT: The model learns to generate the appropriate prefix token (<text> or <ground>) and then follow the corresponding reasoning process.
AdaGRPO reinforcement learning: For each question the model generates n rollouts in text mode and n rollouts in grounded mode, with rewards computed from answer correctness. Two advantage signals are defined: a rollout-level advantage A_i, which improves overall reasoning quality, and mode-wise advantages A_t and A_v, which compare the relative success rates of the two modes and guide the model to prefer the better one (see the sketch after this list).
Prefix tokens receive the mode‑wise advantage, while the reasoning tokens receive the rollout‑level advantage, enabling the model to learn both better reasoning and mode selection.
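The advantage computation can be sketched as follows. This is a minimal reconstruction that assumes GRPO-style group normalization for the rollout-level signal and mean-difference mode-wise signals; the function name and exact normalization are assumptions, and the paper's formulas may differ in detail.

```python
import numpy as np

def adagrpo_advantages(text_rewards, ground_rewards, eps=1e-6):
    """Sketch of AdaGRPO's two advantage signals for one question.

    text_rewards, ground_rewards: per-rollout correctness rewards (0/1)
    for n rollouts in each mode.
    """
    rewards = np.concatenate([text_rewards, ground_rewards])

    # Rollout-level advantage A_i: group-normalized reward, as in GRPO.
    # Applied to the reasoning tokens of each rollout.
    A_i = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Mode-wise advantages A_t, A_v: compare each mode's mean success
    # rate to the overall mean, steering the prefix toward the better mode.
    A_t = text_rewards.mean() - rewards.mean()    # credited to <text>
    A_v = ground_rewards.mean() - rewards.mean()  # credited to <ground>

    return A_i, A_t, A_v

# Example: text mode solves 3/4 rollouts, grounded mode only 1/4.
A_i, A_t, A_v = adagrpo_advantages(np.array([1., 1., 1., 0.]),
                                   np.array([0., 1., 0., 0.]))
print(A_t > A_v)  # True: the <text> prefix is reinforced for this question
```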
Experimental Setup
Models based on Qwen2.5-VL-3B and Qwen2.5-VL-7B were fine-tuned with AdaVaR (AdaVaR-3B/7B). Evaluation covered eight benchmarks spanning mathematical visual reasoning (MathVista, WeMath, MathVision), visual search (V*), and multimodal perception (POPE, MMStar). Baselines included single-mode SFT/RL models and strong existing LVLMs.
Results
Across all tasks, AdaVaR-3B and AdaVaR-7B consistently outperformed the base Qwen2.5-VL models and other single-mode baselines. Notably, AdaVaR-7B surpassed GPT-4o on average accuracy, and AdaVaR-3B matched the performance of the much larger Qwen2.5-VL-7B despite having only 3B parameters. Detailed analysis showed:
Text‑only models excel on abstract math but suffer from hallucinations on visual search.
Grounded models reduce hallucinations and excel on object‑centric tasks but lag on abstract reasoning.
The adaptive MoVT approach leverages the strengths of both, achieving higher upper‑bound performance than either mode alone.
In‑Depth Analysis
Key questions investigated include:
Can a single model learn both modes? Yes—AdaVaR‑3B/7B demonstrate that integrating the two does not inhibit either mode.
Is an explicit mode prefix necessary? Experiments removing the prefix (the Mix-SFT-RL baseline) performed worse, confirming the importance of explicit mode signals.
Does the model learn sensible mode‑selection? After RL, the model prefers text mode on math tasks and grounded mode on visual search, reflecting learned preferences.
Training curves reveal three phases: early exploration with noisy mode choices, a stabilization phase where the dominant mode emerges, and a fine‑tuning phase where both modes improve and the adaptive policy surpasses each single mode.
Conclusion and Future Work
MoVT demonstrates that integrating multiple reasoning styles within a unified LVLM is a viable path toward general visual reasoning. AdaGRPO effectively teaches the model to select the appropriate mode. Future directions include adding more diverse reasoning modes (e.g., tool‑use, long‑vs‑short thinking), improving exploration‑exploitation balance when many modes are present, and exploring sequential mode‑switching strategies.