AI Frontier Lectures
Feb 6, 2026 · Artificial Intelligence
Can Merging Text‑Only and Grounded Visual Reasoning Unlock Better Vision‑Language Models?
The paper introduces Mixture‑of‑Visual‑Thoughts (MoVT), a context‑adaptive reasoning paradigm that integrates text‑only and visually grounded reasoning modes within a single model. It also presents AdaVaR, a two‑stage training framework whose novel AdaGRPO reinforcement‑learning algorithm teaches the model to select the appropriate mode automatically for each vision‑language task. The approach achieves consistent gains across eight benchmarks and surpasses strong baselines, including GPT‑4o.
AdaVaR · Mixture-of-Visual-Thoughts · Visual Reasoning
16 min read
