Monet Enables Multimodal Models to Perform Human‑like Abstract Visual Thinking
Monet introduces a training paradigm that lets multimodal large language models reason directly in a continuous latent visual space, replacing external tool calls with implicit visual embeddings. Through a three‑stage training recipe of supervised fine‑tuning plus a novel visual‑latent policy optimization, it demonstrates significant gains on both in‑distribution perception tasks and out‑of‑distribution abstract visual reasoning.
Background and motivation – The prevailing “think with images” paradigm augments multimodal large language models (MLLMs) with external tools or code generation to insert auxiliary images (e.g., cropping, annotation) during reasoning. Although effective, this approach suffers from high training and inference complexity, limited operation types, and poor scalability when new tools are required.
Key idea of Monet – Monet trains an MLLM to conduct visual reasoning entirely within a continuous latent visual embedding space, eliminating dependence on external tools. The model learns to generate a special <latent> token that signals the start of latent thinking; subsequent vectors are treated as implicit visual embeddings that are later consumed by the language decoder.
Challenges
Obtaining supervision for latent embeddings is difficult because auxiliary images can span hundreds or thousands of tokens, making direct alignment computationally expensive.
Standard next‑token prediction lets the model memorize training answers, so gradients can bypass the latent embeddings entirely.
Reinforcement learning objectives (e.g., GRPO) cannot directly compute gradients for latent embeddings.
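To make the third challenge concrete, here is a toy illustration (not the paper's actual objective) of why a token‑level policy loss is blind to continuous latents: the loss only sums log‑probabilities of sampled discrete tokens, and a latent vector is never sampled from the vocabulary softmax, so nothing in the objective depends on it.

```python
import numpy as np

def grpo_token_loss(token_logps, advantage):
    """Toy GRPO-style policy term: advantage-weighted sum of log-probs of
    sampled DISCRETE tokens. A continuous latent embedding is not drawn from
    the vocabulary softmax, so it contributes no log-prob term here -- the
    objective provides no gradient path through the latent vectors.
    (Illustrative only; real GRPO adds probability ratios, clipping, and
    group-relative baselines.)"""
    return -advantage * float(np.sum(token_logps))

# A response with 3 text tokens and 2 latent vectors: only the 3 tokens
# appear in the loss; the latent vectors are invisible to the objective.
loss = grpo_token_loss(np.log([0.5, 0.25, 0.8]), advantage=1.0)
```

VLPO (Stage 3 below) addresses exactly this gap by estimating a generation probability for the latent embeddings themselves.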
Training pipeline
Monet adopts a three‑stage training framework, combining supervised fine‑tuning (SFT) with reinforcement learning, built on the base model Qwen2.5‑VL‑7B.
Stage 1 – Warm‑up SFT : Fine‑tune on the constructed Monet‑SFT‑125K dataset to adapt the model to interleaved image‑text reasoning. This prevents the model from ignoring intermediate auxiliary images.
Stage 2 – Alignment SFT : Introduce two losses:
An alignment loss that pulls the Student CoT (the model‑generated latent embedding) toward the Teacher CoT (the representation of the auxiliary image). The loss is routed exclusively through the latent embedding, so gradients cannot bypass it.
A standard next‑token‑prediction loss to preserve language‑modeling capability.
The combined loss yields high‑quality latent embeddings that encode visual information without explicit image tokens.
Stage 3 – VLPO (Visual‑Latent Policy Optimization) : A reinforcement‑learning algorithm that estimates the generation probability of latent embeddings and incorporates it into the policy loss. By aligning the sampled latent embedding with the high‑quality target from Stage 2, VLPO encourages the model to produce useful latent visual reasoning even when no auxiliary image is provided.
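The Stage 2 objective can be sketched as follows. This is a minimal numpy sketch under assumptions: the exact loss form (cosine distance here), the pooling of teacher features, and the weighting `alpha` are illustrative choices, not the paper's specification.

```python
import numpy as np

def alignment_loss(student_latents, teacher_embeds):
    """Cosine-distance alignment between model-generated latent embeddings
    (Student CoT) and auxiliary-image representations (Teacher CoT).
    Shapes: (num_latents, dim). Hypothetical form -- the paper's exact
    alignment loss may differ."""
    s = student_latents / np.linalg.norm(student_latents, axis=-1, keepdims=True)
    t = teacher_embeds / np.linalg.norm(teacher_embeds, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def stage2_loss(student_latents, teacher_embeds, ntp_loss, alpha=1.0):
    # Combined Stage-2 objective: alignment term plus standard
    # next-token-prediction loss (alpha is an assumed weighting).
    return alignment_loss(student_latents, teacher_embeds) + alpha * ntp_loss

rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 8))
# Perfectly aligned latents -> the alignment term vanishes and only the
# language-modeling loss remains.
loss = stage2_loss(latents, latents.copy(), ntp_loss=2.0)
```

Because the teacher embeddings enter only through `alignment_loss`, the visual supervision reaches the model solely via the latent embeddings, matching the gradient‑flow constraint described above.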
Dataset construction (Monet‑SFT‑125K)
The dataset is built through a three‑phase curation pipeline:
Phase 1 selects samples where Qwen2.5‑VL‑7B fails on the question‑image pair, ensuring auxiliary images are necessary.
Phase 2 filters these samples with a stronger model (Qwen2.5‑VL‑72B) that answers correctly when the auxiliary image is present, guaranteeing the auxiliary image’s relevance.
Phase 3 uses a closed‑source model to annotate the few key tokens in the auxiliary image that are most relevant to the answer.
The final dataset contains diverse visual operations (crop, annotate, generate new visual states) across real‑world images, charts, and OCR‑centric tasks.
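The three curation phases can be sketched as a filtering pipeline. This is a hypothetical sketch: `weak_model`, `strong_model`, and `annotator` are placeholder callables standing in for Qwen2.5‑VL‑7B, Qwen2.5‑VL‑72B, and the closed‑source annotator, not real APIs.

```python
def build_monet_sft(samples, weak_model, strong_model, annotator):
    """Three-phase curation sketch for Monet-SFT-125K.
    Each sample is a dict with 'image', 'aux_image', 'question', 'answer'."""
    kept = []
    for s in samples:
        # Phase 1: keep only questions the weak model gets wrong without
        # the auxiliary image -> the auxiliary image is actually necessary.
        if weak_model(s["image"], s["question"]) == s["answer"]:
            continue
        # Phase 2: keep only samples the strong model answers correctly
        # WITH the auxiliary image -> the auxiliary image is relevant.
        if strong_model(s["image"], s["aux_image"], s["question"]) != s["answer"]:
            continue
        # Phase 3: annotate the few key tokens in the auxiliary image
        # that are most relevant to the answer.
        s["key_tokens"] = annotator(s["aux_image"], s["answer"])
        kept.append(s)
    return kept

# Demo with stub models.
samples = [
    {"image": "img0", "aux_image": "aux0", "question": "q0", "answer": "A"},
    {"image": "img1", "aux_image": "aux1", "question": "q1", "answer": "B"},
]
weak = lambda img, q: "A"          # solves q0 alone -> sample 0 dropped in Phase 1
strong = lambda img, aux, q: "B"   # solves q1 with the aux image -> sample 1 kept
annotate = lambda aux, ans: ["key"]
dataset = build_monet_sft(samples, weak, strong, annotate)
```

The two filters together guarantee that every surviving sample is one where the auxiliary image is both necessary and sufficient for the correct answer.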
Model architecture
Monet‑7B inherits the transformer backbone of Qwen2.5‑VL‑7B. During inference, the model may emit a <latent> token; the following hidden vectors are treated as latent visual embeddings and later fed back into the decoder, after which the model resumes normal language generation.
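The decoding behavior above can be sketched as a generation loop. Everything here is an assumption for illustration: `step` (one decoder forward pass returning a sampled token and a hidden state) and `embed` (token embedding lookup) are hypothetical stand‑ins for the real model interfaces, and the fixed latent count is a simplification.

```python
import numpy as np

LATENT_ID, EOS_ID = 1, 2  # assumed special-token ids for this sketch

def generate_with_latents(step, prompt_embeds, embed, num_latents=4, max_steps=32):
    """When the model emits <latent>, subsequent hidden vectors are fed back
    as input embeddings (latent visual thinking) instead of being decoded to
    tokens; afterwards normal token generation resumes."""
    embeds = list(prompt_embeds)
    tokens = []
    while len(tokens) < max_steps:
        tok, hidden = step(embeds)
        if tok == LATENT_ID:
            tokens.append(tok)
            # Latent thinking: recycle hidden vectors as the next inputs.
            for _ in range(num_latents):
                _, hidden = step(embeds)
                embeds.append(hidden)
            continue
        tokens.append(tok)
        if tok == EOS_ID:
            break
        embeds.append(embed(tok))  # ordinary token: embed and continue
    return tokens

# Minimal demo with a toy decoder that emits <latent> once, then EOS.
class _ToyDecoder:
    def __init__(self):
        self.calls = 0
    def step(self, embeds):
        self.calls += 1
        tok = LATENT_ID if self.calls == 1 else EOS_ID
        return tok, np.full(4, float(len(embeds)))
    def embed(self, tok):
        return np.zeros(4)

toy = _ToyDecoder()
out = generate_with_latents(toy.step, [np.zeros(4)] * 3, toy.embed)
```

Because `num_latents` is just a loop bound at inference time, this structure also makes the test‑time scaling experiments below possible: the latent budget can be raised beyond what was seen in training.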
Experimental results
Monet‑7B is evaluated on in‑distribution perception tasks (real‑world images, charts, OCR) and out‑of‑distribution abstract visual reasoning. Compared with the base model and prior “think with images” baselines, Monet achieves:
3.00%–9.75% absolute improvement on in‑distribution tasks.
2.31% absolute gain on abstract visual reasoning tasks.
Figures 6 and 7 of the paper illustrate these gains.
Ablation studies
Removing any SFT stage degrades performance, confirming the necessity of warm‑up, alignment, and gradient‑flow constraints.
Adding VLPO on top of the SFT‑trained model yields further improvements, especially for out‑of‑distribution reasoning.
GRPO provides limited benefit and can even hurt stability when applied after Monet‑SFT, highlighting its limitation for latent visual reasoning.
Latent size scaling
Monet exhibits a test‑time scaling law: increasing the number of latent tokens at inference (even beyond the training length) consistently improves accuracy on in‑distribution tasks, while out‑of‑distribution tasks only benefit when VLPO is applied.
Conclusion
Monet demonstrates that training MLLMs to think directly in a latent visual space yields both higher efficiency (no external tool calls) and stronger reasoning capabilities across a range of visual tasks. The three‑stage SFT combined with VLPO provides a practical recipe for end‑to‑end latent visual reasoning.