Monet Enables Multimodal Models to Perform Human‑like Abstract Visual Thinking

Monet introduces a training paradigm that lets multimodal large language models reason directly in a continuous latent visual space, replacing external tool calls with implicit visual embeddings. Through three‑stage supervised fine‑tuning and a novel visual‑latent policy optimization (VLPO) algorithm, it demonstrates significant gains on both in‑distribution perception tasks and out‑of‑distribution abstract visual reasoning.

Machine Heart

Background and motivation – The prevailing “think with images” paradigm augments multimodal large language models (MLLMs) with external tools or code generation to insert auxiliary images (e.g., cropping, annotation) during reasoning. Although effective, this approach suffers from high training and inference complexity, limited operation types, and poor scalability when new tools are required.

Key idea of Monet – Monet trains an MLLM to conduct visual reasoning entirely within a continuous latent visual embedding space, eliminating dependence on external tools. The model learns to generate a special <latent> token that signals the start of latent thinking; subsequent vectors are treated as implicit visual embeddings that are later consumed by the language decoder.

Challenges

Obtaining supervision for latent embeddings is difficult because an auxiliary image can span hundreds or thousands of tokens, making direct alignment computationally expensive.

Standard next‑token prediction easily memorizes training data, bypassing optimization of the latent embeddings.

Reinforcement learning objectives (e.g., GRPO) cannot directly compute gradients for latent embeddings.

Training pipeline

Monet adopts a three‑stage training framework, two supervised fine‑tuning (SFT) stages followed by reinforcement learning, built on the base model Qwen2.5‑VL‑7B.

Stage 1 – Warm‑up SFT : Fine‑tune on the constructed Monet‑SFT‑125K dataset to adapt the model to interleaved image‑text reasoning. This prevents the model from ignoring intermediate auxiliary images.

Stage 2 – Alignment SFT : Introduce two losses:

An alignment loss that pulls the model‑generated latent embeddings (the Student CoT) toward the representation of the auxiliary image (the Teacher CoT). The loss is applied only through the latent embedding, so gradients cannot bypass it.

Standard next‑token‑prediction loss to keep language modeling capability.

The combined loss yields high‑quality latent embeddings that encode visual information without explicit image tokens.
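The Stage‑2 objective can be sketched as follows. The exact alignment term is not spelled out in the article, so this toy version assumes a cosine‑distance alignment loss and a weighting factor `lam`; the function names are illustrative, not the paper's.

```python
import math

def cosine_align_loss(student_latent, teacher_repr):
    # 1 - cosine similarity; the teacher representation is treated as a
    # constant (stop-gradient), so gradients flow only through the
    # student's latent embedding.
    dot = sum(s * t for s, t in zip(student_latent, teacher_repr))
    ns = math.sqrt(sum(s * s for s in student_latent))
    nt = math.sqrt(sum(t * t for t in teacher_repr))
    return 1.0 - dot / (ns * nt)

def stage2_loss(ntp_loss, student_latent, teacher_repr, lam=1.0):
    # Combined Stage-2 objective: standard next-token prediction plus
    # the alignment term on the latent embedding.
    return ntp_loss + lam * cosine_align_loss(student_latent, teacher_repr)
```

Because the teacher side is held constant, minimizing this loss can only move the student's latent embedding, which is exactly the gradient‑flow constraint described above.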

Stage 3 – VLPO (Visual‑Latent Policy Optimization) : A reinforcement‑learning algorithm that estimates the generation probability of latent embeddings and incorporates it into the policy loss. By aligning the sampled latent embedding with the high‑quality target from Stage 2, VLPO encourages the model to produce useful latent visual reasoning even when no auxiliary image is provided.
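One way to make latent embeddings amenable to a policy gradient, sketched under our own assumption (not necessarily the paper's exact formulation) that each sampled latent vector is scored by an isotropic Gaussian centered at the model's predicted embedding. The resulting log‑likelihood can then enter a GRPO/PPO‑style clipped surrogate alongside the token log‑probabilities:

```python
import math

def gaussian_logprob(latent, mean, sigma=1.0):
    # Log-likelihood of a sampled latent vector under an isotropic
    # Gaussian centered at the model's predicted embedding (assumed form).
    d = len(latent)
    sq = sum((x - m) ** 2 for x, m in zip(latent, mean))
    return -0.5 * sq / sigma ** 2 - d * math.log(sigma * math.sqrt(2.0 * math.pi))

def clipped_policy_term(logp_new, logp_old, advantage, eps=0.2):
    # PPO/GRPO-style clipped surrogate; in a VLPO-like setup the latent
    # log-probs would be summed with token log-probs before this step.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return -min(ratio * advantage, clipped * advantage)
```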

Dataset construction (Monet‑SFT‑125K)

The dataset is built through a three‑phase correction process:

Phase 1 selects samples where Qwen2.5‑VL‑7B fails on the question‑image pair, ensuring auxiliary images are necessary.

Phase 2 filters these samples with a stronger model (Qwen2.5‑VL‑72B) that answers correctly when the auxiliary image is present, guaranteeing the auxiliary image’s relevance.

Phase 3 uses a closed‑source model to annotate the few key tokens in the auxiliary image that are most relevant to the answer.

The final dataset contains diverse visual operations (crop, annotate, generate new visual states) across real‑world images, charts, and OCR‑centric tasks.
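The three‑phase filter above can be sketched as a simple pipeline; the model callables here are hypothetical stand‑ins for Qwen2.5‑VL‑7B, Qwen2.5‑VL‑72B, and the closed‑source annotator.

```python
def build_sft_dataset(samples, weak_model, strong_model, annotator):
    kept = []
    for s in samples:
        # Phase 1: keep only questions the weak model answers incorrectly,
        # so the auxiliary image is actually needed.
        if weak_model(s["image"], s["question"]) == s["answer"]:
            continue
        # Phase 2: keep only samples the strong model solves once the
        # auxiliary image is provided, confirming its relevance.
        if strong_model(s["image"], s["question"], s["aux_image"]) != s["answer"]:
            continue
        # Phase 3: annotate the key tokens in the auxiliary image.
        kept.append(dict(s, key_tokens=annotator(s["aux_image"], s["answer"])))
    return kept
```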

Model architecture

Monet‑7B inherits the transformer backbone of Qwen2.5‑VL‑7B. During inference, the model may emit a <latent> token; the following hidden vectors are treated as latent visual embeddings and later fed back into the decoder, after which the model resumes normal language generation.
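The mode switch at inference can be illustrated with a toy decoding loop. The transformer step is faked with a fixed script, and the latent span length `NUM_LATENT` is an assumption for illustration; the real model's values and mechanics may differ.

```python
LATENT = "<latent>"
NUM_LATENT = 3  # assumed latent-span length; the real value may differ

def toy_step(inputs):
    # Stand-in for one transformer forward pass: returns the next token
    # and the last hidden state. A real model attends over `inputs`;
    # here a fixed script keeps the sketch runnable.
    script = ["Q:", "The", LATENT, None, None, None, "answer", "is", "42"]
    pos = len(inputs)
    tok = script[pos] if pos < len(script) else "<eos>"
    hidden = [float(pos)] * 4  # fake hidden vector
    return tok, hidden

def generate(prompt, max_steps=12):
    seq, out, latent_left = list(prompt), [], 0
    for _ in range(max_steps):
        tok, hidden = toy_step(seq)
        if latent_left > 0:
            seq.append(hidden)  # feed the hidden state back as an embedding
            latent_left -= 1
            continue
        if tok == LATENT:
            out.append(tok)
            seq.append(tok)
            latent_left = NUM_LATENT  # enter latent visual thinking
        elif tok == "<eos>":
            break
        else:
            out.append(tok)
            seq.append(tok)
    return out
```

The key point the sketch shows: once `<latent>` is emitted, the model's own hidden states are appended to the input sequence in place of sampled tokens, and normal token generation resumes when the latent span ends.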

Experimental results

Monet‑7B is evaluated on in‑distribution perception tasks (real‑world images, charts, OCR) and out‑of‑distribution abstract visual reasoning. Compared with the base model and prior “think with images” baselines, Monet achieves:

3%–9.75% absolute improvement on in‑distribution tasks.

2.31% absolute gain on abstract visual reasoning tasks.

Figures 6 and 7 illustrate these gains.

Ablation studies

Removing any SFT stage degrades performance, confirming the necessity of warm‑up, alignment, and gradient‑flow constraints.

Adding VLPO on top of the SFT‑trained model yields further improvements, especially for out‑of‑distribution reasoning.

GRPO provides limited benefit and can even hurt stability when applied after Monet‑SFT, highlighting its limitation for latent visual reasoning.

Latent size scaling

Monet exhibits a test‑time scaling law: increasing the number of latent tokens at inference (even beyond the training length) consistently improves accuracy on in‑distribution tasks, while out‑of‑distribution tasks only benefit when VLPO is applied.

Conclusion

Monet demonstrates that training MLLMs to think directly in a latent visual space yields both higher efficiency (no external tool calls) and stronger reasoning capabilities across a range of visual tasks. The three‑stage SFT combined with VLPO provides a practical recipe for end‑to‑end latent visual reasoning.


[Figures: Monet overview · training pipeline · SFT stages · performance on in‑distribution tasks · performance on abstract reasoning]
Tags: Multimodal · Reinforcement Learning · MLLM · Visual Reasoning · Latent Embedding