Monet Enables Multimodal Models to Perform Human‑like Abstract Visual Thinking
Monet introduces a training paradigm that enables multimodal large language models to reason directly in a continuous latent visual space, replacing external tool calls with implicit visual embeddings. Through a three‑stage supervised fine‑tuning pipeline and a novel visual‑latent policy optimization method, it demonstrates significant gains on both in‑distribution perception tasks and out‑of‑distribution abstract visual reasoning.
