Do Interleaved Images Really Help Thinking‑with‑Images Models?

An analysis of recent Vision‑Language models shows that removing interleaved images has minimal impact on benchmark performance, suggesting that better priors from RL fine‑tuning and effective context management are the key drivers of success.

Machine Learning Algorithms & Natural Language Processing

During a group meeting, the author noticed that some reproductions of "Thinking with Images" works omitted the interleaved images, yet model performance did not degrade, which prompted a deeper investigation.

Some vision‑language models hallucinate tool usage even when no tool is executed, a phenomenon similar to the hallucinations observed in LLM4Math.

Experiments on several successful "Thinking with Images" projects (Pixel-Reasoner, DeepEyes, Thyme) in which the interleaved images were removed (i.e., the visual tool/code outputs were suppressed) showed only a small effect on benchmark scores.

All these works share a common setting: they start from Qwen2.5VL‑7B‑Instruct and are further refined with reinforcement learning (RL). The performance gap between the RL‑fine‑tuned models and the base model is far larger than the gap caused by dropping interleaved images.

Attention‑rollout visualizations reveal that the model primarily attends to the input image before the interleaved image and often arrives at the answer already after processing the input image and the reasoning step. In a V* example, the model first looks at the input image and query, then the thinking step, followed by the interleaved image, and finally the answer.
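The attention-rollout technique referenced above (Abnar and Zuidema, 2020) aggregates per-layer attention maps, mixed with an identity term for the residual connection, into a single token-to-token influence matrix. A minimal NumPy sketch, assuming pre-softmaxed attention maps of shape (heads, seq, seq) have already been extracted from the model:

```python
import numpy as np

def attention_rollout(attentions):
    """Aggregate per-layer attention into a single rollout matrix.

    attentions: list of per-layer attention maps, each of shape
    (num_heads, seq_len, seq_len), rows already softmax-normalized.
    Returns a (seq_len, seq_len) matrix whose row i estimates how much
    output position i ultimately draws from each input position.
    """
    rollout = None
    for layer_attn in attentions:
        # Average over heads, add the residual connection as an
        # identity term, and re-normalize rows to keep them stochastic.
        attn = layer_attn.mean(axis=0)
        attn = 0.5 * attn + 0.5 * np.eye(attn.shape[-1])
        attn = attn / attn.sum(axis=-1, keepdims=True)
        # Compose with the rollout accumulated from earlier layers.
        rollout = attn if rollout is None else attn @ rollout
    return rollout
```

Row i of the result can then be reshaped over the image-patch positions to visualize which regions (input image, thinking step, interleaved image) the answer token actually relies on.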

Another example from arXiv:2510.17771 shows the model attending to the correct region (a receiver) but answering "Logitech" because that brand appears more frequently in the training corpus, illustrating how strong priors can both help and cause hallucinations.

The author hypothesizes that better priors, provided by RL fine‑tuning, may be the main source of performance gains. To test this, they trained a pure‑text SFT model using the DeepEyes RL data and found that it matches the performance of the full vision‑language pipeline. A similar result is reported in arXiv:2511.22586, which shows that pure‑text RL can also achieve comparable scores.

Consequently, the role of interleaved images appears limited, while stronger priors are valuable. The author questions whether cropping or zooming of images truly benefits Vision‑Language Models (VLMs).

Because input images are patchified, cropping merely re‑selects the patches that already contain the object, and zooming does not expose the model to scales outside its training distribution. This claim is supported by arXiv:2510.00054, which reports that scale variations have little impact on newer VLMs.
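The patch-reselection argument can be made concrete with a small sketch. The helper below (a hypothetical function, not from any of the cited works) maps a crop bounding box back onto the patch grid of the full image, showing that the crop covers a subset of patches the model has already seen:

```python
def patch_indices_for_crop(img_hw, patch, box):
    """Map a pixel-space crop box onto the full image's patch grid.

    img_hw: (H, W) of the full image in pixels.
    patch:  patch size in pixels (e.g. 14 for many ViT backbones).
    box:    (x0, y0, x1, y1) crop in pixel coordinates.
    Returns the (row, col) patch-grid indices the crop covers,
    illustrating that a crop selects existing patches rather than
    creating genuinely new visual content.
    """
    H, W = img_hw
    x0, y0, x1, y1 = box
    rows = range(y0 // patch, min((y1 - 1) // patch + 1, H // patch))
    cols = range(x0 // patch, min((x1 - 1) // patch + 1, W // patch))
    return [(r, c) for r in rows for c in cols]
```

For a 224x224 image with 14-pixel patches, a 28x28 crop at the origin covers exactly the four top-left patches; feeding it back as an interleaved image duplicates those tokens at a different resolution rather than adding new information.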

The main difficulty of perception benchmarks is long context: many irrelevant image tokens act as noise, and similarly noisy text tokens can distract the model.

From a context‑management perspective, the recommendation is to introduce useful context (crops that contain the target object) while avoiding noisy tool/code tokens that add irrelevant information.
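One simple instantiation of this recommendation is to prune low-relevance image tokens before they enter the reasoning context. A minimal sketch, assuming some per-token relevance score is available (for example, query-to-patch attention; the function name and scoring source are illustrative, not from the cited works):

```python
import numpy as np

def prune_image_tokens(tokens, scores, keep_ratio=0.25):
    """Keep only the highest-scoring image tokens.

    tokens:     sequence of image-token embeddings or identifiers.
    scores:     per-token relevance scores (e.g. query-to-patch attention).
    keep_ratio: fraction of tokens to retain as useful context.
    Returns the retained tokens in their original order, discarding
    the rest as context noise.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]  # indices of the top-k scores
    keep.sort()                     # restore original token order
    return [tokens[i] for i in keep]
```

This keeps the object-containing context while dropping the bulk of the irrelevant tokens, which is the same trade-off the training-free cropping methods discussed below aim at.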

Training‑free cropping/zooming methods such as Vicrop (arXiv:2502.17422) and Hide (arXiv:2510.00054) exemplify how to inject only the object‑relevant crops. Combining these with better priors from RL could yield further improvements.

Future directions include extending tool calls beyond cropping, e.g., DeepEyes v2's image search, leveraging pure‑vision feedback for tasks like video tracking (Molmo 2, arXiv:2601.10611), and exploring grounding or pointing mechanisms (arXiv:2509.23746).

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Vision-Language Models, Attention Rollout, Cropping Methods, Interleaved Images, RL Fine‑Tuning, Tool Hallucination
Written by

Machine Learning Algorithms & Natural Language Processing

Focused on frontier AI technologies, empowering AI researchers' progress.