Perception‑R1: RL Gives Visual Insight Without Chain‑of‑Thought, Beats Four Tasks
The paper introduces Perception‑R1, a rule‑based reinforcement‑learning framework that trains multimodal large language models for visual perception tasks without relying on chain‑of‑thought reasoning, and demonstrates up to 17.9% performance gains on RefCOCO+, PixMo‑Count, PageOCR and COCO2017, while analyzing the key roles of perception confusion and reward design.
