LMM‑R1: A Two‑Stage Reinforcement Learning Framework for Enhancing Multimodal Model Reasoning
Researchers from Ant Group, Southeast University, and other institutions introduced the open‑source LMM‑R1 framework, a two‑stage reinforcement‑learning approach that first strengthens textual reasoning and then generalizes it to multimodal tasks, achieving significant gains on benchmarks including a football game, Sokoban, and geometry reasoning, at modest GPU cost.
Although rule‑based reinforcement learning (RL) has shown strong performance in text‑only large models, extending it to multimodal large models faces two major challenges: limited high‑quality multimodal data and weak foundational reasoning ability.
To address these issues, Ant Group, together with Southeast University and other institutions, released the open‑source LMM‑R1 framework. Through an innovative two‑stage training strategy, LMM‑R1 significantly improves performance on both multimodal and pure‑text benchmarks: after limited RL fine‑tuning, a 3B‑parameter model surpasses Gemini 1.5 Pro and Claude 3.5 Sonnet on a football‑game task and beats GPT‑4o on a Sokoban task.
The related code has been merged into the OpenRLHF‑M project, and the framework has quickly attracted academic attention, accumulating over 500 stars on GitHub since its February 2025 release.
Project page: https://forjadeforest.github.io/LMM-R1-ProjectPage/
Codebase: https://github.com/TideDra/lmm-r1
Paper: https://arxiv.org/abs/2503.07536
HuggingFace: https://huggingface.co/VLM-Reasoner
Two‑Stage Training Strategy
Stage 1 – Fundamental Reasoning Enhancement (FRE) : Utilizes abundant high‑quality pure‑text reasoning data (e.g., mathematics, science) with rule‑based RL to strengthen logical thinking, multi‑step inference, and complex calculations, building a solid reasoning foundation without relying on multimodal data.
Stage 2 – Multimodal Generalization Training (MGT) : Transfers the reasoning abilities learned in FRE to multimodal domains. The team explores three key areas: geometric reasoning (using GeoDB), perception‑reasoning balance (using VerMulti), and agent‑related tasks such as Sokoban.
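The "rule‑based RL" used in the FRE stage refers to rewards computed by verifiable rules rather than a learned reward model. A minimal sketch of such a reward function is below; the tag format, score values, and function name are illustrative assumptions, not the paper's exact implementation:

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Hypothetical R1-style rule-based reward.

    Combines a format reward (the model must reason inside <think> tags
    and answer inside <answer> tags) with an accuracy reward (the final
    answer must match the reference). No learned reward model is needed.
    """
    reward = 0.0
    # Format reward: response follows the <think>...</think><answer>...</answer> template.
    if re.fullmatch(r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*", response):
        reward += 0.5
    # Accuracy reward: extracted answer matches the ground truth exactly.
    match = re.search(r"(?s)<answer>(.*?)</answer>", response)
    if match and match.group(1).strip() == ground_truth.strip():
        reward += 1.0
    return reward
```

Because correctness is checked mechanically, abundant text‑only math and science problems with known answers can supply training signal without any multimodal annotation.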
Experiments using Qwen‑VL‑Instruct‑3B as the baseline show average improvements of 4.5%–4.8% on both text and multimodal benchmarks, with especially strong gains on geometry‑heavy tasks. The two‑stage approach also prevents the typical reasoning degradation observed when training directly on multimodal data.
In agent‑centric evaluations (e.g., Sokoban), the LMM‑R1‑enhanced model can plan complete action sequences from a single initial frame, demonstrating robust visual‑spatial reasoning and planning.
The framework builds on the upstream OpenRLHF project, introducing innovations such as sample packing combined with Ring FlashAttention, which enables context length to scale linearly with GPU count, and reducing resource consumption through dynamic gradient clipping.
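The core idea of sample packing is to concatenate several variable‑length sequences into one fixed‑size buffer, recording the boundary offsets so the attention kernel can keep the samples separate; this avoids wasting compute on padding. A minimal greedy sketch of that idea (the function name and the cumulative‑offset representation are illustrative assumptions, not OpenRLHF‑M's actual code):

```python
def pack_samples(samples, max_len):
    """Greedily pack variable-length token sequences into fixed buffers.

    Each pack is (tokens, cu_seqlens): the concatenated token list plus
    cumulative sequence boundaries, the representation variable-length
    attention kernels typically consume.
    """
    packs = []
    tokens, cu = [], [0]
    for seq in samples:
        # Flush the current buffer if the next sequence would overflow it.
        if tokens and len(tokens) + len(seq) > max_len:
            packs.append((tokens, cu))
            tokens, cu = [], [0]
        tokens = tokens + seq
        cu = cu + [len(tokens)]  # record where this sample ends
    if tokens:
        packs.append((tokens, cu))
    return packs
```

For example, packing `[[1, 2], [3, 4, 5], [6]]` with `max_len=4` yields two buffers instead of three padded ones. Ring FlashAttention then shards each long packed sequence across GPUs, which is what lets usable context length grow with the number of devices.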
The team plans to continue advancing multimodal RL techniques for applications like visual question answering and intelligent agents, collaborating with the open‑source community.
Business applications: The framework is expected to benefit finance and insurance sectors where multimodal reasoning is critical (e.g., compliance review, claim assessment, underwriting).