How M2-Reasoning-7B Achieves State‑of‑the‑Art Spatial Reasoning in Multimodal AI

M2-Reasoning-7B, an open‑source 7B multimodal model from Ant Group, combines a high‑quality data pipeline with dynamic multi‑task training and a novel reward function to deliver state‑of‑the‑art performance on both general and spatial reasoning benchmarks, surpassing many larger competitors.

Challenge: Spatial Blind Spot in Multimodal Models

Current multimodal large models excel at static image‑text tasks but struggle with dynamic spatial interactions such as object motion, relative positions, and scene geometry, limiting their applicability in autonomous driving, robotics, and AR/VR.

M2-Reasoning-7B Overview

The Ant Group inclusionAI team has released M2-Reasoning-7B, a 7‑billion‑parameter multimodal model designed for unified general reasoning (math, logic) and spatial reasoning (movement, orientation, physical interaction). The model, code, and technical report are all publicly available.

Model: https://huggingface.co/inclusionAI/M2-Reasoning
Code: https://github.com/inclusionAI/M2-Reasoning
Technical report: https://arxiv.org/abs/2507.08306

Two Killer Features

1. High‑quality "feeding": 294.2K data samples – a multi‑stage data synthesis pipeline generated 268K general reasoning samples and 26.2K spatial reasoning samples, each enriched with multimodal chain‑of‑thought annotations and difficulty scores.

2. Fine‑grained "training": dynamic multi‑task RLVR – after a cold‑start SFT on the chain‑of‑thought data, a dynamic reinforcement‑learning stage applies curriculum learning, sample‑weight adjustment, and a cosine‑annealed KL penalty, together with a custom reward called Exponential Decay Numeric Matching (EDNM) that guides the model toward accurate spatial estimates.
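The report does not spell out the exact annealing schedule, so the following is only a minimal sketch of what a cosine‑annealed KL coefficient typically looks like; the `beta_start`, `beta_end`, and `total_steps` values are illustrative assumptions, not numbers from the paper.

```python
import math

def cosine_annealed_kl_coeff(step: int, total_steps: int,
                             beta_start: float = 0.05,
                             beta_end: float = 0.001) -> float:
    """Decay the KL-penalty coefficient from beta_start to beta_end
    along a half-cosine curve: strong regularization toward the SFT
    policy early in training, more freedom to explore later on.
    (Sketch only; the paper's actual values are not given here.)"""
    progress = min(step / total_steps, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return beta_end + (beta_start - beta_end) * cosine
```

Annealing the penalty this way keeps the policy close to the cold‑start SFT model while the reward signal is still noisy, then gradually relaxes the constraint as training stabilizes.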

Training Data Construction

The pipeline first creates high‑quality multimodal chain‑of‑thought data using strong models, then automatically evaluates each sample for answer correctness and reasoning quality, filtering out ambiguous or low‑quality items. Spatial tasks cover ten sub‑categories, from static object counting to dynamic video‑based direction estimation.
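As a rough illustration of that filtering step, the sketch below keeps only samples whose answer is verified correct and whose reasoning clears a quality threshold; the field names and the 0.7 threshold are assumptions for illustration, not the team's actual schema.

```python
from dataclasses import dataclass

@dataclass
class CoTSample:
    question: str
    chain_of_thought: str
    answer: str
    answer_correct: bool    # checked against the ground-truth answer
    reasoning_score: float  # 0-1 quality score from an automatic judge
    difficulty: float       # 0-1 difficulty estimate, reused during RLVR

def filter_samples(samples: list[CoTSample],
                   min_reasoning_score: float = 0.7) -> list[CoTSample]:
    """Drop samples with wrong answers or low-quality reasoning chains,
    leaving only unambiguous, high-quality training data."""
    return [s for s in samples
            if s.answer_correct and s.reasoning_score >= min_reasoning_score]
```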

Dynamic Training and Reward Design

During RLVR, the model learns from easy to hard tasks (curriculum learning) while the framework dynamically adjusts sample weights to focus on medium‑difficulty examples that provide the strongest learning signal. The EDNM reward gives partial credit for predictions close to the ground truth, enabling smoother learning for continuous spatial quantities.
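The technical report gives the precise definition of EDNM; as a minimal sketch of the idea, a reward that decays exponentially with the relative error between prediction and ground truth could look like this, where the decay rate `alpha` is an illustrative assumption.

```python
import math

def ednm_reward(pred: float, target: float, alpha: float = 5.0) -> float:
    """Exponential-decay numeric matching (sketch): 1.0 for an exact
    match, smoothly decaying toward 0 as the relative error grows, so
    near-miss estimates of continuous quantities (room size, distance)
    earn partial credit instead of a hard 0/1 reward."""
    if target == 0:
        # Fall back to absolute error when the target is zero.
        return math.exp(-alpha * abs(pred))
    rel_error = abs(pred - target) / abs(target)
    return math.exp(-alpha * rel_error)
```

For example, predicting 9.5 m² for a 10 m² room yields exp(-0.25) ≈ 0.78 rather than zero, giving the policy a much smoother learning signal than an exact‑match reward would.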

Performance Highlights

On six mainstream math and logic benchmarks (MathVista, MathVision, DynaMath, etc.), M2‑Reasoning‑7B achieves an average score of 45.0, surpassing WeThink‑VL‑7B (44.3) and InternVL3‑8B (41.4). On the CV‑Bench spatial suite it scores 82.3, topping the leaderboard, and on the video‑based VSI‑Bench it reaches 42.3, beating InternVL3‑8B and approaching Gemini‑1.5‑Pro, with new SOTA records on room‑size estimation and relative‑direction tasks.

Conclusion and Outlook

The success of M2‑Reasoning‑7B demonstrates that targeted high‑quality data construction combined with dynamic, task‑aware training can close the spatial reasoning gap in multimodal models, paving the way for more capable AI systems in real‑world scenarios. Future work will address remaining limitations such as shorter reasoning chains and occasional visual perception errors.
