
Why Multimodal LLMs Still Struggle with Multi-Image Math Reasoning: Insights from MV‑MATH

This article introduces the MV‑MATH dataset, a large‑scale multi‑image math benchmark, and evaluates 24 open‑source and closed‑source multimodal large language models, revealing significant performance gaps, especially on complex visual dependencies and higher difficulty levels.

AI Frontier Lectures

Mar 20, 2025


MV-MATH Dataset

The MV-MATH benchmark addresses the lack of multi‑visual math reasoning data by providing 2,009 high‑quality K‑12 math problems. Each problem combines 2–8 interleaved images with textual descriptions, forming complex multi‑image scenes.

Problems fall into three types (multiple‑choice, fill‑in‑the‑blank, and multi‑step QA) and span eleven mathematical domains (analytic geometry, algebra, combinatorics, logic, statistics, etc.), with three difficulty levels reflected in answer length.

Multi‑visual scenes: Every question includes multiple images, creating rich visual contexts.

Rich annotations: Each sample is cross‑validated by at least two annotators and includes the question, answer, detailed analysis, and image‑relevance tags.

Diverse domains: Coverage ranges from basic arithmetic to advanced geometry.

Image relevance: The dataset is split into a Mutually Dependent (MD) subset, where images rely on each other, and an Independent (ID) subset, where images can be interpreted separately.
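To make the annotation structure concrete, here is a minimal sketch of how one sample and the MD/ID split might be represented. The class name, field names, and label strings are hypothetical and chosen for illustration; they are not the schema of the released dataset.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MVMathSample:
    """One benchmark item. Field names are illustrative, not the official schema."""
    question: str
    answer: str
    analysis: str             # detailed worked solution
    images: List[str]         # paths or URLs of the interleaved images
    question_type: str        # e.g. "multiple_choice", "fill_in_blank", "multi_step"
    domain: str               # e.g. "algebra"
    difficulty: str           # e.g. "easy", "medium", "hard"
    image_relevance: str      # "MD" (mutually dependent) or "ID" (independent)

    def __post_init__(self) -> None:
        # The article states that each problem combines 2 to 8 images.
        if not 2 <= len(self.images) <= 8:
            raise ValueError("expected between 2 and 8 images per problem")
        if self.image_relevance not in ("MD", "ID"):
            raise ValueError("image_relevance must be 'MD' or 'ID'")


def split_by_relevance(samples: List[MVMathSample]) -> Dict[str, List[MVMathSample]]:
    """Group samples into the MD and ID subsets by their relevance tag."""
    subsets: Dict[str, List[MVMathSample]] = {"MD": [], "ID": []}
    for sample in samples:
        subsets[sample.image_relevance].append(sample)
    return subsets
```

Keeping the relevance tag on each sample, rather than storing two separate files, lets the same records be regrouped later by domain or difficulty.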

Comprehensive Multi‑Image Reasoning Evaluation

The authors evaluated 24 mainstream open‑source and closed‑source multimodal models on MV-MATH. Even the strongest models fall far short of human performance (human accuracy ≈ 76.5%).

Claude‑3.5 achieved the highest overall accuracy of 33.9%.

GPT‑4o reached 32.1% and Gemini‑1.5‑Pro 29.1%.

Qwen‑VL‑max scored 26.9%.

Open‑source LLaVA‑OneVision‑Chat‑72B obtained 26.2%.

The o1‑style model QVQ‑72B‑Preview achieved 29.3%.
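The size of the gap to human performance is easy to compute from the numbers above. The following sketch uses only the accuracies quoted in this article; the variable and function names are illustrative.

```python
HUMAN_ACCURACY = 76.5  # percent, as quoted above

# Overall accuracies (percent) quoted in this article.
REPORTED_ACCURACY = {
    "Claude-3.5": 33.9,
    "GPT-4o": 32.1,
    "Gemini-1.5-Pro": 29.1,
    "Qwen-VL-max": 26.9,
    "LLaVA-OneVision-Chat-72B": 26.2,
    "QVQ-72B-Preview": 29.3,
}


def gap_to_human(accuracy, human=HUMAN_ACCURACY):
    """Return each model's shortfall versus humans, in percentage points."""
    return {model: round(human - acc, 1) for model, acc in accuracy.items()}


gaps = gap_to_human(REPORTED_ACCURACY)

# Models ordered from highest to lowest overall accuracy.
ranking = sorted(REPORTED_ACCURACY, key=REPORTED_ACCURACY.get, reverse=True)

for model in ranking:
    print(f"{model}: {REPORTED_ACCURACY[model]:.1f}% (gap {gaps[model]:.1f} points)")
```

Even the top-ranked model, Claude-3.5, trails the human baseline by 42.6 percentage points.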

Performance varied across dimensions:

Domain: Claude‑3.5 excelled in arithmetic (54.2% accuracy) but performed poorly on combinatorial geometry (27.0%).

Difficulty: GPT‑4o was best on easy questions (40.3%), while Claude‑3.5 led on medium difficulty (37.5%). All models dropped sharply on hard questions, where Claude‑3.5 reached only 26.6%.

Prompting strategy: Chain‑of‑Thought (CoT) and few‑shot prompting did not consistently improve results; for many open‑source models they reduced accuracy.

Image relevance: Models performed noticeably worse on the MD subset than on the ID subset, highlighting challenges in cross‑image reasoning.

Image input format: Supplying images as an ordered sequence consistently outperformed concatenated (merged) image inputs, underscoring the importance of preserving positional and sequential information.
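Breakdowns like the ones above come down to grouping per-question results by one annotation field and computing accuracy within each group. The sketch below shows one way to do that. The record layout and the toy records are made up for illustration and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return accuracy (percent) for each distinct value of records[i][key]."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["correct"])
    return {group: 100.0 * hits[group] / totals[group] for group in totals}


# Toy per-question outcomes for a single hypothetical model (not real data).
records = [
    {"relevance": "MD", "difficulty": "hard", "correct": False},
    {"relevance": "MD", "difficulty": "easy", "correct": True},
    {"relevance": "ID", "difficulty": "easy", "correct": True},
    {"relevance": "ID", "difficulty": "medium", "correct": True},
]

by_relevance = accuracy_by(records, "relevance")
by_difficulty = accuracy_by(records, "difficulty")
```

The same function serves every dimension (domain, difficulty, image relevance) as long as each record carries the corresponding tag.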

Conclusion

Despite recent breakthroughs with slow‑thinking models (e.g., OpenAI o1, DeepSeek‑R1) in textual reasoning, multimodal models still lack robust paradigms for multi‑image mathematical reasoning. Extensive experiments confirm substantial gaps in visual‑textual integration, leaving ample room for improvement.

Paper: https://arxiv.org/abs/2502.20808

Project page: https://eternal8080.github.io/MV-MATH.github.io/
