
MMEvalPro: A Trustworthy Benchmark for Evaluating Multimodal Large Models

MMEvalPro, a new benchmark created by researchers from Peking University, Chinese Academy of Medical Sciences, CUHK and Alibaba, augments existing multimodal datasets with perception and knowledge questions and introduces a Genuine Accuracy metric, revealing that top multimodal models still lag far behind humans and exposing shortcut‑driven performance on prior tests.


Authors: MMEvalPro Team (Peking University, Chinese Academy of Medical Sciences, The Chinese University of Hong Kong, Alibaba)

Abstract: Recent multimodal large models such as GPT‑4o, Gemini‑pro and QwenVL‑Max achieve high rankings on existing benchmarks, but the credibility of these rankings has been questioned. The authors demonstrate that even without visual input, large language models (LLMs) can attain near‑state‑of‑the‑art performance on current multimodal tests, exposing a mismatch between benchmark design and true multimodal understanding.

Problem with Existing Benchmarks: Most multimodal evaluations use multiple‑choice questions (MCQ) that include an image, a question, candidate answers and a ground‑truth answer. This format is efficient but introduces bias, allowing models to exploit pattern shortcuts rather than genuine visual‑language reasoning. Two diagnostic tests—“Seeing vs. Not‑Seeing” and “Answer Consistency”—show that LLMs without image access can match or surpass multimodal models (LMMs) on these benchmarks, and that Type‑I errors (correct answers without real understanding) are frequent.
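To make the "Seeing vs. Not-Seeing" diagnostic concrete, the sketch below compares a model's accuracy on the same MCQs with and without the image. It is a minimal illustration, not the authors' evaluation harness; the `MCQItem` fields and the `answer_mcq` callable are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class MCQItem:
    """One multiple-choice item: image, question, candidate answers, gold label."""
    image: Optional[str]   # path to the image, or None when the image is withheld
    question: str
    options: list[str]     # candidate answers, e.g. ["A) ...", "B) ...", ...]
    answer: str            # gold option label, e.g. "B"

def seeing_vs_not_seeing(
    items: list[MCQItem],
    answer_mcq: Callable[[MCQItem], str],  # hypothetical: model returns a label
) -> tuple[float, float]:
    """Return (accuracy with image, accuracy without image) on the same items."""
    seen = blind = 0
    for item in items:
        if answer_mcq(item) == item.answer:
            seen += 1
        # Ask the identical question again with the image withheld.
        blind_item = MCQItem(image=None, question=item.question,
                             options=item.options, answer=item.answer)
        if answer_mcq(blind_item) == item.answer:
            blind += 1
    n = len(items)
    return seen / n, blind / n
```

If the two accuracies come out close, the questions are answerable from text patterns alone, which is exactly the bias this diagnostic is meant to expose.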

MMEvalPro Framework: To address these issues, the team constructed MMEvalPro, a new benchmark that augments the MMMU, ScienceQA and MathVista datasets with two additional questions per original item: a perception question probing the visual details of the image and a knowledge question probing the background knowledge the item requires. A novel metric, "Genuine Accuracy", measures the proportion of instances where a model correctly answers all three linked questions, thereby filtering out shortcut-driven correct answers.
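As a rough sketch, Genuine Accuracy can be computed by grouping per-question correctness by triple. The `TripleResult` structure here is illustrative, not the benchmark's actual schema, and assumes correctness judgments have already been made for each question.

```python
from dataclasses import dataclass

@dataclass
class TripleResult:
    """Per-question correctness for one linked triple."""
    original_correct: bool
    perception_correct: bool
    knowledge_correct: bool

def genuine_accuracy(results: list[TripleResult]) -> float:
    """Fraction of triples where the model answers all three linked questions.

    A triple counts only when the original, perception, and knowledge
    questions are all correct, so an isolated hit on the original
    question does not score.
    """
    if not results:
        return 0.0
    genuine = sum(
        r.original_correct and r.perception_correct and r.knowledge_correct
        for r in results
    )
    return genuine / len(results)
```

A model that gets the original question right but misses the linked perception probe scores zero on that triple, which is how the metric filters out shortcut hits.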

Dataset Statistics: MMEvalPro contains 2,138 question triples (original, perception, knowledge), totaling 6,414 individual MCQs. Each triple was manually annotated and reviewed by at least two experts to ensure high quality.
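The reported counts are internally consistent (2,138 triples × 3 linked MCQs = 6,414 questions). A quick sanity check over a loaded copy of the data, using a hypothetical "questions" key rather than the benchmark's actual schema, could look like this:

```python
def check_dataset_counts(triples: list[dict]) -> None:
    """Sanity-check the published statistics for a loaded copy of the benchmark.

    Assumes each triple dict carries its three linked MCQs under a
    hypothetical "questions" key (original, perception, knowledge).
    """
    assert len(triples) == 2138, f"expected 2,138 triples, got {len(triples)}"
    total = sum(len(t["questions"]) for t in triples)
    assert total == 6414, f"expected 6,414 MCQs, got {total}"
```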

Experimental Results: When evaluated on MMEvalPro, top-performing multimodal models (e.g., QwenVL-Max, GPT-4o) still lag behind human performance by roughly 31.73% in Genuine Accuracy, a larger gap than the 8.03% reported on previous benchmarks. The performance advantage of the best LMM over the best LLM widens from 14.64% to 23.09% under the new metric. Moreover, the score disparity between LLMs and LMMs expands from at most 1.5× on legacy datasets to 4.8× on MMEvalPro, highlighting the benchmark's stricter discrimination of true multimodal capability.

Conclusion: MMEvalPro provides a more rigorous and trustworthy evaluation protocol for multimodal large models by requiring integrated visual perception and domain knowledge. The authors invite the community to use the benchmark, share results, and contribute feedback.

Tags: large language models, benchmark, trustworthy AI, MMEvalPro, multimodal evaluation
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
