Why the Top Video Model Scores Only 49: Introducing Video‑MME‑v2 by Nanjing University

The new Video‑MME‑v2 benchmark reveals that, despite near‑saturated scores on existing video‑understanding tests, the strongest commercial model (Gemini‑3‑Pro) reaches only 49.4 points against a human expert's 90.7. The gap reflects the benchmark's layered ability hierarchy, group‑level non‑linear scoring, and the nuanced impact of "Thinking" features.


Current video‑understanding benchmarks have become saturated, leaving a wide gap between reported scores and real‑world experience. To address this, the Nanjing University team led by Fu Chaoyou, at the invitation of the Google Gemini evaluation group, has released the next‑generation benchmark Video‑MME‑v2, which features a three‑layer ability hierarchy and a group‑level non‑linear scoring mechanism.

Benchmark Design

Video‑MME‑v2 decomposes video understanding into three progressive layers:

Information Retrieval and Aggregation: tests whether the model can accurately identify and extract key facts across frames and modalities.

Temporal Understanding: builds on the first layer to assess the model's grasp of the time dimension, requiring recognition of action order, state changes, and event causality.

Complex Reasoning: the highest layer demands inference in open‑ended scenarios, evaluating whether the model can not only see but also explain and synthesize information like a human.

The benchmark contains 800 videos and 3,200 questions, annotated by 12 annotators and reviewed by 50 independent auditors, with over 3,300 human‑hours of effort across five rounds of cross‑review.

Group‑Level Non‑Linear Scoring

Instead of scoring each question independently, Video‑MME‑v2 groups related questions and applies two novel mechanisms:

Consistency Group: rewards models that correctly answer all questions in a group, reflecting genuine capability rather than isolated correct answers.

Coherence Group: evaluates whether a model can follow a logical chain across multi‑step reasoning tasks; a single mistake truncates scoring for the rest of the group ("first‑error cut‑off").

For consistency groups, the score is computed with an "incentive" scheme: the more questions a model answers correctly within a group, the higher the reward, which prevents a model from inflating its overall score by answering only a few scattered questions correctly.
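To make the two mechanisms concrete, below is a minimal sketch of how such group‑level scoring could work. The exact formulas are not published in this article, so the superlinear consistency bonus and the normalization shown here are illustrative assumptions, not the official implementation.

```python
# Illustrative sketch of the two group-level scoring mechanisms described above.
# The specific weighting is an assumption for demonstration purposes only.

from typing import List


def consistency_group_score(correct: List[bool]) -> float:
    """Reward answering all questions in a group: credit grows faster than
    linearly with the fraction of correct answers, so scattered single hits
    earn little."""
    frac = sum(correct) / len(correct)
    return frac ** 2  # assumed superlinear incentive, not the official formula


def coherence_group_score(correct: List[bool]) -> float:
    """Score a multi-step reasoning chain with a first-error cut-off:
    answers after the first mistake receive no credit."""
    credited = 0
    for ok in correct:
        if not ok:
            break
        credited += 1
    return credited / len(correct)


# A model that gets 2 of 4 group questions right in isolation earns only
# 0.25 on consistency (0.5 squared); if its one miss is the first step of
# a coherence chain, it earns 0.0 despite later correct answers.
print(consistency_group_score([True, False, True, False]))  # 0.25
print(coherence_group_score([False, True, True, True]))     # 0.0
```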

Evaluation Results

Human experts achieve a group‑level non‑linear score of 90.7 (average accuracy 94.9). The best commercial model, Gemini‑3‑Pro, scores 49.4, while the strongest open‑source model, Qwen3.5‑397B‑A17B‑Think (512 frames), reaches 39.1. This demonstrates a substantial performance gap even under the stricter evaluation.

The benchmark also reports the ratio of Non‑Linear Score to Avg Acc, showing that Gemini‑3‑Pro retains about 75% of its average accuracy under the consistency metric, whereas smaller models drop to 40% or lower, indicating weaker stability and robustness.
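As a rough illustration of how this stability ratio reads, the sketch below simply divides the group‑level non‑linear score by average accuracy; the average‑accuracy figures used are placeholders chosen to reproduce the reported ratios, not numbers taken from the benchmark.

```python
# Stability ratio: group-level non-linear score divided by average accuracy.
# The accuracy values below are placeholders for illustration, not reported figures.
def stability_ratio(non_linear_score: float, avg_acc: float) -> float:
    return non_linear_score / avg_acc


# A model with a non-linear score of 49.4 against an assumed average accuracy
# of about 66 retains roughly 75% of its accuracy under the stricter grouped
# metric; a model at 30 vs. an assumed 75 retains only 40%.
print(round(stability_ratio(49.4, 66.0), 2))  # ~0.75
print(round(stability_ratio(30.0, 75.0), 2))  # 0.4
```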

Insights on the "Thinking" Feature

Experiments reveal that the "Thinking" (chain‑of‑thought) augmentation does not universally improve performance; its benefit depends heavily on the presence of textual cues. For example, Qwen3.5‑122B‑A10B‑Think gains +3.8/+5.8 points with subtitles, but models like Qwen3‑V‑L‑8B and KimiVL‑16B experience drops of up to –4.0 points when subtitles are absent, especially on Level 3 reasoning tasks.

These findings suggest that many current models rely more on language anchors than on stable visual or auditory evidence, and that inappropriate use of "Thinking" can introduce noise.

Conclusion

Video‑MME‑v2 aims to shift the evaluation paradigm for video understanding toward measuring continuous, multimodal comprehension akin to human perception. By emphasizing consistency, coherence, and non‑linear scoring, it provides a more realistic assessment of model capabilities and highlights areas where even the strongest models still fall short of human performance.

Tags: large models, AI evaluation, video understanding, multimodal benchmark, non-linear scoring
Written by Machine Heart

Professional AI media and industry service platform