CrossVid: A New Benchmark Reveals the Limits of Multimodal LLMs in Cross‑Video Reasoning
CrossVid is an open‑source benchmark that evaluates multimodal large language models on cross‑video reasoning. It provides 5,331 videos and 9,015 QA pairs organized into four high‑level dimensions and ten specific tasks, and it exposes a significant performance gap between current models and humans.
Background
Multimodal large language models (MLLMs) such as Qwen3‑VL and Gemini‑3 achieve strong performance on single‑video understanding, but they struggle with cross‑video reasoning: the ability to compare, aggregate, and infer across multiple videos. This capability is required for tasks such as ingredient comparison, storyline stitching, and multi‑view analysis.
CrossVid Benchmark
CrossVid is an open‑source benchmark designed to evaluate cross‑video reasoning. It contains 5,331 curated videos covering 32 topics and provides 9,015 high‑quality question‑answer pairs. The benchmark defines four high‑level dimensions (comparison analysis, temporal understanding, multi‑view reasoning, and free‑form QA) and ten concrete tasks, including behavior understanding, narrative comprehension, cooking comparison, and multi‑view counting.
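For concreteness, a single benchmark item can be pictured roughly as below. This is a minimal sketch with hypothetical field names and values; it does not reflect CrossVid's actual file format.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical schema for one cross-video QA item; field names are illustrative
# and are not taken from the CrossVid release.
@dataclass
class CrossVideoQA:
    question_id: str
    dimension: str                 # e.g. "comparison analysis", "multi-view reasoning"
    task: str                      # e.g. "cooking comparison", "multi-view counting"
    video_paths: List[str]         # two or more videos that must be reasoned over jointly
    question: str
    options: Optional[List[str]]   # present for multiple-choice tasks
    answer: str                    # option label or free-form reference answer

example = CrossVideoQA(
    question_id="demo-0001",
    dimension="comparison analysis",
    task="cooking comparison",
    video_paths=["recipe_a.mp4", "recipe_b.mp4"],
    question="Which ingredient appears in the first video but not in the second?",
    options=["A. Butter", "B. Garlic", "C. Soy sauce", "D. None of the above"],
    answer="B",
)
```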
Data Construction Pipeline
AI‑generated descriptions: Video frames are densely captioned with Qwen2.5‑VL‑72B. The captions and original metadata are then fed to DeepSeek‑R1 with carefully crafted prompts to generate challenging cross‑video QA pairs (a sketch of this two‑stage step follows this list).
Expert refinement: Ten professional annotators perform three rounds of cleaning:
Filtration: remove questions solvable with a single video.
Refinement: eliminate ambiguity and improve answer options.
Anti‑Shortcut: temporally realign video segments to prevent models from guessing based on low‑level visual cues.
Manual labeling for multi‑view tasks: For spatially demanding tasks (e.g., drone‑view vehicle counting), annotators use a custom dual‑view tool to manually mark object coordinates and relationships.
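A minimal sketch of the caption‑then‑generate step described above, assuming a generic text‑generation helper for both models; the `chat` wrapper, function names, and prompts are illustrative stand‑ins, not the authors' actual pipeline code.

```python
from typing import List

def chat(model: str, prompt: str) -> str:
    """Hypothetical helper that sends a prompt to a model endpoint and returns text."""
    raise NotImplementedError("wire this to your own inference endpoint")

def caption_video(frames: List[str]) -> str:
    """Densely caption sampled frames of one video (stand-in for Qwen2.5-VL-72B)."""
    prompt = "Describe each frame in detail, preserving temporal order:\n" + "\n".join(frames)
    return chat("qwen2.5-vl-72b", prompt)

def generate_cross_video_qa(captions: List[str], metadata: List[dict]) -> str:
    """Ask a reasoning model (stand-in for DeepSeek-R1) for a QA pair that
    requires evidence from every video in the group, not just one."""
    context = "\n\n".join(
        f"Video {i + 1} ({m.get('title', 'untitled')}):\n{c}"
        for i, (c, m) in enumerate(zip(captions, metadata))
    )
    prompt = (
        "You are given descriptions of several related videos.\n\n"
        + context
        + "\n\nWrite one challenging multiple-choice question that can only be "
          "answered by comparing or aggregating information across ALL videos, "
          "then give the correct answer."
    )
    return chat("deepseek-r1", prompt)
```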
Evaluation
Twenty‑two mainstream MLLMs were evaluated, including closed‑source models (GPT‑4.1, Gemini‑2.5‑Pro) and open‑source models (Qwen2.5‑VL, InternVL3, etc.). The best performer, Gemini‑2.5‑Pro, achieved an average accuracy of 50.4%, far below the human average of 89.2%. On the most difficult task, action alignment, the strongest model scored only 13.4% compared with 85.2% for humans, highlighting severe gaps in multi‑view reasoning and temporal understanding.
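For reference, here is a sketch of how per‑task and average accuracy can be computed for such multiple‑choice evaluations; the `predict` callback and item fields are assumptions, not the paper's evaluation harness.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable

def evaluate(items: Iterable[dict], predict: Callable[[dict], str]) -> Dict[str, float]:
    """Score multiple-choice predictions per task, plus a macro average over tasks.
    Each item is assumed to carry a 'task' field and a gold 'answer' option label."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        pred = predict(item)  # e.g. a model answering from videos + question + options
        total[item["task"]] += 1
        if pred.strip().upper() == item["answer"].strip().upper():
            correct[item["task"]] += 1
    scores = {task: correct[task] / total[task] for task in total}
    scores["average"] = sum(scores[t] for t in total) / len(total)
    return scores
```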
Failure Analysis
Key‑frame loss: Compressing multiple videos into one context forces a reduction in the per‑video frame count, causing loss of critical details (see the sketch after this list).
Video understanding errors: Misinterpretations in a single video propagate when aggregating information across videos.
Cross‑video comparison failures: Models may hallucinate or lose logical coherence when asked to compare content from different videos.
Inability to integrate distributed evidence: Current architectures tend to process each video independently rather than fusing clues from multiple sources.
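To see why key‑frame loss bites, consider a fixed frame budget shared evenly across a video group; the budget size and sampling scheme below are illustrative assumptions, not numbers from the paper.

```python
def sampled_frame_indices(num_frames: int, num_videos: int, budget: int = 256) -> list:
    """Uniformly sample frame indices when a fixed total frame budget is split
    evenly across all videos in a group (illustrative assumption)."""
    per_video = max(budget // num_videos, 1)
    step = max(num_frames / per_video, 1.0)
    return [int(i * step) for i in range(per_video)]

# A single 10-minute video at 30 fps (18,000 frames) keeps 256 sampled frames,
# but each video in a 4-video group keeps only 64, so brief key moments
# (e.g. one cooking step) can fall between sampled frames.
print(len(sampled_frame_indices(18_000, num_videos=1)))  # 256
print(len(sampled_frame_indices(18_000, num_videos=4)))  # 64
```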
Future Directions
Develop more efficient long‑context mechanisms to handle larger frame sequences.
Design architectures specifically optimized for cross‑video reasoning, enhancing inter‑video information exchange.
Leverage the CrossVid dataset to train models that understand “video groups.”
Resources
Paper: https://arxiv.org/abs/2511.12263
Code and data: https://github.com/chuntianli666/CrossVid