CrossVid: A New Benchmark Reveals the Limits of Multimodal LLMs in Cross‑Video Reasoning
CrossVid is an open‑source benchmark that evaluates multimodal large language models on cross‑video reasoning. It provides 5,331 videos and 9,015 QA pairs organized into four high‑level dimensions and ten specific tasks, and it exposes a significant performance gap between current models and humans.
Background
Multimodal large language models (MLLMs) such as Qwen3‑VL and Gemini‑3 achieve strong performance on single‑video understanding, but they struggle with cross‑video reasoning: the ability to compare, aggregate, and infer across multiple videos. This capability is required for tasks such as ingredient comparison, storyline stitching, and multi‑view analysis.
CrossVid Benchmark
CrossVid is an open‑source benchmark designed to evaluate cross‑video reasoning. It contains 5,331 curated videos covering 32 topics and provides 9,015 high‑quality question‑answer pairs. The benchmark defines four high‑level dimensions (comparison analysis, temporal understanding, multi‑view reasoning, and free‑form QA) and ten concrete tasks, including behavior understanding, narrative comprehension, cooking comparison, and multi‑view counting.
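For concreteness, a single benchmark item can be pictured roughly as below. This is a minimal sketch with hypothetical field names and values; it does not reflect CrossVid's actual file format.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical schema for one cross-video QA item; field names are illustrative
# and are not taken from the CrossVid release.
@dataclass
class CrossVideoQA:
    question_id: str
    dimension: str                 # e.g. "comparison analysis", "multi-view reasoning"
    task: str                      # e.g. "cooking comparison", "multi-view counting"
    video_paths: List[str]         # two or more videos that must be reasoned over jointly
    question: str
    options: Optional[List[str]]   # present for multiple-choice tasks
    answer: str                    # option label or free-form reference answer

example = CrossVideoQA(
    question_id="demo-0001",
    dimension="comparison analysis",
    task="cooking comparison",
    video_paths=["recipe_a.mp4", "recipe_b.mp4"],
    question="Which ingredient appears in the first video but not in the second?",
    options=["A. Butter", "B. Garlic", "C. Soy sauce", "D. None of the above"],
    answer="B",
)
```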
Data Construction Pipeline
AI‑generated descriptions: Video frames are densely captioned with Qwen2.5‑VL‑72B. The captions and original metadata are then fed to DeepSeek‑R1 with carefully crafted prompts to generate challenging cross‑video QA pairs (a sketch of this two‑stage step follows this list).
Expert refinement: Ten professional annotators perform three rounds of cleaning:
Filtration: remove questions solvable with a single video.
Refinement: eliminate ambiguity and improve answer options.
Anti‑Shortcut: temporally realign video segments to prevent models from guessing based on low‑level visual cues.
Manual labeling for multi‑view tasks: For spatially demanding tasks (e.g., drone‑view vehicle counting), annotators use a custom dual‑view tool to manually mark object coordinates and relationships.
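A minimal sketch of the caption‑then‑generate step described above, assuming a generic text‑generation helper for both models; the `chat` wrapper, function names, and prompts are illustrative stand‑ins, not the authors' actual pipeline code.

```python
from typing import List

def chat(model: str, prompt: str) -> str:
    """Hypothetical helper that sends a prompt to a model endpoint and returns text."""
    raise NotImplementedError("wire this to your own inference endpoint")

def caption_video(frames: List[str]) -> str:
    """Densely caption sampled frames of one video (stand-in for Qwen2.5-VL-72B)."""
    prompt = "Describe each frame in detail, preserving temporal order:\n" + "\n".join(frames)
    return chat("qwen2.5-vl-72b", prompt)

def generate_cross_video_qa(captions: List[str], metadata: List[dict]) -> str:
    """Ask a reasoning model (stand-in for DeepSeek-R1) for a QA pair that
    requires evidence from every video in the group, not just one."""
    context = "\n\n".join(
        f"Video {i + 1} ({m.get('title', 'untitled')}):\n{c}"
        for i, (c, m) in enumerate(zip(captions, metadata))
    )
    prompt = (
        "You are given descriptions of several related videos.\n\n"
        + context
        + "\n\nWrite one challenging multiple-choice question that can only be "
          "answered by comparing or aggregating information across ALL videos, "
          "then give the correct answer."
    )
    return chat("deepseek-r1", prompt)
```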
Evaluation
Twenty‑two mainstream MLLMs were evaluated, including closed‑source models (GPT‑4.1, Gemini‑2.5‑Pro) and open‑source models (Qwen2.5‑VL, InternVL3, etc.). The best performer, Gemini‑2.5‑Pro, achieved an average accuracy of 50.4%, far below the human average of 89.2%. On the most difficult task, action alignment, the strongest model scored only 13.4% compared with 85.2% for humans, highlighting severe gaps in multi‑view reasoning and temporal understanding.
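For reference, here is a sketch of how per‑task and average accuracy can be computed for such multiple‑choice evaluations; the `predict` callback and item fields are assumptions, not the paper's evaluation harness.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable

def evaluate(items: Iterable[dict], predict: Callable[[dict], str]) -> Dict[str, float]:
    """Score multiple-choice predictions per task, plus a macro average over tasks.
    Each item is assumed to carry a 'task' field and a gold 'answer' option label."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        pred = predict(item)  # e.g. a model answering from videos + question + options
        total[item["task"]] += 1
        if pred.strip().upper() == item["answer"].strip().upper():
            correct[item["task"]] += 1
    scores = {task: correct[task] / total[task] for task in total}
    scores["average"] = sum(scores[t] for t in total) / len(total)
    return scores
```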
Failure Analysis
Key‑frame loss: Compressing multiple videos into one context forces a reduction in the per‑video frame count, causing loss of critical details (see the sketch after this list).
Video understanding errors: Misinterpretations in a single video propagate when aggregating information across videos.
Cross‑video comparison failures: Models may hallucinate or lose logical coherence when asked to compare content from different videos.
Inability to integrate distributed evidence: Current architectures tend to process each video independently rather than fusing clues from multiple sources.
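To see why key‑frame loss bites, consider a fixed frame budget shared evenly across a video group; the budget size and sampling scheme below are illustrative assumptions, not numbers from the paper.

```python
def sampled_frame_indices(num_frames: int, num_videos: int, budget: int = 256) -> list:
    """Uniformly sample frame indices when a fixed total frame budget is split
    evenly across all videos in a group (illustrative assumption)."""
    per_video = max(budget // num_videos, 1)
    step = max(num_frames / per_video, 1.0)
    return [int(i * step) for i in range(per_video)]

# A single 10-minute video at 30 fps (18,000 frames) keeps 256 sampled frames,
# but each video in a 4-video group keeps only 64, so brief key moments
# (e.g. one cooking step) can fall between sampled frames.
print(len(sampled_frame_indices(18_000, num_videos=1)))  # 256
print(len(sampled_frame_indices(18_000, num_videos=4)))  # 64
```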
Future Directions
Develop more efficient long‑context mechanisms to handle larger frame sequences.
Design architectures specifically optimized for cross‑video reasoning, enhancing inter‑video information exchange.
Leverage the CrossVid dataset to train models that understand “video groups.”
Resources
Paper: https://arxiv.org/abs/2511.12263
Code and data: https://github.com/chuntianli666/CrossVid