CrossVid: The New Benchmark Exposing AI’s Struggle with Cross‑Video Reasoning

CrossVid is an open‑source benchmark that evaluates multimodal large language models on cross‑video reasoning, offering 5,331 videos and 9,015 high‑quality QA pairs across four reasoning dimensions, and revealing that even the strongest models reach only about 50% accuracy, far below the human baseline of roughly 89%.

Background

Current multimodal large language models (MLLMs) such as Qwen‑VL and Gemini‑3 excel at single‑video understanding but lack true cross‑video reasoning, i.e., the ability to compare, sequence, and aggregate information across multiple videos.

CrossVid Benchmark

CrossVid is a fully open‑source benchmark designed to evaluate cross‑video reasoning. It contains 5,331 curated videos covering 32 topics and 9,015 high‑quality question‑answer pairs. The benchmark defines four high‑level dimensions (comparison analysis, temporal understanding, multi‑view reasoning, and free‑form QA) and ten concrete tasks such as action understanding, narrative comprehension, cooking comparison, and multi‑view counting.
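
The article does not show the benchmark's underlying data format. As a purely illustrative sketch, a cross‑video QA item might be organized along these lines; every field name here is hypothetical rather than CrossVid's actual schema:

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical layout of a CrossVid-style QA item; field names are
# illustrative only, not the benchmark's published format.
@dataclass
class CrossVidItem:
    question_id: str
    dimension: str           # e.g. "comparison analysis", "temporal understanding",
                             # "multi-view reasoning", or "free-form QA"
    task: str                # e.g. "cooking comparison", "multi-view counting"
    video_paths: List[str]   # two or more videos the question spans
    question: str
    options: Optional[List[str]] = None   # present for multiple-choice questions
    answer: str = ""

item = CrossVidItem(
    question_id="example-0001",
    dimension="comparison analysis",
    task="cooking comparison",
    video_paths=["videos/recipe_a.mp4", "videos/recipe_b.mp4"],
    question="Which recipe adds the sauce before frying the vegetables?",
    options=["Recipe A", "Recipe B", "Both", "Neither"],
    answer="Recipe A",
)
print(item.dimension, len(item.video_paths))
```

The essential property is that each question references several videos, so no single clip contains enough evidence to answer it.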

CrossVid overview

Data Construction Pipeline

The dataset was built in three core steps (a schematic code sketch follows the list):

AI‑generated descriptions: Qwen2.5‑VL‑72B performed dense frame captioning on all video frames, producing detailed visual context. DeepSeek‑R1 then generated challenging cross‑video QA pairs using prompts that forced the model to output the reasoning process.

Expert refinement: Ten professional annotators performed three rounds of cleaning: Filtration (removing questions answerable from a single video), Refinement (eliminating ambiguity and improving answer options), and Anti‑Shortcut design (temporal realignment of video clips to prevent models from guessing order based on visual cues).

Manual labeling for multi‑view tasks: For tasks requiring precise spatial reasoning (e.g., counting vehicles from drone views), annotators manually marked object coordinates and relationships using a custom dual‑view annotation tool.
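
As a rough, end‑to‑end illustration of how these three steps compose, the following sketch chains hypothetical stubs in place of the real components (Qwen2.5‑VL‑72B captioning, DeepSeek‑R1 QA generation, and the human annotation rounds); it is not the authors' pipeline code:

```python
# Every function below is a hypothetical stand-in for a component described
# in the article; the real pipeline uses large models and human annotators.

def caption_frames(video):                      # step 1a: dense frame captioning
    return f"dense captions for {video}"

def generate_qa(captions):                      # step 1b: cross-video QA generation with reasoning
    return [{"question": "Which video shows the key action first?",
             "videos": list(captions), "task": "temporal ordering"}]

def answerable_from_single_video(qa):           # step 2a: filtration criterion
    return len(qa["videos"]) < 2

def refine_options(qa):                         # step 2b: remove ambiguity, improve options
    return qa

def realign_clips(qa):                          # step 2c: anti-shortcut temporal realignment
    return qa

def annotate_dual_view(videos):                 # step 3: manual spatial labels (multi-view tasks)
    return {"objects": [], "videos": videos}

def build_cross_video_qa(video_group):
    captions = {v: caption_frames(v) for v in video_group}
    candidates = generate_qa(captions)
    candidates = [q for q in candidates if not answerable_from_single_video(q)]
    candidates = [realign_clips(refine_options(q)) for q in candidates]
    for q in candidates:
        if q["task"].startswith("multi-view"):
            q["annotations"] = annotate_dual_view(q["videos"])
    return candidates

print(build_cross_video_qa(["clip_a.mp4", "clip_b.mp4"]))
```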

Data pipeline

Evaluation Results

Twenty‑two mainstream MLLMs were evaluated, including closed‑source models (GPT‑4.1, Gemini‑2.5‑Pro) and open‑source models (Qwen2.5‑VL, InternVL3). The best performer, Gemini‑2.5‑Pro, achieved an average accuracy of 50.4%, far below the human baseline of 89.2%. Performance drops are sharpest on multi‑view reasoning and temporal understanding (e.g., humans score 85.2% versus the best model's 13.4% on action alignment).
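
The exact scoring protocol is not spelled out in this summary. As a minimal sketch, per‑dimension accuracy and an overall average could be computed from prediction records like the made‑up ones below; whether CrossVid macro‑averages over dimensions, tasks, or individual questions is an assumption here:

```python
from collections import defaultdict

# Made-up prediction records; in practice these would come from running an
# MLLM on the benchmark's multiple-choice questions.
predictions = [
    {"dimension": "temporal understanding", "gold": "B", "pred": "B"},
    {"dimension": "temporal understanding", "gold": "A", "pred": "C"},
    {"dimension": "multi-view reasoning",   "gold": "D", "pred": "A"},
    {"dimension": "comparison analysis",    "gold": "C", "pred": "C"},
]

correct, total = defaultdict(int), defaultdict(int)
for rec in predictions:
    total[rec["dimension"]] += 1
    correct[rec["dimension"]] += rec["gold"] == rec["pred"]

per_dim = {d: correct[d] / total[d] for d in total}
average = sum(per_dim.values()) / len(per_dim)   # macro-average over dimensions (assumed)
print(per_dim, f"average={average:.1%}")
```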

Model performance

Error Analysis

Four dominant failure modes were identified:

Key‑frame loss: Compressing multiple videos into a shared context forces models to drop critical visual details (a toy illustration follows this list).

Video understanding error: Misinterpretation of a single video propagates errors when aggregating across videos.

Cross‑video comparison error: Models often hallucinate or break logical chains when required to compare information from different videos.

Inability to aggregate distributed evidence: Current architectures tend to process videos independently rather than fusing dispersed clues.
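
To make the first failure mode concrete, the toy calculation below shows one plausible mechanism: with a fixed per‑prompt frame budget, each video is sampled more coarsely as more videos are packed in. The 64‑frame budget and 3,000‑frame video length are arbitrary numbers chosen for this sketch:

```python
# Toy illustration of key-frame loss under a fixed frame budget per prompt.
def sampled_frame_indices(video_len_frames, budget):
    step = max(1, video_len_frames // budget)        # uniform subsampling stride
    return list(range(0, video_len_frames, step))[:budget]

total_budget = 64                                    # frames the model can ingest per prompt (assumed)
for num_videos in (1, 2, 4):
    per_video = total_budget // num_videos
    idx = sampled_frame_indices(video_len_frames=3000, budget=per_video)
    print(f"{num_videos} video(s): {per_video} frames each, "
          f"~{3000 // len(idx)} frames skipped between samples")
```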

Error categories

Future Directions

Suggested research avenues include:

More efficient long‑context handling to accommodate many video frames.

Model architectures explicitly designed for cross‑video information exchange (a toy sketch follows this list).

Leveraging the CrossVid dataset to train models that truly understand groups of videos.
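
As a toy sketch of what cross‑video information exchange could mean architecturally, the module below lets frame tokens from two videos attend to each other instead of being encoded independently. The dimensions and the single attention layer are arbitrary choices for illustration, not a design proposed by the CrossVid authors (PyTorch is assumed):

```python
import torch
import torch.nn as nn

class CrossVideoFusion(nn.Module):
    """Joint attention over frame tokens from two videos (illustrative only)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens_a, tokens_b):
        # tokens_*: (batch, frames, dim) frame-level features from each video.
        joint = torch.cat([tokens_a, tokens_b], dim=1)   # pool both videos into one sequence
        fused, _ = self.attn(joint, joint, joint)        # every frame can attend across videos
        return self.norm(joint + fused)

fusion = CrossVideoFusion()
a = torch.randn(1, 16, 256)    # 16 frame tokens from video A
b = torch.randn(1, 16, 256)    # 16 frame tokens from video B
print(fusion(a, b).shape)      # torch.Size([1, 32, 256])
```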

Repository

CrossVid code and data are fully open‑source at https://github.com/chuntianli666/CrossVid.

Tags: AI evaluation, video understanding, cross-video reasoning