WorldSense: A New Benchmark for Evaluating Multimodal Large Models in Real‑World Scenarios
WorldSense, a new benchmark of 1,662 real‑world video‑audio clips and 3,172 QA pairs across 26 cognitive tasks, reveals that current multimodal large models achieve only 25%–48% accuracy, highlighting the crucial role of combined visual‑audio input and the difficulty of audio‑ and emotion‑related reasoning.
WorldSense is the first benchmark dataset jointly released by Xiaohongshu and Shanghai Jiao Tong University for assessing the full‑modal (video + audio) understanding of multimodal large language models (MLLMs) in real‑world scenes.
The benchmark contains 1,662 synchronized video‑audio clips covering eight major domains and 67 fine‑grained sub‑categories, together with 3,172 multiple‑choice QA pairs spanning 26 cognitive tasks such as object recognition, sound identification, causal reasoning, and abstract concept understanding.
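To make the dataset layout concrete, here is a minimal sketch of what one benchmark item might look like as a record. The field names and the sample values are illustrative assumptions, not WorldSense's actual schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WorldSenseQA:
    """One multiple-choice QA item tied to a synchronized video-audio clip.
    Field names here are hypothetical, chosen to mirror the benchmark's
    described structure (8 domains, 67 sub-categories, 26 tasks)."""
    clip_id: str        # references one of the 1,662 video-audio clips
    domain: str         # one of the eight major domains
    task: str           # one of the 26 cognitive tasks
    question: str
    options: List[str]  # multiple-choice answer candidates
    answer: str         # gold option label, e.g. "B"

# A made-up example item for illustration only.
sample = WorldSenseQA(
    clip_id="clip_0001",
    domain="daily life",
    task="sound identification",
    question="What produces the repeated clicking sound in the clip?",
    options=["A. A keyboard", "B. A metronome", "C. Rain", "D. A clock"],
    answer="B",
)
print(sample.task)  # sound identification
```

A structure like this makes it easy to slice results by domain or task when reporting per-category accuracy.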
Extensive evaluation of various state‑of‑the‑art MLLMs shows that open‑source video‑audio models achieve only about 25% accuracy—close to random guessing—while the best proprietary model, Gemini 1.5 Pro, reaches merely 48%, far below the reliability required for real‑world applications.
Key findings include:
Full‑modal collaboration (visual + audio) is essential; performance drops ~15% when any modality is missing.
Audio‑related tasks and emotion‑related tasks are the most challenging for current models.
Visual information generally improves accuracy (e.g., Gemini 1.5 Pro rises from 34.6% with audio‑only to 48.0% with added video frames), but the effect varies across models.
Original audio provides richer cues than subtitles, especially for tasks involving tone, emotion, or environmental sounds.
Increasing video‑frame sampling density usually boosts performance, though some models (e.g., LLaMA‑3.2) do not benefit.
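The ablation findings above reduce to comparing multiple-choice accuracy across input-modality conditions. A minimal sketch of that comparison, where the condition labels and the `pred`/`gold` keys are assumptions for illustration rather than the benchmark's actual evaluation API:

```python
from collections import defaultdict

def accuracy_by_modality(predictions):
    """Compute multiple-choice accuracy per input-modality condition.

    `predictions` is a list of dicts with hypothetical keys:
      modality: e.g. "audio_only", "video_only", or "audio_video"
      pred / gold: option labels such as "A".."D"
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for p in predictions:
        total[p["modality"]] += 1
        correct[p["modality"]] += int(p["pred"] == p["gold"])
    return {m: correct[m] / total[m] for m in total}

# Toy run illustrating the reported pattern: the full-modality
# condition answers more items correctly than audio alone.
runs = [
    {"modality": "audio_only",  "pred": "A", "gold": "B"},
    {"modality": "audio_only",  "pred": "C", "gold": "C"},
    {"modality": "audio_video", "pred": "B", "gold": "B"},
    {"modality": "audio_video", "pred": "D", "gold": "D"},
]
print(accuracy_by_modality(runs))  # {'audio_only': 0.5, 'audio_video': 1.0}
```

Grouping per condition this way is what lets a report state figures such as "34.6% audio-only vs. 48.0% with added video frames" for a single model.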
All QA pairs were manually annotated by 80 experts and further verified using automated MLLM validation techniques to ensure high quality and reliability.
The paper describing WorldSense is available at https://arxiv.org/abs/2502.04326, and the project homepage is https://jaaackhongggg.github.io/WorldSense/.
Xiaohongshu Tech REDtech
Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.