WorldSense: A New Benchmark for Evaluating Multimodal Large Models in Real‑World Scenarios
WorldSense, a new benchmark of 1,662 real‑world video‑audio clips and 3,172 QA pairs across 26 cognitive tasks, reveals that current multimodal large models achieve only 25%–48% accuracy, highlighting the crucial role of combined visual‑audio input and the difficulty of audio‑ and emotion‑related reasoning.
WorldSense is the first benchmark dataset jointly released by Xiaohongshu and Shanghai Jiao Tong University for assessing the full‑modal (video + audio) understanding of multimodal large language models (MLLMs) in real‑world scenes.
The benchmark contains 1,662 synchronized video‑audio clips covering eight major domains and 67 fine‑grained sub‑categories, together with 3,172 multiple‑choice QA pairs spanning 26 cognitive tasks such as object recognition, sound identification, causal reasoning, and abstract concept understanding.
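To make the dataset layout concrete, here is a minimal sketch of what one benchmark item might look like as a record. The field names and the sample values are illustrative assumptions, not WorldSense's actual schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WorldSenseQA:
    """One multiple-choice QA item tied to a synchronized video-audio clip.
    Field names here are hypothetical, chosen to mirror the benchmark's
    described structure (8 domains, 67 sub-categories, 26 tasks)."""
    clip_id: str        # references one of the 1,662 video-audio clips
    domain: str         # one of the eight major domains
    task: str           # one of the 26 cognitive tasks
    question: str
    options: List[str]  # multiple-choice answer candidates
    answer: str         # gold option label, e.g. "B"

# A made-up example item for illustration only.
sample = WorldSenseQA(
    clip_id="clip_0001",
    domain="daily life",
    task="sound identification",
    question="What produces the repeated clicking sound in the clip?",
    options=["A. A keyboard", "B. A metronome", "C. Rain", "D. A clock"],
    answer="B",
)
print(sample.task)  # sound identification
```

A structure like this makes it easy to slice results by domain or task when reporting per-category accuracy.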
Extensive evaluation of various state‑of‑the‑art MLLMs shows that open‑source video‑audio models achieve only about 25% accuracy—close to random guessing—while the best proprietary model, Gemini 1.5 Pro, reaches merely 48%, far below the reliability required for real‑world applications.
Key findings include:
Full‑modal collaboration (visual + audio) is essential; performance drops ~15% when any modality is missing.
Audio‑related tasks and emotion‑related tasks are the most challenging for current models.
Visual information generally improves accuracy (e.g., Gemini 1.5 Pro rises from 34.6% with audio‑only to 48.0% with added video frames), but the effect varies across models.
Original audio provides richer cues than subtitles, especially for tasks involving tone, emotion, or environmental sounds.
Increasing video‑frame sampling density usually boosts performance, though some models (e.g., LLaMA‑3.2) do not benefit.
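The ablation findings above reduce to comparing multiple-choice accuracy across input-modality conditions. A minimal sketch of that comparison, where the condition labels and the `pred`/`gold` keys are assumptions for illustration rather than the benchmark's actual evaluation API:

```python
from collections import defaultdict

def accuracy_by_modality(predictions):
    """Compute multiple-choice accuracy per input-modality condition.

    `predictions` is a list of dicts with hypothetical keys:
      modality: e.g. "audio_only", "video_only", or "audio_video"
      pred / gold: option labels such as "A".."D"
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for p in predictions:
        total[p["modality"]] += 1
        correct[p["modality"]] += int(p["pred"] == p["gold"])
    return {m: correct[m] / total[m] for m in total}

# Toy run illustrating the reported pattern: the full-modality
# condition answers more items correctly than audio alone.
runs = [
    {"modality": "audio_only",  "pred": "A", "gold": "B"},
    {"modality": "audio_only",  "pred": "C", "gold": "C"},
    {"modality": "audio_video", "pred": "B", "gold": "B"},
    {"modality": "audio_video", "pred": "D", "gold": "D"},
]
print(accuracy_by_modality(runs))  # {'audio_only': 0.5, 'audio_video': 1.0}
```

Grouping per condition this way is what lets a report state figures such as "34.6% audio-only vs. 48.0% with added video frames" for a single model.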
All QA pairs were manually annotated by 80 experts and further verified using automated MLLM validation techniques to ensure high quality and reliability.
The paper describing WorldSense is available at https://arxiv.org/abs/2502.04326, and the project homepage is https://jaaackhongggg.github.io/WorldSense/.
Xiaohongshu Tech REDtech
Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.