Tagged articles
3 articles
SuanNi
Mar 2, 2026 · Artificial Intelligence

Why Leading AI Models Flunk the New ‘Humanity’s Last Exam’ Benchmark

The newly released Humanity’s Last Exam (HLE) benchmark, a multimodal set of 2,500 rigorously crafted questions spanning more than 100 disciplines, exposes serious shortcomings in leading AI models. Their accuracy stays below 50%, and they are poorly calibrated, often expressing high confidence in wrong answers. The results highlight the urgent need for deeper AI evaluation.

Artificial Intelligence · Humanity's Last Exam · Multimodal Evaluation
13 min read
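The HLE summary mentions calibration error without defining it. One common way to quantify it is a binned root-mean-square (RMS) calibration error: group predictions by the model's stated confidence, compare each bin's average confidence with its actual accuracy, and take the share-weighted RMS of those gaps. The sketch below assumes equal-width bins and a hypothetical `num_bins` parameter; the HLE paper's exact binning scheme may differ, and `rms_calibration_error` is an illustrative name, not HLE's code.

```python
import math
from typing import Sequence


def rms_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """Binned RMS calibration error using equal-width confidence bins (an assumption)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    n = len(confidences)
    # Each bin collects (confidence, is_correct) pairs.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {conf}")
        # Confidence 1.0 would index past the end, so clamp it into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    weighted_sq_gap = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's squared confidence/accuracy gap by its share of predictions.
        weighted_sq_gap += (len(members) / n) * (avg_conf - accuracy) ** 2
    return math.sqrt(weighted_sq_gap)
```

For example, a model that reports 90% confidence on four questions but answers only two correctly lands in a single bin with a gap of 0.4, so `rms_calibration_error([0.9] * 4, [True, False, True, False])` returns approximately 0.4, while a perfectly calibrated model scores 0.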
Tencent Technical Engineering
Jun 30, 2025 · Artificial Intelligence

How iMatch Won CVPR2025 NTIRE Image-Text Alignment: Techniques & Benchmarks

The IH-VQA team’s iMatch solution won the CVPR2025 NTIRE Image-Text Alignment challenge by combining dual-model fusion, pseudo-label data augmentation, Q-Align probability mapping, and visual augmentations. The accompanying paper also presents a comprehensive iMatch benchmark that evaluates 23 state-of-the-art text-to-image models across multiple resolutions.

AI quality assessment · CVPR2025 · Multimodal Evaluation
15 min read
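The iMatch summary names Q-Align probability mapping but does not explain it. In Q-Align, a multimodal model rates an image with one of five discrete level words; instead of taking the single most likely word, the logits of the five level tokens are passed through a softmax restricted to those tokens, and the score is the probability-weighted average of the level values. The sketch below assumes the levels "excellent" through "bad" are weighted 5 down to 1; how iMatch applies this mapping in detail is not covered by the summary, and `qalign_score` and `LEVEL_WEIGHTS` are hypothetical names.

```python
import math
from typing import Mapping

# Assumed rating levels and their numeric values (Q-Align-style five-level scale).
LEVEL_WEIGHTS: dict[str, float] = {
    "excellent": 5.0,
    "good": 4.0,
    "fair": 3.0,
    "poor": 2.0,
    "bad": 1.0,
}


def qalign_score(level_logits: Mapping[str, float]) -> float:
    """Map logits of the five level tokens to a continuous score in [1, 5].

    Hypothetical helper: softmax over only the level tokens, then the
    probability-weighted mean of the level values.
    """
    missing = set(LEVEL_WEIGHTS) - set(level_logits)
    if missing:
        raise ValueError(f"missing logits for levels: {sorted(missing)}")
    # Subtract the max logit for numerical stability before exponentiating.
    peak = max(level_logits[level] for level in LEVEL_WEIGHTS)
    exps = {level: math.exp(level_logits[level] - peak) for level in LEVEL_WEIGHTS}
    total = sum(exps.values())
    return sum(LEVEL_WEIGHTS[level] * exps[level] / total for level in LEVEL_WEIGHTS)
```

With equal logits for all five levels the score is 3.0 (the midpoint), and a logit strongly favoring "excellent" pushes it toward 5.0. The weighted average yields finer-grained scores than picking the top token alone.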
Sohu Tech Products
Jul 31, 2024 · Artificial Intelligence

MMEvalPro: A Trustworthy Benchmark for Evaluating Multimodal Large Models

MMEvalPro, a new benchmark from researchers at Peking University, the Chinese Academy of Medical Sciences, CUHK, and Alibaba, augments existing multimodal datasets with perception and knowledge questions and introduces a Genuine Accuracy metric. Its results show that top multimodal models still lag far behind humans and that their scores on earlier benchmarks were partly driven by shortcuts.

Large Language Models · MMEvalPro · Multimodal Evaluation
11 min read
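The MMEvalPro summary names Genuine Accuracy without defining it. A minimal sketch, assuming (as we understand MMEvalPro's design) that each original question is paired with one perception question and one knowledge question, and that a triplet counts as solved only when all three are answered correctly. The `TripletResult` type and both function names are hypothetical; per-question average accuracy is included only for contrast.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TripletResult:
    """Hypothetical per-item record: was each question in the triplet answered correctly?"""
    original: bool
    perception: bool
    knowledge: bool


def genuine_accuracy(results: Sequence[TripletResult]) -> float:
    """Fraction of triplets where all three questions are answered correctly."""
    if not results:
        raise ValueError("need at least one triplet")
    solved = sum(1 for r in results if r.original and r.perception and r.knowledge)
    return solved / len(results)


def average_accuracy(results: Sequence[TripletResult]) -> float:
    """Plain per-question accuracy over all questions, for comparison."""
    if not results:
        raise ValueError("need at least one triplet")
    # bool values add as 0/1, so this counts correct answers across all three slots.
    correct = sum(r.original + r.perception + r.knowledge for r in results)
    return correct / (3 * len(results))
```

For four triplets where each of the last three misses a different question, `[TripletResult(True, True, True), TripletResult(True, False, True), TripletResult(True, True, False), TripletResult(False, True, True)]`, average accuracy is 0.75 but Genuine Accuracy is only 0.25. The gap illustrates how a model can answer an original question correctly without the perception or knowledge that should support the answer.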