Introducing UNO‑Bench: The First Unified Omni‑Modal LLM Evaluation Suite
UNO‑Bench, an open‑source benchmark from Meituan’s LongCat team, provides the first high‑quality, low‑redundancy unified evaluation framework for omni‑modal large language models, featuring 1,250 manually annotated cross‑modal samples and 2,480 enhanced single‑modal samples covering 44 fine‑grained tasks and five modality combinations.
Background
Omni‑modal large language models (LLMs) now aim to jointly understand vision, audio, and text. Existing evaluation suites suffer from data contamination, redundant information, fragmented benchmarks, and little Chinese coverage, which prevents reliable measurement of true cross‑modal fusion ability.
Core Contributions of UNO‑Bench
Unified omni‑modal benchmark – Provides a single framework that evaluates both single‑modality and full‑modality tasks, revealing a combination law: weak models hit a bottleneck, while strong models gain super‑linear synergy.
High‑quality, diverse data pipeline – A hybrid human‑machine workflow yields 1,250 manually curated cross‑modal samples (98% require cross‑modal reasoning) and 2,480 enhanced single‑modality samples, covering 44 fine‑grained task types across five modality combinations. The human‑annotated set is fully bilingual (English/Chinese).
Multi‑step open‑ended (MO) evaluation – Replaces multiple‑choice questions, which collapse performance into a binary right/wrong outcome, with decomposed, inter‑dependent sub‑questions. A universal scoring model supports six question types and attains 95% accuracy on out‑of‑distribution models.
Technical Solution
1. Unified Capability Ontology
A two‑dimensional, seven‑layer ontology integrates single‑modality and omni‑modal abilities.
Perception layer – Object, attribute, scene recognition; spatial relations; cross‑modal conversion; semantic understanding; and explicit cross‑modal alignment.
Reasoning layer – General reasoning (split into commonsense and logical), STEM, code, plus spatial (static/dynamic), temporal, and complex reasoning.
The ontology spans 44 atomic tasks, including seven exclusive omni‑modal tasks such as audio‑visual synchronization, enabling fine‑grained alignment and cross‑dimensional comparability.
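To make the ontology's structure concrete, here is a minimal Python sketch of how the two layers and their modality tags could be organized. The layer names and the handful of task entries are drawn from the text above, but the exact schema and modality assignments are illustrative assumptions, not the benchmark's full 44‑task taxonomy.

```python
# Illustrative two-layer capability ontology; task names and modality tags
# are examples based on the text, NOT the benchmark's exact 44-task taxonomy.
ONTOLOGY = {
    "perception": {
        "object_recognition":    {"modalities": {"vision"}},
        "scene_recognition":     {"modalities": {"vision", "audio"}},
        "spatial_relations":     {"modalities": {"vision"}},
        "cross_modal_alignment": {"modalities": {"vision", "audio"}},  # omni-only
    },
    "reasoning": {
        "commonsense_reasoning": {"modalities": {"text"}},
        "logical_reasoning":     {"modalities": {"text"}},
        "temporal_reasoning":    {"modalities": {"vision", "audio"}},
        "audio_visual_sync":     {"modalities": {"vision", "audio"}},  # omni-only
    },
}

def tasks_for(combo: set[str]) -> list[str]:
    """Return atomic tasks whose required modalities fit a modality combination."""
    return [
        task
        for layer in ONTOLOGY.values()
        for task, meta in layer.items()
        if meta["modalities"] <= combo
    ]

print(tasks_for({"vision", "audio"}))  # tasks evaluable on audio-visual input
```

A structure like this makes cross‑dimensional comparison mechanical: scores can be aggregated per layer, per modality combination, or per atomic task without re‑annotating the data.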
2. Forced Cross‑Modal Data Construction Protocol
To avoid data leakage and improve coverage, the protocol emphasizes:
Comprehensiveness – supplement low‑coverage perception questions and add reasoning items, especially video‑audio combinations.
Diversity – enrich material types not covered by self‑constructed data.
Quality – ensure reasonable and accurate single‑modality answers.
Discriminability – remove overly difficult or low‑discriminative subsets.
For large‑scale dataset compression, a clustering‑guided hierarchical sampling (CGHS) method is used. CGHS clusters questions using K‑means++ on model‑performance vectors and samples proportionally from each cluster, preserving key samples while reducing redundancy.
CGHS Steps:
1. Problem Representation: Encode each problem as a vector of per‑model scores, one dimension per evaluated model.
2. Cluster Hierarchy: Apply K‑means++ to form k clusters representing similar model behavior.
3. Hierarchical Sampling: Determine sample counts per cluster based on size, then randomly sample.
4. Validation: Use SRCC, PLCC, RMSE, MoE, and CIC metrics to verify compression quality.
Five random splits with 10‑fold cross‑validation show that CGHS reduces evaluation cost by ~90% while maintaining 98% consistency.
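As a concrete sketch of steps 1–4, the following Python uses scikit‑learn's K‑means++ and scipy's correlation tests; the function names and the `score_matrix` layout (one row per question, one column per model) are my own assumptions, and the MoE and CIC checks are omitted for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import spearmanr, pearsonr

def cghs_compress(score_matrix: np.ndarray, k: int, ratio: float, seed: int = 0):
    """Clustering-guided hierarchical sampling (sketch).

    score_matrix: (n_questions, n_models) array; each row is a question's
    per-model score vector (step 1, problem representation).
    k: number of K-means++ clusters of similar model behavior (step 2).
    ratio: overall fraction of questions to keep (step 3).
    """
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=seed).fit_predict(score_matrix)
    keep = []
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        n_keep = max(1, round(ratio * len(idx)))  # proportional to cluster size
        keep.extend(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.asarray(keep))

def validate_subset(full_scores: np.ndarray, subset_idx: np.ndarray) -> dict:
    """Step 4 (partial): check model rankings on the subset track the full set."""
    full_means = full_scores.mean(axis=0)             # per-model mean, all questions
    sub_means = full_scores[subset_idx].mean(axis=0)  # per-model mean, sampled subset
    return {
        "SRCC": spearmanr(full_means, sub_means).statistic,
        "PLCC": pearsonr(full_means, sub_means).statistic,
        "RMSE": float(np.sqrt(np.mean((full_means - sub_means) ** 2))),
    }
```

A ratio around 0.1 matches the reported ~90% cost reduction; SRCC/PLCC agreement on held‑out splits is the kind of consistency check behind the reported 98% figure.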
3. Multi‑Step Open‑Ended Questions (MO)
MO questions decompose a complex task into two or more inter‑dependent sub‑questions, assigning scores that sum to 10 points. Models must generate step‑by‑step open‑ended answers, allowing precise quantification of reasoning‑chain progress.
The universal scoring model, built on Qwen‑3‑14B, defines six question types (numeric, enumeration, judgment, short answer, essay, multi‑choice) with dedicated scoring criteria, achieving 95% accuracy on OOD models.
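A minimal sketch of what an MO item and its rubric scoring could look like follows; the schema, field names, and the `judge` callable (standing in for the Qwen‑3‑14B scoring model) are illustrative assumptions, not the benchmark's actual data format or prompts.

```python
from dataclasses import dataclass

QTYPES = {"numeric", "enumeration", "judgment", "short_answer", "essay", "multi_choice"}

@dataclass
class SubQuestion:
    prompt: str       # one step of the decomposed task
    qtype: str        # one of the six types the scoring model supports
    reference: str    # gold answer for this step
    points: float     # partial credit carried by this step

@dataclass
class MOItem:
    context: str                # shared cross-modal context (e.g., video/audio refs)
    steps: list[SubQuestion]

    def __post_init__(self):
        # Per the scheme above, step scores must sum to 10 points.
        assert abs(sum(s.points for s in self.steps) - 10.0) < 1e-6

def score_item(item: MOItem, answers: list[str], judge) -> float:
    """Sum partial credit over steps. `judge(qtype, reference, answer)` stands in
    for the scoring model and returns a correctness ratio in [0, 1]."""
    total = 0.0
    for step, answer in zip(item.steps, answers):
        assert step.qtype in QTYPES
        total += step.points * judge(step.qtype, step.reference, answer)
    return total  # out of 10
```

Because each step is graded independently, a model that completes three of four steps earns proportional credit instead of the all‑or‑nothing score a multiple‑choice item would give.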
Empirical Findings
1. Scaling Law for Omni‑Modal Models
Fitting full‑modality scores as a power law of single‑modality performance yields:
γ ≈ 2.19 > 1 – indicates super‑linear synergy for strong models ("1 + 1 ≫ 2").
C ≈ 1.03 – shows internal consistency of the benchmark.
b ≈ 0.24 – matches random‑guess baseline, confirming reasonable difficulty.
The law quantifies modal‑fusion efficiency; models deviating from the curve reveal structural deficiencies.
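The post reports the fitted parameters but not the equation itself, so the sketch below assumes a form y = C·xᵞ + b mapping single‑modality performance x to full‑modality performance y, chosen to be consistent with the three reported values; treat it as a plausible reconstruction rather than the paper's definition.

```python
import numpy as np
from scipy.optimize import curve_fit

def combination_law(x, C, gamma, b):
    # Assumed functional form; see the caveat above.
    return C * np.power(x, gamma) + b

def fit_combination_law(single_scores, omni_scores):
    """Fit on (single-modality, full-modality) score pairs normalized to [0, 1].

    gamma > 1 indicates super-linear synergy for strong models; b should land
    near the random-guess baseline if difficulty is well calibrated.
    """
    popt, _ = curve_fit(
        combination_law,
        np.asarray(single_scores, dtype=float),
        np.asarray(omni_scores, dtype=float),
        p0=(1.0, 2.0, 0.25),   # start near the reported values
        maxfev=10_000,
    )
    return dict(zip(("C", "gamma", "b"), popt))
```

Under this reading, a model sitting below the fitted curve converts its single‑modality competence into cross‑modal performance less efficiently than its peers, which is the concrete sense in which deviation from the curve flags a structural deficiency.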
2. Perception vs. Reasoning Gap
Average perception scores (47.3) exceed reasoning scores (38.6). The best reasoning performance (Gemini‑2.5‑Pro, 45.00) still lags behind perception, highlighting weaknesses in physical‑world understanding.
3. Open‑Source vs. Closed‑Source Gap
Open‑source model Qwen‑3‑Omni‑30B scores 37.41 on reasoning, 33 points behind Gemini‑2.5‑Pro (70.41), a larger gap than in perception, confirming reasoning as the primary differentiator.
4. Human‑Machine Comparison
Expert evaluation shows perception alignment (Gemini‑2.5‑Pro ≈ human) but reasoning remains inferior (human 81.25 vs. model 70.41), exposing challenges in abstract concept transfer and counterfactual reasoning.
5. Modality Ablation Experiments
Vision ablation – Adding textual descriptions boosts Qwen‑3‑Omni‑30B by >20 points, demonstrating strong scene reconstruction from text. However, raw visual input still outperforms description‑only for Gemini‑2.5‑Pro, confirming irreducible fine‑grained visual information.
Audio ablation – Open‑source models rely on textual audio descriptions, indicating weak non‑speech signal parsing. Gemini models excel with raw audio, showing robust acoustic feature extraction. Direct speech input also captures prosody, tone, and pauses, improving performance over ASR transcripts.
Conclusion
UNO‑Bench delivers a quantifiable, scalable, and interpretable industrial‑grade evaluation platform for omni‑modal LLMs. Its innovations in data construction, multi‑step open‑ended questioning, and unified ontology provide comprehensive benchmarking and uncover a universal scaling law that can guide future model optimization. Perception capabilities approach human levels, while high‑order reasoning remains the main frontier.
Project homepage: https://github.com/meituan-longcat/UNO-Bench
Dataset: https://huggingface.co/datasets/meituan-longcat/UNO-Bench
