Can Full‑Modal AI Agents Master Vision, Audio, and Tools? Meet OmniGAIA & OmniAtlas
This article introduces OmniGAIA, a challenging full-modal benchmark of 360 real-world tasks, and OmniAtlas, a training framework that equips multimodal agents with active perception and tool-integrated reasoning. Extensive experiments and analysis show substantial gains over existing open-source models.
Background and Motivation
Human intelligence naturally combines vision, hearing, language, deep reasoning, and tool use, yet most current multimodal large models (MLLMs) are limited to dual‑modal interactions such as image‑text or audio‑text, lacking the full‑modal cognition, long‑range reasoning, and tool‑calling abilities required of a general AI assistant.
OmniGAIA Benchmark
To fill this gap, researchers from Renmin University, Xiaohongshu, Southeast University, Zhejiang University, and Tsinghua University released OmniGAIA, a high-difficulty benchmark designed to evaluate native full-modal AI agents. OmniGAIA contains 360 tasks sourced from real scenarios across nine domains (geography, history, science, art, sports, finance, etc.), with inputs ranging from long-duration video + audio to image + audio combinations. Tasks demand multi-hop reasoning, multi-round tool calls, and open-ended answers that can be uniquely verified.
Demo Tasks
Image + Audio: "How far apart in time are the events shown in the picture and described in the audio?"
Video with Audio: "A video mentions a moving bridge from a movie; retrieve detailed information about that bridge."
Why Existing Benchmarks Fall Short
Although models like Qwen3‑Omni and Gemini‑3 can process text, vision, and audio within a single architecture, prior benchmarks (e.g., OmniBench, WorldSense) focus on very short clips and mainly pose perception‑type multiple‑choice questions. Real‑world tasks are far more complex, requiring the model to locate key information in long videos or multiple images, verify facts via search engines, and perform multi‑step reasoning and computation. OmniGAIA was created to assess precisely these capabilities.
Construction of OmniGAIA
The dataset was built through a systematic pipeline:
Data Sources: Video data aggregated from FineVideo, LongVideoBench, LongVideo‑Reason; image‑audio pairs combined COCO 2017 with authentic audio tracks.
Information Mining: Gemini‑3‑Flash parses raw material into fine‑grained segments (≤60 s video clips, timestamped ASR, speaker diarization, audio event detection, OCR, object/person recognition, scene summaries).
Event Graph Construction & Expansion: DeepSeek-V3.2 builds a "full-modal event graph" linking cross-modal entities and relations; external tools (search, browsing, image retrieval, visual QA, code execution) are then used to add "next-hop evidence" and extend the graph (a schematic sketch of this structure follows below).
QA Generation & Review: An “event fuzzification” strategy masks or abstracts key entities/attributes, turning simple fact queries into multi‑modal, multi‑hop reasoning problems. Samples undergo LLM pre‑screening and human verification to ensure naturalness, correctness, and uniqueness of answers.
Statistical breakdowns (shown in the accompanying figures) illustrate the distribution of task types, input lengths, and domain coverage.
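The article describes the event graph only at a high level. As a rough illustration, a node/edge structure of the kind this pipeline might produce could look like the sketch below; all class and field names here are hypothetical, not the authors' actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EventNode:
    """One cross-modal event or entity mined from the source material."""
    node_id: str
    modality: str                       # "video", "audio", "image", or "text"
    description: str                    # e.g. a scene summary, ASR snippet, or OCR result
    time_span: Optional[tuple] = None   # (start_s, end_s) for temporal modalities

@dataclass
class EventEdge:
    """A relation between two nodes, optionally backed by external evidence."""
    source: str                                   # node_id of the source node
    target: str                                   # node_id of the target node
    relation: str                                 # e.g. "same_person", "mentions"
    evidence: list = field(default_factory=list)  # tool outputs (search, VQA, ...)

@dataclass
class EventGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def expand(self, from_id: str, relation: str, new_node: EventNode, tool_result: str):
        """Attach "next-hop evidence" returned by an external tool as a new node."""
        self.nodes[new_node.node_id] = new_node
        self.edges.append(EventEdge(from_id, new_node.node_id, relation, [tool_result]))
```

The expand method mirrors the expansion step described above: a tool result becomes a new node attached to an existing one, which is what turns single-hop facts into multi-hop question material.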
OmniAtlas: A Native Full‑Modal Agent Framework
To improve open-source agents on OmniGAIA, the authors propose OmniAtlas, a training framework that follows a Tool-Integrated Reasoning (TIR) paradigm, intertwining internal thought with external tool calls along a single trajectory. OmniAtlas focuses on three core capabilities:
1. Active Perception
When the model suspects that crucial information resides in a specific audio segment, video clip, or image region, it can invoke built‑in tools read_video, read_audio, or read_image to fetch only the relevant portion, preserving detail while reducing cost.
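The article names the tools but not their signatures. A minimal sketch of what such an active-perception interface could look like follows; the argument names, defaults, and the action format in the trailing comment are assumptions, not the published API.

```python
# Hypothetical active-perception tool interface; names and arguments are assumptions.
def read_video(path: str, start_s: float, end_s: float, fps: float = 1.0):
    """Return frames sampled only from the requested clip instead of the full video."""
    ...

def read_audio(path: str, start_s: float, end_s: float):
    """Return the waveform of the requested segment for transcription or event detection."""
    ...

def read_image(path: str, bbox: tuple):
    """Return a crop (x0, y0, x1, y1) so the agent can zoom in on a region of interest."""
    ...

# In a tool-integrated reasoning loop, the model emits an action such as
#   {"tool": "read_video", "args": {"path": "clip.mp4", "start_s": 312.0, "end_s": 330.0}}
# and the returned frames are appended to the context as a new observation.
```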
2. High‑Quality Trajectory Synthesis & Supervised Fine‑Tuning
The authors introduce a “trajectory synthesis + supervised learning” pipeline: raw multimodal inputs are first converted into high‑quality textual descriptions; a strong reasoning model then performs hindsight‑guided tree exploration, sampling multiple “thought + action” branches at each step. Correct trajectories are retained after pruning with reference answers and validators.
During supervised fine‑tuning, loss is applied only to the model‑generated “thought tokens” and “action tokens”, not to the tool‑returned observations, encouraging the model to learn how to think and decide rather than merely mimic tool outputs.
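A minimal sketch of this observation masking, assuming the standard -100 ignore index used by cross-entropy losses in Hugging Face-style training loops (the segment labels and function name are illustrative, not taken from the paper):

```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by the cross-entropy loss

def build_sft_labels(token_ids: list, segments: list) -> torch.Tensor:
    """Keep loss only on model-generated "thought" and "action" spans.

    `segments` lists (kind, start, end) spans over `token_ids`, where kind is
    one of "thought", "action", or "observation" (tool output).
    """
    labels = torch.full((len(token_ids),), IGNORE_INDEX, dtype=torch.long)
    for kind, start, end in segments:
        if kind in ("thought", "action"):          # supervise the model's own decisions
            labels[start:end] = torch.tensor(token_ids[start:end])
        # "observation" spans stay at IGNORE_INDEX, so tool outputs are never imitated
    return labels

# Example: a 10-token trajectory with a thought, an action, and a tool observation.
ids = list(range(100, 110))
segs = [("thought", 0, 4), ("action", 4, 6), ("observation", 6, 10)]
print(build_sft_labels(ids, segs))  # the last four positions are -100
```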
3. OmniDPO Fine‑Grained Error Correction
Because a single mistake can cascade in multimodal tasks, OmniDPO first lets an SFT-trained model explore the training set autonomously, then uses a stronger model to locate the first error in each failed trajectory and generate a corrected prefix. This yields focused positive–negative pairs that more effectively correct perception, retrieval, tool-use, and reasoning errors.
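One way to picture the resulting preference pairs (the step structure and function name below are illustrative, not the authors' exact recipe): keep the shared prefix up to the first error, then pair the corrected step against the original faulty one.

```python
from dataclasses import dataclass

@dataclass
class Step:
    thought: str
    action: str
    observation: str

def build_omnidpo_pair(failed_trajectory: list, first_error_idx: int, corrected_step: Step):
    """Build a focused preference pair from a failed rollout.

    `first_error_idx` is where a stronger model judged the trajectory to go wrong;
    `corrected_step` is its proposed fix. Everything before the error is shared context.
    """
    prefix = failed_trajectory[:first_error_idx]               # shared, still-correct prefix
    chosen = prefix + [corrected_step]                          # preferred continuation
    rejected = prefix + [failed_trajectory[first_error_idx]]   # original faulty continuation
    return {"prompt": prefix, "chosen": chosen, "rejected": rejected}
```

Because the two continuations differ only at the first faulty step, the preference signal targets the specific perception, retrieval, tool-use, or reasoning mistake rather than the whole trajectory.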
Experimental Results
1. Main Benchmark Performance
Closed‑source Gemini‑3‑Pro leads with a Pass@1 of 62.5%.
The strongest open‑source baseline Qwen‑3‑Omni (30B) reaches only 13.3%, a 4.7× gap.
Model scale alone does not determine performance: a 560B LongCat‑Flash‑Omni scores 11.1%, lower than the 30B Qwen‑3‑Omni.
OmniAtlas boosts Qwen-3-Omni from 13.3% to 20.8% (+7.5 points); on a 7B model, accuracy jumps from 3.6% to 13.3% (nearly 4×).
2. Fine‑Grained Error Analysis
Over 90% of failures on difficult tasks stem from improper tool use (missing calls, misdirected calls, or endless loops).
Gemini‑3‑Pro exhibits far lower tool‑use error rates (35.3% vs 81.1%) and reasoning error rates (15.8% vs 79.7%) compared with open‑source models.
OmniAtlas reduces tool-use errors from 81.1% to 59.4% and reasoning errors from 79.7% to 64.4%, though perception errors remain in the 30–50% range.
3. Tool‑Calling Behavior
Models that never call tools achieve very low success, confirming that external tools are essential for complex tasks.
More tool calls do not guarantee better results; excessive calls often lead to ineffective loops.
OmniAtlas exhibits broader and more proactive tool usage, directly improving task success rates.
4. Native Perception vs. External Tools
Four configurations were evaluated (fully native perception, native vision with an external audio tool, native audio with an external vision tool, and external tools for both modalities) on a strong model (Gemini-3-Flash) and a weaker model (Qwen-3-Omni):
For the strong model, native perception yields the highest accuracy (51.7%) with the fewest tool calls (4.4). Adding external tools reduces accuracy (down to 43.3%) and doubles call cost.
For the weak model, external tools modestly improve easy‑task accuracy (19.7% → 24.6%) but dramatically hurt difficult‑task performance (9.0% → 3.9%).
Conclusion: Native multimodal perception is the optimal solution for strong models, while external tools serve only as temporary patches for weaker models and cannot replace deep cross‑modal reasoning.
5. Training Strategy Ablation
Two stages were quantified:
OmniAtlas‑SFT: Provides the bulk of gains, raising Qwen‑3‑Omni‑30B Pass@1 from 13.3% to 18.9% and cutting invalid tool‑call rate from 81.1% to 65.3%.
OmniDPO: Further refines performance, pushing the same model to 20.8% and consistently lowering perception, tool‑use, and logical reasoning error rates.
Summary and Future Directions
OmniGAIA reveals critical shortcomings of current full‑modal models in long‑range reasoning and tool usage, while OmniAtlas offers an effective training recipe that markedly improves open‑source agents. The authors suggest three promising research avenues:
Full‑modal Agentic Reinforcement Learning to directly optimize long‑term decision policies from real feedback.
Building a full‑modal MCP ecosystem to integrate richer tool sets and expand application boundaries.
Developing full‑modal embodied AI agents that interact with the physical world.
Paper link: https://arxiv.org/pdf/2602.22897
Code & Demo: https://github.com/RUC-NLPIR/OmniGAIA
Dataset & Model: https://huggingface.co/collections/RUC-NLPIR/omnigaia
Leaderboard: https://huggingface.co/spaces/RUC-NLPIR/OmniGAIA-LeaderBoard