How Alibaba Cloud’s Ops‑Agentic‑Search Reached Human‑Level Performance on the GAIA Benchmark
The article traces the shift of AI agents from passive responders to proactive executors, outlines the challenges of hallucination, task failure, and consistency, introduces the GAIA benchmark, and details how Alibaba Cloud's Ops‑Agentic‑Search reached 92.36% accuracy, matching human experts, through global planning, reflection, dynamic context management, and a self‑evolving Skills system.
Background
Large language models (LLMs) have enabled AI systems to move from passive question‑answering to autonomous, multi‑step execution. An AI agent must perceive its environment, decompose goals, invoke external tools, and iterate actions to complete long‑running, cross‑system tasks.
Key Challenges
Hallucination propagation: errors amplify across steps.
High task‑failure rate: complex tasks often abort midway.
Long‑term consistency: goal drift is common.
Tool‑call reliability: no unified standard for external tool integration.
GAIA Benchmark
The General AI Assistants (GAIA) benchmark, co‑created by Meta AI, Hugging Face and others, evaluates agents on 466 real‑world scenarios covering reasoning, multimodal processing, web browsing and tool usage. Human experts score 92% while GPT‑4 averages below 30%.
Ops‑Agentic‑Search Performance
Alibaba Cloud’s Ops‑Agentic‑Search framework achieved 92.36% on GAIA, making it the first system to reach human‑level performance and surpassing competitors such as Manus and OpenAI Deep Research.
Core Technical Advantages
Deep integration with OpenSearch for powerful retrieval.
End‑to‑end reasoning loop: task understanding, dynamic planning, tool execution, feedback iteration, and evaluation.
Native multimodal support (text, image, video, audio).
Built‑in browser automation, code execution, file operations, MCP protocol compatibility, and a self‑evolving Skills system.
Framework Capability Overview
1. Global Dynamic Planning (plan_with_files)
The plan_with_files mechanism externalizes the planning process, intermediate results and execution state into files. This decouples planning from execution, removes context‑window limits, enables checkpoint‑resume, and improves consistency by reloading the plan before each action.
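The article does not publish the implementation of plan_with_files; the following is a minimal Python sketch of the idea, with hypothetical file and function names. The plan lives on disk instead of in the context window, and the loop checkpoints after every step so a restarted agent resumes from the first unfinished step.

```python
import json
from pathlib import Path

PLAN_FILE = Path("plan.json")  # hypothetical location for the externalized plan


def save_plan(steps):
    """Write the plan and its execution state to disk."""
    PLAN_FILE.write_text(json.dumps(steps, indent=2))


def load_plan():
    """Reload the plan from disk before acting, so the agent stays
    aligned with the global goal instead of drifting."""
    return json.loads(PLAN_FILE.read_text())


def run(execute_step):
    """Checkpoint-resume loop: skip finished steps, persist after each one."""
    steps = load_plan()
    for i, step in enumerate(steps):
        if step.get("done"):
            continue  # already completed in a previous session
        execute_step(step)
        steps[i]["done"] = True
        save_plan(steps)  # checkpoint: a crash here loses at most one step
```

Because the plan is a file rather than conversation history, it is immune to context‑window limits, and re‑reading it before each action is what keeps execution consistent with the original goal.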
2. Self‑Reflection Mechanism
Agents evaluate their own outputs, detect errors and iteratively adjust strategies, driving quality convergence and suppressing hallucinations.
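This evaluate‑adjust cycle can be sketched as follows; the function names are hypothetical, not the framework's API. The agent regenerates with the validator's feedback until the output passes or the retry budget runs out.

```python
def reflect_and_retry(generate, validate, max_rounds=3):
    """Generate an output, cross-validate it, and retry with feedback
    until it passes or the retry budget is exhausted."""
    feedback = None
    output = None
    for _ in range(max_rounds):
        output = generate(feedback)      # execution output (uses prior feedback)
        ok, feedback = validate(output)  # cross-validation / error detection
        if ok:
            return output                # quality converged
    return output                        # best effort after the budget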
Execution Output → Cross‑validation → Error Detection → Strategy Adjustment → Re‑execution
3. Dynamic Context Management
Two complementary strategies balance completeness, coherence and resource efficiency:
Summary strategy: semantic compression retains key reasoning nodes and condenses redundant content into concise summaries; ideal for long dialogues.
Discard strategy: evaluates timeliness, relevance and dependency to prune low‑priority content when the context window is saturated.
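A minimal sketch of how the two strategies can combine, under assumptions of my own (the `count_tokens` and `summarize` callables are placeholders, not the product's API): recent messages are kept within the token budget, and instead of silently discarding the overflow, it is folded into a single summary message.

```python
def prune_context(messages, budget, count_tokens, summarize):
    """Keep the newest messages that fit the token budget (discard strategy);
    compress everything that falls out into one summary (summary strategy)."""
    kept, dropped, used = [], [], 0
    for msg in reversed(messages):      # newest first: recency wins
        cost = count_tokens(msg)
        if used + cost <= budget:
            kept.append(msg)
            used += cost
        else:
            dropped.append(msg)         # low priority under this heuristic
    kept.reverse()
    if dropped:
        dropped.reverse()
        kept.insert(0, summarize(dropped))  # semantic compression of the rest
    return kept
```

A production policy would score relevance and dependency as the article describes, not just recency, but the shape of the trade‑off (prune, then summarize what was pruned) is the same.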
4. Self‑Evolving Skills System
Skills are automatically extracted from multiple reasoning paths and refined through a closed loop: execute → extract → apply → re‑extract. This enables rapid skill reuse and continuous quality improvement, and avoids redundant exploration of similar tasks.
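The closed loop above can be sketched as a small skill store; the class and its callbacks are hypothetical illustrations, not Ops‑Agentic‑Search internals. Each task signature maps to a stored skill that is applied on the next similar task and re‑extracted (refined) from the resulting trace.

```python
class SkillStore:
    """execute → extract → apply → re-extract, keyed by task signature."""

    def __init__(self):
        self.skills = {}  # task signature -> reusable skill (e.g. a procedure)

    def solve(self, signature, execute, extract):
        skill = self.skills.get(signature)              # apply a stored skill, if any
        trace = execute(skill)                          # execute (with or without it)
        self.skills[signature] = extract(trace, skill)  # re-extract / refine
        return trace
```

On the first encounter with a task type the agent explores from scratch; on later encounters it starts from the stored skill, which is what eliminates redundant exploration.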
Application Scenarios
Enterprise Knowledge Q&A: internal document‑based chatbot with > 92% accuracy.
Market Research Report Generation: automated data collection and analysis, > 10× efficiency.
Code‑assisted Development: requirement understanding, code generation and debugging, > 50% productivity boost.
Data Analysis Reporting: automatic extraction and visualization, minutes‑level report generation.
Customer Service Automation: issue understanding, knowledge‑base lookup, > 90% resolution rate.
Complex Research Task Example
Task: Analyze the 2025 global AI Agent market, identify major vendors, technical roadmaps, market shares and forecast the next three years.
Step 1: Task decomposition
├─ Sub‑task 1: Collect 2025 vendor information
├─ Sub‑task 2: Compare technical roadmaps
├─ Sub‑task 3: Gather market‑share data
└─ Sub‑task 4: Forecast 3‑year trends
Step 2: Parallel information collection
├─ Search authoritative reports (Gartner, IDC…)
├─ Browse vendor websites
├─ Retrieve academic papers & blogs
└─ Analyze open‑source community activity
Step 3: Integration & analysis
├─ Cross‑validate multiple sources
├─ Identify key trends
└─ Generate structured analysis report
Step 4: Report generation
├─ Write executive summary
├─ Produce detailed analysis chapters
├─ Create comparative tables & charts
└─ Output final research report
Automated > 20 web‑browsing actions.
Aggregated > 15 authoritative reports.
Generated a full research report with charts in under 5 minutes.
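Step 2 of the walkthrough fans out across independent sources. A minimal sketch of that fan‑out, assuming a generic `fetch` callable (the real framework's collectors and source names are not public):

```python
from concurrent.futures import ThreadPoolExecutor


def collect_parallel(sources, fetch):
    """Query independent sources concurrently and return results keyed
    by source, ready for the cross-validation step that follows."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(fetch, sources))
    return dict(zip(sources, results))
```

Collecting in parallel is what lets the agent aggregate 15+ reports and finish inside the minutes‑level budget the example claims, since the total wall time is bounded by the slowest source rather than the sum of all of them.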
Ops‑Agentic‑Search Product Overview
Deep Retrieval: multi‑agent collaborative progressive search.
Task Execution: end‑to‑end handling of complex multi‑step tasks.
Tool Invocation: built‑in browser, code execution, file operations.
Multimodal Understanding: support for text, image, video and audio.
Knowledge‑Base Integration: seamless connection to enterprise knowledge bases and OpenSearch indexes.
Result Verification: automatic validation of information accuracy and source reliability.
Quick Start
Documentation: https://developer.aliyun.com/article/1708935
Live demo: https://opensearch.console.aliyun.com/cn-shanghai/rag/agentic-search
Conclusion & Outlook
Ops‑Agentic‑Search’s top rank on GAIA demonstrates that AI agents can now operate at human‑expert levels (92.36% accuracy). The open‑source core components and participation in standards such as MCP aim to foster industry progress and enable large‑scale enterprise adoption of AI agents.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.