How Alibaba Cloud’s Ops‑Agentic‑Search Reached Human‑Level Performance on the GAIA Benchmark
The article traces the shift of AI agents from passive responders to proactive executors, outlines the challenges of hallucination, task failure, and consistency, introduces the GAIA benchmark, and details how Alibaba Cloud's Ops‑Agentic‑Search reached 92.36% accuracy, matching human experts, through global planning, reflection, dynamic context management, and a self‑evolving Skills system.
Background
Large language models (LLMs) have enabled AI systems to move from passive question‑answering to autonomous, multi‑step execution. An AI agent must perceive its environment, decompose goals, invoke external tools, and iterate actions to complete long‑running, cross‑system tasks.
Key Challenges
Hallucination propagation: errors amplify across steps.
High task‑failure rate: complex tasks often abort midway.
Long‑term consistency: goal drift is common.
Tool‑call reliability: no unified standard for external tool integration.
GAIA Benchmark
The General AI Assistants (GAIA) benchmark, co‑created by Meta AI, Hugging Face and others, evaluates agents on 466 real‑world scenarios covering reasoning, multimodal processing, web browsing and tool usage. Human experts score 92% while GPT‑4 averages below 30%.
Ops‑Agentic‑Search Performance
Alibaba Cloud’s Ops‑Agentic‑Search framework achieved 92.36% on GAIA, making it the first system to reach human‑level performance and surpassing competitors such as Manus and OpenAI Deep Research.
Core Technical Advantages
Deep integration with OpenSearch for powerful retrieval.
End‑to‑end reasoning loop: task understanding, dynamic planning, tool execution, feedback iteration, and evaluation.
Native multimodal support (text, image, video, audio).
Built‑in browser automation, code execution, file operations, MCP protocol compatibility, and a self‑evolving Skills system.
Framework Capability Overview
1. Global Dynamic Planning (plan_with_files)
The plan_with_files mechanism externalizes the planning process, intermediate results and execution state into files. This decouples planning from execution, removes context‑window limits, enables checkpoint‑resume, and improves consistency by reloading the plan before each action.
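The article does not publish the implementation of plan_with_files; the following is a minimal Python sketch of the idea, with hypothetical file and function names. The plan lives on disk instead of in the context window, and the loop checkpoints after every step so a restarted agent resumes from the first unfinished step.

```python
import json
from pathlib import Path

PLAN_FILE = Path("plan.json")  # hypothetical location for the externalized plan


def save_plan(steps):
    """Write the plan and its execution state to disk."""
    PLAN_FILE.write_text(json.dumps(steps, indent=2))


def load_plan():
    """Reload the plan from disk before acting, so the agent stays
    aligned with the global goal instead of drifting."""
    return json.loads(PLAN_FILE.read_text())


def run(execute_step):
    """Checkpoint-resume loop: skip finished steps, persist after each one."""
    steps = load_plan()
    for i, step in enumerate(steps):
        if step.get("done"):
            continue  # already completed in a previous session
        execute_step(step)
        steps[i]["done"] = True
        save_plan(steps)  # checkpoint: a crash here loses at most one step
```

Because the plan is a file rather than conversation history, it is immune to context‑window limits, and re‑reading it before each action is what keeps execution consistent with the original goal.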
2. Self‑Reflection Mechanism
Agents evaluate their own outputs, detect errors and iteratively adjust strategies, driving quality convergence and suppressing hallucinations.
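This evaluate‑adjust cycle can be sketched as follows; the function names are hypothetical, not the framework's API. The agent regenerates with the validator's feedback until the output passes or the retry budget runs out.

```python
def reflect_and_retry(generate, validate, max_rounds=3):
    """Generate an output, cross-validate it, and retry with feedback
    until it passes or the retry budget is exhausted."""
    feedback = None
    output = None
    for _ in range(max_rounds):
        output = generate(feedback)      # execution output (uses prior feedback)
        ok, feedback = validate(output)  # cross-validation / error detection
        if ok:
            return output                # quality converged
    return output                        # best effort after the budget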
Execution Output → Cross‑validation → Error Detection → Strategy Adjustment → Re‑execution
3. Dynamic Context Management
Two complementary strategies balance completeness, coherence and resource efficiency:
Summary strategy: semantic compression retains key reasoning nodes and condenses redundant content into concise summaries; ideal for long dialogues.
Discard strategy: evaluates timeliness, relevance and dependency to prune low‑priority content when the context window is saturated.
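A minimal sketch of how the two strategies can combine, under assumptions of my own (the `count_tokens` and `summarize` callables are placeholders, not the product's API): recent messages are kept within the token budget, and instead of silently discarding the overflow, it is folded into a single summary message.

```python
def prune_context(messages, budget, count_tokens, summarize):
    """Keep the newest messages that fit the token budget (discard strategy);
    compress everything that falls out into one summary (summary strategy)."""
    kept, dropped, used = [], [], 0
    for msg in reversed(messages):      # newest first: recency wins
        cost = count_tokens(msg)
        if used + cost <= budget:
            kept.append(msg)
            used += cost
        else:
            dropped.append(msg)         # low priority under this heuristic
    kept.reverse()
    if dropped:
        dropped.reverse()
        kept.insert(0, summarize(dropped))  # semantic compression of the rest
    return kept
```

A production policy would score relevance and dependency as the article describes, not just recency, but the shape of the trade‑off (prune, then summarize what was pruned) is the same.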
4. Self‑Evolving Skills System
Skills are automatically extracted from multiple reasoning paths and refined through a closed loop: execute → extract → apply → re‑extract. This enables rapid skill reuse and continuous quality improvement, and avoids redundant exploration of similar tasks.
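The closed loop above can be sketched as a small skill store; the class and its callbacks are hypothetical illustrations, not Ops‑Agentic‑Search internals. Each task signature maps to a stored skill that is applied on the next similar task and re‑extracted (refined) from the resulting trace.

```python
class SkillStore:
    """execute → extract → apply → re-extract, keyed by task signature."""

    def __init__(self):
        self.skills = {}  # task signature -> reusable skill (e.g. a procedure)

    def solve(self, signature, execute, extract):
        skill = self.skills.get(signature)              # apply a stored skill, if any
        trace = execute(skill)                          # execute (with or without it)
        self.skills[signature] = extract(trace, skill)  # re-extract / refine
        return trace
```

On the first encounter with a task type the agent explores from scratch; on later encounters it starts from the stored skill, which is what eliminates redundant exploration.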
Application Scenarios
Enterprise Knowledge Q&A: internal document‑based chatbot with > 92% accuracy.
Market Research Report Generation: automated data collection and analysis, > 10× efficiency.
Code‑assisted Development: requirement understanding, code generation and debugging, > 50% productivity boost.
Data Analysis Reporting: automatic extraction and visualization, minutes‑level report generation.
Customer Service Automation: issue understanding, knowledge‑base lookup, > 90% resolution rate.
Complex Research Task Example
Task: Analyze the 2025 global AI Agent market, identify major vendors, technical roadmaps, market shares and forecast the next three years.
Step 1: Task decomposition
├─ Sub‑task 1: Collect 2025 vendor information
├─ Sub‑task 2: Compare technical roadmaps
├─ Sub‑task 3: Gather market‑share data
└─ Sub‑task 4: Forecast 3‑year trends
Step 2: Parallel information collection
├─ Search authoritative reports (Gartner, IDC…)
├─ Browse vendor websites
├─ Retrieve academic papers & blogs
└─ Analyze open‑source community activity
Step 3: Integration & analysis
├─ Cross‑validate multiple sources
├─ Identify key trends
└─ Generate structured analysis report
Step 4: Report generation
├─ Write executive summary
├─ Produce detailed analysis chapters
├─ Create comparative tables & charts
└─ Output final research report
Automated > 20 web‑browsing actions.
Aggregated > 15 authoritative reports.
Generated a full research report with charts in under 5 minutes.
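Step 2 of the walkthrough fans out across independent sources. A minimal sketch of that fan‑out, assuming a generic `fetch` callable (the real framework's collectors and source names are not public):

```python
from concurrent.futures import ThreadPoolExecutor


def collect_parallel(sources, fetch):
    """Query independent sources concurrently and return results keyed
    by source, ready for the cross-validation step that follows."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(fetch, sources))
    return dict(zip(sources, results))
```

Collecting in parallel is what lets the agent aggregate 15+ reports and finish inside the minutes‑level budget the example claims, since the total wall time is bounded by the slowest source rather than the sum of all of them.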
Ops‑Agentic‑Search Product Overview
Deep Retrieval: multi‑agent collaborative progressive search.
Task Execution: end‑to‑end handling of complex multi‑step tasks.
Tool Invocation: built‑in browser, code execution, file operations.
Multimodal Understanding: support for text, image, video and audio.
Knowledge‑Base Integration: seamless connection to enterprise knowledge bases and OpenSearch indexes.
Result Verification: automatic validation of information accuracy and source reliability.
Quick Start
Documentation: https://developer.aliyun.com/article/1708935
Live demo: https://opensearch.console.aliyun.com/cn-shanghai/rag/agentic-search
Conclusion & Outlook
Ops‑Agentic‑Search’s top rank on GAIA demonstrates that AI agents can now operate at human‑expert levels (92.36% accuracy). The open‑source core components and participation in standards such as MCP aim to foster industry progress and enable large‑scale enterprise adoption of AI agents.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.