How Alibaba Cloud’s Ops‑Agentic‑Search Reached Human‑Level Performance on the GAIA Benchmark

Alibaba Cloud's AI Search team introduces Ops-Agentic-Search, an enterprise-grade AI agent framework designed to address hallucination, task failure, and long-term consistency. The team reports 92.36% accuracy on the GAIA benchmark, on par with human experts. This article outlines the framework's technical architecture, key mechanisms, use cases, and planned open-source contributions.

Alibaba Cloud Big Data AI Platform

Background

With the rapid rise of large language models, AI systems are transitioning from passive response to proactive execution. Agents serve as the core vehicle for this shift, enabling autonomous perception, goal decomposition, tool invocation, and iterative action, thereby extending single‑turn inference to end‑to‑end task completion.

Scaling agent capabilities faces several hurdles: hallucination amplification across steps, high task‑failure rates, consistency drift in long‑running tasks, and unreliable tool integration.

GAIA Benchmark

GAIA (General AI Assistants), a benchmark co-created by Meta AI, Hugging Face, and other research groups, evaluates agent abilities across 466 real-world tasks covering reasoning, multimodal processing, web browsing, and tool use. The answers to 300 of these tasks are withheld to power a public leaderboard. General-purpose models such as GPT-4 have averaged below 30% (the original GAIA paper reported about 15% for GPT-4 with plugins), while human respondents score around 92%.

Ops‑Agentic‑Search Framework

Ops‑Agentic‑Search is Alibaba Cloud OpenSearch’s enterprise‑grade agent framework. It tightly integrates OpenSearch’s powerful search engine with a full‑stack reasoning loop that includes task understanding, dynamic planning, tool execution, feedback iteration, and evaluation.

Its core capabilities include:

Multimodal understanding (native support for text, image, video, audio)

Browser automation (BrowserUse) for autonomous web browsing and information extraction

Code execution (CodeAgent) for Python/Shell generation and execution

File operations for local read/write

MCP (Model Context Protocol) compatibility for ecosystem integration

Self‑evolving Skills system that automatically extracts and refines reusable skills
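One way to picture how such capabilities plug into a reasoning loop is a small tool registry with a single dispatch point. The sketch below is illustrative only: the article does not describe the framework's internal API, so the registry, the tool names ("file.read", "file.write"), and the call format are all assumptions. Only the file-operation capability is modeled, since it needs nothing beyond the standard library.

```python
from typing import Callable, Dict

# Hypothetical registry: tool name -> callable. Not the framework's real API.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Register a function under a tool name that a planner can refer to."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator


@tool("file.read")
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()


@tool("file.write")
def write_file(path: str, text: str) -> str:
    with open(path, "w", encoding="utf-8") as f:
        written = f.write(text)
    return f"wrote {written} characters"


def dispatch(call: dict) -> dict:
    """Execute one tool call and report success or failure as structured feedback."""
    fn = TOOLS.get(call["tool"])
    if fn is None:
        return {"ok": False, "error": f"unknown tool: {call['tool']}"}
    try:
        return {"ok": True, "output": fn(**call.get("args", {}))}
    except Exception as exc:  # surface the failure as feedback instead of crashing
        return {"ok": False, "error": repr(exc)}
```

Returning failures as data rather than raising lets the calling loop decide whether to retry, re-plan, or give up, which is the kind of feedback iteration the framework description mentions.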

Key Techniques

1. Global Dynamic Planning (plan_with_files) – The plan_with_files mechanism externalizes planning steps, intermediate results, and execution state into files, decoupling task length from context window limits and enabling ultra‑long tasks.
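The idea of externalizing plan state can be illustrated with a minimal sketch. The article does not publish the file format or API of plan_with_files, so the JSON layout and the function names used here (create_plan, next_step, record_result, run) are assumptions made purely for illustration. The point is that the plan, intermediate results, and execution state live on disk, so each iteration reloads only what it needs instead of carrying the full history in the context window.

```python
import json
from pathlib import Path


def create_plan(path, goal, steps):
    """Write a fresh plan file in which every step starts as 'pending'."""
    plan = {
        "goal": goal,
        "steps": [
            {"id": i, "task": task, "status": "pending", "result": None}
            for i, task in enumerate(steps)
        ],
    }
    Path(path).write_text(json.dumps(plan, indent=2), encoding="utf-8")
    return plan


def load_plan(path):
    return json.loads(Path(path).read_text(encoding="utf-8"))


def next_step(path):
    """Return the first pending step, or None when the plan is complete."""
    for step in load_plan(path)["steps"]:
        if step["status"] == "pending":
            return step
    return None


def record_result(path, step_id, result):
    """Persist a step's result so later iterations can reload it from disk."""
    plan = load_plan(path)
    step = plan["steps"][step_id]
    step["status"] = "done"
    step["result"] = result
    Path(path).write_text(json.dumps(plan, indent=2), encoding="utf-8")


def run(path, execute):
    """Drive the plan to completion; `execute` stands in for the agent's real work."""
    while (step := next_step(path)) is not None:
        record_result(path, step["id"], execute(step["task"]))
    return load_plan(path)
```

Because progress is written after every step, a run that is interrupted can resume from the first step still marked pending.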

2. Self‑Reflection – Agents continuously evaluate their own outputs, identify errors, and adjust strategies. The flow is illustrated below:

Execution output → Cross-validation → Error identification → Strategy adjustment → Re-execution
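The five stages of this flow can be sketched as a bounded retry loop. This is a hedged illustration rather than the framework's implementation: reflect_and_retry, the validator interface, and the adjust callback are hypothetical names, and a real agent would likely use model-based checks where this sketch uses plain functions.

```python
def reflect_and_retry(execute, validators, adjust, strategy, max_rounds=3):
    """Execute, cross-validate, and retry with an adjusted strategy.

    execute(strategy)        -> output
    validators               -> callables returning an error string or None
    adjust(strategy, errors) -> new strategy
    Returns (output, history); output is None if no round passed validation.
    """
    history = []
    for round_no in range(1, max_rounds + 1):
        output = execute(strategy)               # execution output
        errors = []
        for check in validators:                 # cross-validation
            error = check(output)
            if error is not None:                # error identification
                errors.append(error)
        history.append({"round": round_no, "strategy": strategy, "errors": errors})
        if not errors:
            return output, history
        strategy = adjust(strategy, errors)      # strategy adjustment, then re-execution
    return None, history
```

For example, with an execute function that rounds 3.14159 to `strategy` decimal places, a validator that requires the value 3.14, and an adjust function that adds one decimal place, a call starting from strategy 0 passes validation on the third round. Capping the number of rounds keeps a persistently failing task from looping forever.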

3. Dynamic Context Management – Two complementary strategies keep the context window efficient:

Summary strategy: semantic compression that retains key reasoning nodes while converting redundant content into concise summaries, ideal for long dialogues.

Discard strategy: evaluates timeliness, relevance, and dependency to drop low-priority information when the window is full.
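The two strategies can be sketched side by side. Everything below is an assumption made for illustration: the Entry fields, the word-count size estimate, the truncation that stands in for real semantic summarization, and the priority formula (relevance minus an age penalty, with pinned entries protected to model dependency) are not taken from the framework.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    relevance: float       # 0..1, higher means more relevant to the current goal
    age: int               # turns since the entry was added (timeliness)
    pinned: bool = False   # True if later steps depend on this entry


def size(entries):
    """Crude size estimate: whitespace-separated words."""
    return sum(len(e.text.split()) for e in entries)


def summarize(entries, keep_last=2, max_words=8):
    """Summary strategy: shorten older, unpinned entries; keep recent ones verbatim."""
    cutoff = len(entries) - keep_last
    out = []
    for i, e in enumerate(entries):
        words = e.text.split()
        if i < cutoff and not e.pinned and len(words) > max_words:
            # Truncation is a placeholder for a model-generated summary.
            e = Entry(" ".join(words[:max_words]) + " ...", e.relevance, e.age, e.pinned)
        out.append(e)
    return out


def discard(entries, budget):
    """Discard strategy: drop the lowest-priority unpinned entries until the budget fits."""
    def priority(e):
        return e.relevance - 0.1 * e.age

    kept = list(entries)
    while size(kept) > budget:
        candidates = [e for e in kept if not e.pinned]
        if not candidates:
            break  # only pinned entries remain; nothing more can be dropped
        kept.remove(min(candidates, key=priority))
    return kept
```

A reasonable ordering is to summarize first and discard only if the result still exceeds the budget, since summarizing loses less information than dropping an entry outright.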

4. Self‑Evolving Skills – A closed loop of execution → extraction → application → re‑extraction continuously improves skill quality, allowing agents to skip repetitive reasoning for similar tasks.
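A toy version of this loop is sketched below. The article does not say how skills are represented or how quality is judged, so the SkillStore class, the keying of skills by task type, and the rule "a shorter successful trace replaces the stored one" are assumptions chosen only to make the execution, extraction, application, and re-extraction cycle concrete.

```python
class SkillStore:
    """Toy skill library keyed by task type (the keying scheme is an assumption)."""

    def __init__(self):
        # task_type -> {"steps": [...], "uses": int, "version": int}
        self.skills = {}

    def extract(self, task_type, trace):
        """Extraction and re-extraction: keep the successful steps of an execution trace.

        An existing skill is replaced only when the new trace is shorter,
        a simple stand-in for a real quality measure.
        """
        steps = [s["action"] for s in trace if s["ok"]]
        current = self.skills.get(task_type)
        if current is None:
            self.skills[task_type] = {"steps": steps, "uses": 0, "version": 1}
        elif len(steps) < len(current["steps"]):
            current["steps"] = steps
            current["version"] += 1

    def apply(self, task_type):
        """Application: return stored steps so the agent can skip re-planning."""
        skill = self.skills.get(task_type)
        if skill is None:
            return None
        skill["uses"] += 1
        return list(skill["steps"])
```

In use, the agent would call apply before planning a task, fall back to full reasoning when it returns None, and call extract on the resulting trace afterwards, so each successful run can refine the stored skill.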

Use Cases and Example

Typical scenarios demonstrated include enterprise knowledge Q&A (accuracy >92%), market‑research report generation (10× efficiency), code‑assisted development (50% faster), data‑analysis reporting (minutes instead of days), and automated customer service (resolution >90%).

Case study: Complex research task – The agent was tasked with analyzing the 2025 global AI‑Agent market, covering vendors, technology routes, market share, and three‑year forecasts. The workflow involved task decomposition, parallel information gathering, cross‑validation, and report generation. Results:

20+ automated web‑browsing actions

Integration of 15+ authoritative reports

Full research report with charts produced in under 5 minutes
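The parallel gathering and cross-validation stages of such a workflow might look like the following sketch. It is not the agent's actual pipeline: sources are modeled as plain callables, and the agreement rule (accept an answer only when at least two sources report it) is an assumption chosen for simplicity.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def gather(sources, question):
    """Query every source in parallel; each source is a callable returning an answer or None."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        return list(pool.map(lambda src: src(question), sources))


def cross_validate(answers, min_agree=2):
    """Accept the most common answer only if at least `min_agree` sources report it."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    answer, votes = counts.most_common(1)[0]
    return answer if votes >= min_agree else None
```

Threads suit this stage because web and report lookups are I/O-bound, and requiring agreement between independent sources is one simple guard against a single unreliable page ending up in the final report.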

Product Overview

AgenticSearch, the AI search paradigm launched by Alibaba Cloud OpenSearch, combines deep retrieval, multi-step reasoning, tool calling, and multimodal understanding. The team reports that it topped the GAIA leaderboard with 92.36% accuracy, matching human expert performance.

Future Directions and Contributions

The core technologies will be open-sourced gradually to foster industry progress, and Alibaba Cloud will actively participate in standards such as MCP (Model Context Protocol). Deep integration with Bailian, DingTalk, and other Alibaba Cloud services aims to build a comprehensive agent ecosystem.
