How Alibaba Cloud’s Ops‑Agentic‑Search Reached Human‑Level Performance on the GAIA Benchmark
Alibaba Cloud’s AI Search team introduces Ops‑Agentic‑Search, an enterprise‑grade AI agent framework that targets three core challenges: hallucination, task failure, and long‑term consistency. On the GAIA benchmark it reports 92.36% accuracy, on par with human experts. This article outlines its technical architecture, key mechanisms, use cases, and planned open‑source contributions.
Background
With the rapid rise of large language models, AI systems are transitioning from passive response to proactive execution. Agents serve as the core vehicle for this shift, enabling autonomous perception, goal decomposition, tool invocation, and iterative action, thereby extending single‑turn inference to end‑to‑end task completion.
Scaling agent capabilities faces several hurdles: hallucination amplification across steps, high task‑failure rates, consistency drift in long‑running tasks, and unreliable tool integration.
GAIA Benchmark
GAIA (General AI Assistants Benchmark), co‑created by Meta AI, Hugging Face and other research groups, evaluates agent abilities across 466 real‑world tasks covering reasoning, multimodal processing, web browsing, and tool use. The answers to 300 of these tasks are withheld and used to score a global leaderboard. Models such as GPT‑4 score below 30% on average, while human experts score around 92%.
Ops‑Agentic‑Search Framework
Ops‑Agentic‑Search is Alibaba Cloud OpenSearch’s enterprise‑grade agent framework. It tightly integrates OpenSearch’s powerful search engine with a full‑stack reasoning loop that includes task understanding, dynamic planning, tool execution, feedback iteration, and evaluation.
The core capability matrix includes:
Multimodal understanding (native support for text, image, video, audio)
Browser automation (BrowserUse) for autonomous web browsing and information extraction
Code execution (CodeAgent) for Python/Shell generation and execution
File operations for local read/write
MCP protocol compatibility for ecosystem integration
Self‑evolving Skills system that automatically extracts and refines reusable skills
Key Techniques
1. Global Dynamic Planning (plan_with_files) – The plan_with_files mechanism externalizes planning steps, intermediate results, and execution state into files, decoupling task length from context window limits and enabling ultra‑long tasks.
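The idea of externalizing plan state can be illustrated with a minimal sketch. It is not the framework's actual plan_with_files implementation: the JSON layout, the file name plan.json, and the helper names save_plan, load_plan, and run_next_step are all assumptions made for illustration. The point it shows is that each turn reloads the plan from disk, so earlier steps never need to stay in the model's context.

```python
import json
import tempfile
from pathlib import Path


def save_plan(path, plan):
    """Persist the whole plan (steps, results, status) to disk."""
    Path(path).write_text(json.dumps(plan, indent=2), encoding="utf-8")


def load_plan(path):
    """Reload the plan; nothing about earlier steps has to stay in the prompt."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def run_next_step(path, execute):
    """Run the first pending step and write its result back to the file.

    Returns the step that was run, or None when the plan is finished.
    """
    plan = load_plan(path)
    for step in plan["steps"]:
        if step["status"] == "pending":
            step["result"] = execute(step["goal"])
            step["status"] = "done"
            save_plan(path, plan)
            return step
    return None


def demo():
    """Run a two-step plan with a stand-in executor and return the final state."""
    plan = {
        "task": "summarize market reports",
        "steps": [
            {"goal": "collect sources", "status": "pending", "result": None},
            {"goal": "write summary", "status": "pending", "result": None},
        ],
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "plan.json"  # hypothetical file name
        save_plan(path, plan)
        # Each iteration reloads state from disk, as a fresh agent turn would.
        while run_next_step(path, execute=lambda goal: f"finished: {goal}"):
            pass
        return load_plan(path)
```

In a real agent the execute callback would be a model or tool call. Because progress lives in the file, a task can span far more steps than would fit in one context window, and an interrupted run can resume from the last saved step.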
2. Self‑Reflection – Agents continuously evaluate their own outputs, identify errors, and adjust strategies. The flow is illustrated below:
Execution output → cross‑validation → error identification → strategy adjustment → re‑execution
3. Dynamic Context Management – Two complementary strategies keep the context window efficient:
Summary strategy: semantic compression that retains key reasoning nodes while converting redundant content into concise summaries, ideal for long dialogues.
Discard strategy: evaluates timeliness, relevance, and dependency to drop low‑priority information when the window is full.
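The two strategies can be sketched roughly as follows. This is an illustration under stated assumptions, not the framework's code: the Message fields, the word-count token estimate, the priority formula (relevance plus a recency term), and the summarizer callback are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    turn: int               # when the message was added (higher = newer)
    relevance: float        # 0..1, assumed to come from an external scorer
    required: bool = False  # True if later steps depend on this message


def token_count(msg):
    """Crude token estimate: whitespace-separated words."""
    return len(msg.text.split())


def summarize_old(messages, keep_recent, summarizer):
    """Summary strategy: compress all but the newest keep_recent messages."""
    if len(messages) <= keep_recent:
        return list(messages)
    split = len(messages) - keep_recent
    old, recent = messages[:split], messages[split:]
    summary = Message(
        text=summarizer([m.text for m in old]),
        turn=old[-1].turn,
        relevance=max(m.relevance for m in old),
        required=any(m.required for m in old),
    )
    return [summary] + list(recent)


def discard_low_priority(messages, budget, current_turn):
    """Discard strategy: drop lowest-priority messages until within budget."""

    def priority(m):
        recency = 1.0 / (1 + current_turn - m.turn)  # timeliness
        return m.relevance + recency

    kept = list(messages)
    # Messages that later steps depend on are never candidates for removal.
    droppable = sorted((m for m in kept if not m.required), key=priority)
    for victim in droppable:
        if sum(token_count(m) for m in kept) <= budget:
            break
        kept = [m for m in kept if m is not victim]
    return kept
```

In practice the summarizer would be a model call and the relevance score would come from retrieval or embedding similarity. The sketch only shows how the two strategies differ: one rewrites old content into a shorter form, the other removes it outright.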
4. Self‑Evolving Skills – A closed loop of execution → extraction → application → re‑extraction continuously improves skill quality, allowing agents to skip repetitive reasoning for similar tasks.
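The execution → extraction → application → re‑extraction loop might look like the sketch below. SkillStore, solve, and the "keep the shortest successful recipe" refinement rule are hypothetical choices for illustration; the article does not describe how the framework actually stores or ranks skills.

```python
class SkillStore:
    """Hypothetical in-memory skill registry keyed by task type."""

    def __init__(self):
        self._skills = {}

    def lookup(self, task_type):
        """Application: return the stored steps for this task type, if any."""
        return self._skills.get(task_type)

    def extract(self, task_type, steps):
        """Extraction and re-extraction: keep the shortest recipe seen so far.

        Preferring fewer steps is an assumed refinement rule, not the
        framework's documented behavior.
        """
        current = self._skills.get(task_type)
        if current is None or len(steps) < len(current):
            self._skills[task_type] = list(steps)


def solve(task_type, store, reason_from_scratch):
    """Reuse a skill when one exists; otherwise reason and extract a new one."""
    steps = store.lookup(task_type)
    reused = steps is not None
    if not reused:
        steps = reason_from_scratch(task_type)  # the expensive path
    store.extract(task_type, steps)
    return steps, reused
```

On the first call for a task type the planner runs and its steps are stored. Later calls of the same type skip the planner entirely, which is the "skip repetitive reasoning" effect described above.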
Use Cases and Example
Typical scenarios demonstrated include enterprise knowledge Q&A (accuracy >92%), market‑research report generation (10× efficiency), code‑assisted development (50% faster), data‑analysis reporting (minutes instead of days), and automated customer service (resolution >90%).
Case study: Complex research task – The agent was tasked with analyzing the 2025 global AI‑Agent market, covering vendors, technology routes, market share, and three‑year forecasts. The workflow involved task decomposition, parallel information gathering, cross‑validation, and report generation. Results:
20+ automated web‑browsing actions
Integration of 15+ authoritative reports
Full research report with charts produced in under 5 minutes
Product Overview
AgenticSearch, the AI search paradigm launched by Alibaba Cloud OpenSearch, combines deep retrieval, multi‑step reasoning, tool calling, and multimodal understanding. With it, the team topped the GAIA leaderboard at 92.36% accuracy, matching human expert performance.
Future Directions and Contributions
The core technologies will be open‑sourced gradually to foster industry progress, and Alibaba Cloud will actively participate in standards such as the MCP protocol. Deep integration with Bailian, DingTalk and other Alibaba Cloud services aims to build a comprehensive agent ecosystem.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
