
Exploring Alibaba’s WebWalker Agent: Can It Identify the Current Premier League Leader?

This article walks through the installation, architecture, dataset, experimental results, and real‑world test cases of Alibaba’s open‑source WebWalker agent, demonstrating how it performs vertical deep web retrieval and evaluating its strengths and limitations compared with baseline methods.

xkx's Tech General Store

Sep 7, 2025


In this installment of the "Agent Exploration" series, the author moves from the previous OpenManus experience to a hands‑on review of Alibaba’s WebAgent family, focusing on the first‑generation WebWalker, an open‑source LLM‑driven web‑automation system.

Installation

The setup is straightforward: create a Python 3.10 conda environment, clone the repository, install requirements, configure the large‑model API key, and launch the Streamlit UI.

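# Create and activate an isolated Python 3.10 environment
conda create -n webwalker python=3.10
conda activate webwalker
# Fetch the code and install its dependencies
git clone https://github.com/alibaba-nlp/WebWalker.git
cd WebWalker
pip install -r requirements.txt
# Finish the crawl4ai setup, then verify it
crawl4ai-setup
crawl4ai-doctor
# Configure an OpenAI-compatible endpoint
export OPEN_AI_API_KEY=YOUR_API_KEY
export OPEN_AI_API_BASE_URL=YOUR_API_BASE_URL
# or
export DASHSCOPE_API_KEY=YOUR_API_KEY
# Launch the Streamlit UI
cd src
streamlit run app.py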
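Before launching the UI, it can help to confirm that one of the two credential setups is actually present in the shell. The following is a minimal sketch that only mirrors the variable names from the commands above; the function name detect_backend and its return labels are hypothetical, and how app.py itself reads these variables is not covered here.

```python
import os
from typing import Mapping, Optional


def detect_backend(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Report which credential setup is present (labels are illustrative only)."""
    # An OpenAI-compatible endpoint needs both the key and the base URL.
    if env.get("OPEN_AI_API_KEY") and env.get("OPEN_AI_API_BASE_URL"):
        return "openai-compatible"
    # Otherwise fall back to the DashScope key.
    if env.get("DASHSCOPE_API_KEY"):
        return "dashscope"
    # Nothing usable was found.
    return None


if __name__ == "__main__":
    backend = detect_backend()
    print(backend or "No API credentials found; export them before launching.")
```

Passing the environment in as a mapping keeps the check easy to try with a plain dictionary instead of real credentials.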

Overall Architecture

WebWalker is designed for "vertical deep information retrieval". It enables an LLM to systematically click links, drill down through multiple pages, and synthesize information to answer complex questions that cannot be satisfied by a simple search. The accompanying paper (https://arxiv.org/pdf/2501.07572) introduces the WebWalkerQA dataset, which uses QA pairs rather than action sequences, and distinguishes single‑source (all needed information hidden along one vertical path) from multi‑source (information scattered across independent branches) queries. The dataset evaluates depth, width, and hop dimensions.
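To make the single-source versus multi-source distinction concrete, here is a small illustrative sketch. It assumes each piece of required evidence is described by its click path from the root page; that representation and the function name classify_query are assumptions made for this example and do not reflect the actual WebWalkerQA schema.

```python
from typing import List, Sequence


def is_prefix(shorter: Sequence[str], longer: Sequence[str]) -> bool:
    """True if `shorter` is a leading segment of `longer`."""
    return len(shorter) <= len(longer) and list(longer[: len(shorter)]) == list(shorter)


def classify_query(evidence_paths: List[Sequence[str]]) -> str:
    """Label a query by where its evidence pages sit in the site tree.

    Each path is the chain of pages clicked from the root to an evidence page.
    """
    if not evidence_paths:
        raise ValueError("at least one evidence path is required")
    ordered = sorted(evidence_paths, key=len)
    # If every path is a prefix of the next longer one, all evidence pages
    # lie on one vertical chain: single-source under the definition above.
    for shorter, longer in zip(ordered, ordered[1:]):
        if not is_prefix(shorter, longer):
            return "multi-source"
    return "single-source"


if __name__ == "__main__":
    print(classify_query([["home", "calls"], ["home", "calls", "dates"]]))
    print(classify_query([["home", "calls", "dates"], ["home", "venue"]]))
```

In the first call both evidence pages sit on the same path, so the query counts as single-source; in the second they sit on separate branches, so it counts as multi-source.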

Agent Design

WebWalker consists of two agents:

Explorer Agent follows a Thought‑Action‑Observation loop to explore pages.

Critic Agent stores useful observations, decides whether the collected information suffices, and either continues exploration or outputs an answer.

The Explorer’s step‑by‑step process is:

Receive the current page observation.

Reason based on history (Think).

Select the next link to click (Action).

Navigate to the new page and obtain a fresh observation.

Update the context.

Repeat until the Critic signals completion or a step limit is reached.

The Critic evaluates the accumulated memory, determines if the original question can be answered, and decides to continue or stop.
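The interplay between the two agents can be sketched with a toy example. This is not WebWalker's implementation: the site below is made up, both LLM-driven decisions are replaced by simple stand-ins (first unvisited link for the Explorer, keyword matching for the Critic), and backtracking from dead-end pages is an assumption added so the toy walk can finish. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Page:
    text: str
    links: Dict[str, str]  # link label -> target page id


# Toy site standing in for real crawled pages (entirely made up).
SITE: Dict[str, Page] = {
    "home": Page("Welcome to ExampleConf.", {"Venue": "venue", "Calls": "calls"}),
    "venue": Page("The venue is Example Hall.", {}),
    "calls": Page("Call for papers.", {"Important Dates": "dates"}),
    "dates": Page("The submission deadline is 1 March.", {}),
}


def explorer_choose(page: Page, visited: Set[str]) -> Optional[str]:
    """Stand-in for the Explorer's Think + Action step: first unvisited link."""
    for target in page.links.values():
        if target not in visited:
            return target
    return None


def critic_update(memory: List[str], observation: str, keywords: List[str]) -> bool:
    """Stand-in for the Critic: store useful text, then judge sufficiency."""
    if any(k in observation.lower() for k in keywords):
        memory.append(observation)
    return all(any(k in m.lower() for m in memory) for k in keywords)


def walk(root: str, keywords: List[str], max_steps: int = 10) -> Tuple[Optional[str], int]:
    """Run the explore/critique loop; return (answer or None, clicks used)."""
    memory: List[str] = []
    visited = {root}
    trail = [root]  # current click path, used for backtracking
    actions = 0
    if critic_update(memory, SITE[root].text, keywords):
        return " ".join(memory), actions
    while trail and actions < max_steps:
        target = explorer_choose(SITE[trail[-1]], visited)
        if target is None:
            trail.pop()  # dead end: go back one page (toy assumption)
            continue
        actions += 1  # one click
        visited.add(target)
        trail.append(target)
        if critic_update(memory, SITE[target].text, keywords):
            return " ".join(memory), actions
    return None, actions  # step limit reached or site exhausted


if __name__ == "__main__":
    print(walk("home", ["deadline", "venue"]))
```

The structural point is that the Explorer only decides where to go next, while the Critic alone owns the memory and the stop decision; the step limit guards against walks that never gather enough evidence.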

Experimental Results

The paper compares WebWalker with ReAct and Reflexion using two metrics: answer accuracy (higher is better) and average action count (lower is better). Findings include:

WebWalker generally outperforms the baselines, though the highest accuracy remains modest, indicating a challenging task.

For open‑source models, larger parameter scales yield better performance.

Multi‑source and multi‑step queries are significantly harder.

Performance on Chinese and English queries is comparable.
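For readers who want to reproduce the two metrics on their own runs, a minimal sketch follows. The RunResult record is hypothetical, and averaging the action count over all runs is an assumption; the paper's exact averaging rule is not restated in this article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    correct: bool  # was the final answer judged correct?
    actions: int   # number of actions (clicks) the agent took


def accuracy(runs: List[RunResult]) -> float:
    """Fraction of runs with a correct answer (higher is better)."""
    return sum(r.correct for r in runs) / len(runs)


def average_action_count(runs: List[RunResult]) -> float:
    """Mean number of actions per run (lower is better)."""
    return sum(r.actions for r in runs) / len(runs)


if __name__ == "__main__":
    # Made-up runs, for illustration only.
    runs = [RunResult(True, 3), RunResult(False, 10), RunResult(True, 5)]
    print(f"accuracy={accuracy(runs):.2f}, avg actions={average_action_count(runs):.1f}")
```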

A follow‑up experiment shows that a pure Retrieval‑Augmented Generation (RAG) approach tops out at about 40% accuracy, while combining RAG with WebWalker improves on that figure.

Case Tests

The author modifies app.py to add support for DeepSeek models and demonstrates two scenarios:

Querying the ACL 2025 conference website for the paper‑submission deadline and venue address.

Asking WebWalker to locate the current Premier League leader on Sina Sports.

Both tests show the agent’s ability to navigate, observe, and synthesize answers, with screenshots confirming correct results. A Windows‑specific fix, calling asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy()), is also noted.
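The Windows fix can be applied conditionally so the same script still runs on Linux and macOS, where WindowsProactorEventLoopPolicy does not exist. The sketch below wraps the call from the article in a platform guard; the function name is hypothetical. The fix is presumably needed because the browser-driven crawler launches subprocesses, which the selector event loop on Windows does not support, though the article does not state the reason.

```python
import asyncio
import sys


def apply_windows_event_loop_fix() -> bool:
    """Switch to the Proactor event loop policy on Windows; do nothing elsewhere.

    Returns True if the policy was changed.
    """
    if sys.platform != "win32":
        return False
    # This attribute only exists on Windows, hence the guard above.
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
    return True


if __name__ == "__main__":
    # Call this before any event loop is created.
    print("policy changed:", apply_windows_event_loop_fix())
```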

Conclusion

WebWalker, released in January 2025, demonstrates promising capabilities for deep web browsing and information extraction, especially on single‑source queries. However, accuracy on multi‑source problems remains low (below 20%), and further advances in LLM capability and integration with tools like RAG are needed before it can replace traditional RPA or automated testing workflows.
