
Exploring WebDancer: Alibaba’s WebAgent that Solves Complex Queries Automatically

This article walks through installing Alibaba's WebDancer agent, explains its SFT-plus-RL training pipeline (data construction, trajectory sampling, supervised fine-tuning, and reinforcement learning), compares it with the earlier WebWalker, and demonstrates its multi-step reasoning on a real-world query.

xkx's Tech General Store

Sep 10, 2025


The author continues a series on AI agents, moving from the first‑generation WebWalker to the second‑generation WebDancer, an open‑source ReAct agent that autonomously handles multi‑step information‑seeking tasks.

Installation: The test machine is a Tianyi Cloud research instance (36 CPU cores, 210 GB RAM, 2 × NVIDIA A100 40 GB). The 32B WebDancer model requires at least 60 GB of GPU memory, so a single 40 GB card is not enough and both GPUs are used. The environment is created with:

conda create -n webdancer python=3.12
conda activate webdancer
pip install -r requirements.txt

Dependency conflicts are resolved by pinning these versions:

sglang[all]==0.4.6.post1
qwen-agent[gui,rag,code_interpreter,mcp]==0.0.29

The model is downloaded with:

modelscope download --model iic/WebDancer-32B --local_dir Alibaba-NLP/WebDancer-32B/

It is then deployed with a modified deploy_model.sh script whose tensor-parallel size matches the number of GPUs.
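The memory requirement and the tensor-parallel setting can be checked with a little arithmetic. The sketch below is only an illustration: it assumes 16-bit weights (2 bytes per parameter), ignores KV cache and activation overhead, and builds an sglang launch command as a guess at what a modified deploy_model.sh might run. The actual script contents are not shown in the article, and the helper names are hypothetical.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes).

    Assumption: 2 bytes per parameter (bf16/fp16). KV cache and activations
    need additional memory on top of this.
    """
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params: float, gpu_mem_gb: float,
                         bytes_per_param: int = 2) -> int:
    """Smallest number of GPUs whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_mem_gb)


def sglang_launch_command(model_path: str, tp_size: int) -> list[str]:
    """Build an sglang server command with tensor parallelism set to tp_size.

    Assumption: the deploy script launches sglang this way; the real
    deploy_model.sh may pass different or additional options.
    """
    return ["python", "-m", "sglang.launch_server",
            "--model-path", model_path,
            "--tp", str(tp_size)]


if __name__ == "__main__":
    print(weight_memory_gb(32e9))            # weights of a 32B model
    tp = min_gpus_for_weights(32e9, gpu_mem_gb=40)
    print(" ".join(sglang_launch_command("Alibaba-NLP/WebDancer-32B/", tp)))
```

With 32 billion parameters at 2 bytes each, the weights alone come to about 64 GB (roughly 60 GiB), which is consistent with the 60 GB floor quoted above and explains why the model has to be split across both 40 GB cards.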

Principles and Architecture: WebDancer is trained on synthetic data with a two-stage learning process (SFT followed by RL), as described in its arXiv paper (https://arxiv.org/pdf/2505.22648). Unlike WebWalker, which relies on prompt engineering, WebDancer builds tool use and planning into the model itself through four steps:

Data Construction – generates large‑scale, high‑quality multi‑hop QA pairs using crawlQA (web crawling) and e2hQA (iterative rewriting of simple questions into complex ones).

Trajectory Sampling – converts QA pairs into ReAct-format trajectories with rejection sampling. Short-CoT trajectories are generated with GPT-4o for concise reasoning, and Long-CoT trajectories with a large reasoning model (LRM) such as QwQ-Plus for extended reasoning. Three filters then remove malformed, incorrect, or redundant trajectories.

Supervised Fine‑Tuning (SFT) – teaches the model the ReAct XML‑like tags <think>, <tool_call>, <tool_response>, and <answer>, training it via next‑token prediction to internalize tool‑use and reasoning.

Reinforcement Learning (RL) – uses the DAPO algorithm to further optimize decision sequences on synthetic QA data not seen during SFT, improving long‑horizon planning.
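The trajectory format and the three filters from the sampling step can be made concrete with a small sketch. Everything below is an assumption for illustration: the function names are hypothetical, correctness is checked by case-insensitive exact match, and redundancy is approximated as a repeated identical tool call. The paper's actual filters may be implemented differently, for example with a model acting as judge.

```python
import re

# The four ReAct tags named in the SFT step.
TAG_RE = re.compile(
    r"<(think|tool_call|tool_response|answer)>(.*?)</\1>", re.DOTALL
)


def parse_steps(trajectory: str) -> list[tuple[str, str]]:
    """Split a trajectory string into (tag, body) pairs, in order."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_RE.finditer(trajectory)]


def is_well_formed(steps: list[tuple[str, str]]) -> bool:
    """Format filter: exactly one answer, placed last, and every
    tool_call immediately followed by a tool_response."""
    if not steps or steps[-1][0] != "answer":
        return False
    if sum(1 for tag, _ in steps if tag == "answer") != 1:
        return False
    for i, (tag, _) in enumerate(steps):
        if tag == "tool_call" and steps[i + 1][0] != "tool_response":
            return False
    return True


def is_correct(steps: list[tuple[str, str]], gold: str) -> bool:
    """Correctness filter (assumption: case-insensitive exact match)."""
    return steps[-1][1].lower() == gold.strip().lower()


def is_non_redundant(steps: list[tuple[str, str]]) -> bool:
    """Redundancy filter (assumption: no identical tool call issued twice)."""
    calls = [body for tag, body in steps if tag == "tool_call"]
    return len(calls) == len(set(calls))


def keep_trajectory(trajectory: str, gold: str) -> bool:
    """Apply the three filters in order; short-circuits on the first failure."""
    steps = parse_steps(trajectory)
    return is_well_formed(steps) and is_correct(steps, gold) and is_non_redundant(steps)


def rejection_sample(candidates: list[str], gold: str) -> list[str]:
    """Keep only the sampled trajectories that pass every filter."""
    return [t for t in candidates if keep_trajectory(t, gold)]
```

In this sketch, rejection sampling simply means generating several candidate trajectories per QA pair and discarding those that fail any filter, so that only clean ReAct traces reach the SFT stage.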

Data Construction Details: crawlQA collects root URLs from authoritative sites (arXiv, GitHub, Wikipedia), recursively crawls their subpages, extracts the content, and generates QA pairs with GPT-4o, using prompts to control the question type (COUNT, MULTI-HOP, INTERSECTION). e2hQA starts from a simple QA pair, extracts entities from the question, searches for related information, has an LLM rewrite the question around that information, and repeats the process to produce complex, multi-step queries. An example generated question is: “Which game ranked fourth in the Godot XR Game Jam February 2025 but was not featured in the 2024 Godot Games showreel?” Neither a traditional search engine nor an LLM without web access can answer it directly.
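The e2hQA idea of iteratively rewriting a simple question into a harder one can be illustrated with a toy loop. This is a sketch under strong assumptions: the function name is hypothetical, the clues dictionary stands in for the search-plus-LLM rewriting step, the entities are made up, and the answer is assumed to stay unchanged while the question gains hops.

```python
def escalate_question(question: str, clues: dict[str, str],
                      max_rounds: int = 3) -> tuple[str, int]:
    """Replace one named entity per round with an indirect description.

    `clues` maps an entity to a description that a search plus an LLM
    rewrite might produce (hypothetical stand-in). Returns the rewritten
    question and the number of rounds actually applied.
    """
    rounds = 0
    for _ in range(max_rounds):
        # Pick the first known entity still mentioned in the question.
        entity = next((e for e in clues if e in question), None)
        if entity is None:
            break  # nothing left to obscure
        question = question.replace(entity, clues[entity], 1)
        rounds += 1
    return question, rounds


if __name__ == "__main__":
    # Made-up entities, for illustration only.
    toy_clues = {
        "Acme Labs": "the company that built Roadrunner",
        "Roadrunner": "the engine that won the 2020 Widget Prize",
    }
    print(escalate_question("Who founded Acme Labs?", toy_clues))
```

Each round forces one more lookup before the original question can be answered, which is how a one-hop question turns into a multi-hop one.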

Case Test: After deployment, the provided demo script launches a web UI:

cd scripts
bash run_demo.sh

The author queries the agent with a convoluted question about Chinese football. The displayed execution trace shows the four ReAct stages: <think> (planning), a search tool call, retrieval of ten search results, and final answer generation. Running the same query through a generic search-engine AI produces mismatched results, which highlights WebDancer's more structured reasoning.
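The think, tool call, tool response, answer cycle seen in the trace can be sketched as a small driver loop. This is not WebDancer's code: the model and the search tool are replaced by scripted stubs, the function names are hypothetical, and the JSON shape of the tool call ({"name": ..., "arguments": ...}) is an assumption.

```python
import json
import re
from typing import Callable, Optional


def run_react(question: str, llm: Callable[[str], str],
              tools: dict[str, Callable[..., str]],
              max_turns: int = 8) -> tuple[Optional[str], str]:
    """Alternate model output and tool execution until an <answer> appears."""
    transcript = question
    for _ in range(max_turns):
        output = llm(transcript)
        transcript += "\n" + output
        answer = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
        if answer:
            return answer.group(1).strip(), transcript
        call = re.search(r"<tool_call>(.*?)</tool_call>", output, re.DOTALL)
        if call is None:
            break  # neither an answer nor a tool call: stop
        request = json.loads(call.group(1))
        result = tools[request["name"]](**request["arguments"])
        # Feed the tool output back so the next turn can reason over it.
        transcript += f"\n<tool_response>{result}</tool_response>"
    return None, transcript


def scripted_llm(transcript: str) -> str:
    """Stub model: search once, then answer."""
    if "<tool_response>" not in transcript:
        call = json.dumps({"name": "search", "arguments": {"query": "example query"}})
        return f"<think>I need to look this up.</think>\n<tool_call>{call}</tool_call>"
    return "<think>The result answers the question.</think>\n<answer>stub answer</answer>"


def fake_search(query: str) -> str:
    """Stub tool: returns a canned string instead of real search results."""
    return f"canned result for: {query}"


if __name__ == "__main__":
    final, trace = run_react("Example question?", scripted_llm, {"search": fake_search})
    print(trace)
    print("final:", final)
```

The point of the SFT and RL stages is that the trained model emits these tags on its own, so the surrounding harness only has to execute tool calls and append their responses.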

Conclusion: WebDancer demonstrates how SFT and RL can give an LLM native ReAct capabilities, in contrast to earlier prompt-engineering approaches. Although this open-source agent still lags behind commercial solutions such as OpenAI's Deep Research (DR), the project provides a transparent pipeline and benchmark data for further research.
