How LLM‑Powered AI Transforms Taobao Product Selection: From DeepSearch to Agentic RL
This article analyzes the challenges of traditional product selection on Taobao and presents an LLM‑driven solution that combines multi‑round online search (contrasting DeepSearch and WideSearch strategies), automated sample construction, and SFT and RL training, with experimental results showing gains in the relevance, diversity, and efficiency of the selected product set.
Overview
Traditional Taobao product selection relies on manual, rule‑based pipelines that are complex and inefficient, align poorly with seasonal scenarios, and produce low‑quality selections.
LLM‑Based Product Selection Framework
The system accepts a natural‑language intent (full idea, theme, hot keyword, or image) and uses a large language model (LLM) to parse the request, generate search terms, and retrieve high‑relevance, high‑potential items through a coarse‑to‑fine filtering pipeline.
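To make the flow concrete, here is a minimal sketch of such a coarse‑to‑fine pipeline. All names (select_products, llm.complete, llm.score, search_engine.search) are hypothetical illustrations, not Taobao's production API:

```python
# Hypothetical sketch of the coarse-to-fine selection flow; `llm` and
# `search_engine` stand in for the (non-public) production components.

def select_products(user_input: str, llm, search_engine, top_k: int = 100):
    # 1) Parse the free-form intent (theme, hot keyword, image caption, ...).
    intent = llm.complete(f"Extract the product-selection intent from: {user_input}")

    # 2) Expand the intent into concrete Taobao search terms.
    terms = llm.complete(f"Generate search terms for: {intent}").splitlines()

    # 3) Coarse stage: high-recall retrieval per term.
    candidates = []
    for term in terms:
        candidates.extend(search_engine.search(term, limit=1000))

    # 4) Fine stage: re-rank by relevance/potential and keep the head.
    ranked = sorted(candidates, key=lambda item: llm.score(intent, item.title),
                    reverse=True)
    return ranked[:top_k]
```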
Online Search Strategies
Two algorithms are explored:
DeepSearch: an iterative “search‑read‑reason” loop. Each round performs a web search, reads the results, and uses a ReAct‑style reasoning step to decide the next query. This improves information depth but incurs high latency (~140 s) and an unbounded number of rounds.
WideSearch: a plan‑and‑execute approach. A planning module generates a set of search terms that are executed concurrently, achieving higher information density at much lower latency (~24 s, or ~6 s for the SFT‑tuned variant, versus ~4 s for a plain single‑turn query). Both loops are sketched below.
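A minimal sketch of the two loops, assuming hypothetical llm and web_search interfaces (the production prompts and tools are not public):

```python
from concurrent.futures import ThreadPoolExecutor

def deep_search(question: str, llm, web_search, max_rounds: int = 5):
    """Sequential search-read-reason (ReAct-style): each round's query depends
    on what the previous round found -> deep but slow (~140 s). The production
    loop count was uncontrolled; max_rounds here is a safety bound."""
    notes, query = [], question
    for _ in range(max_rounds):
        notes.append(web_search(query))
        decision = llm.complete(
            f"Question: {question}\nNotes so far: {notes}\n"
            "Reply FINISH, or give the next search query."
        )
        if decision.strip() == "FINISH":
            break
        query = decision
    return notes

def wide_search(question: str, llm, web_search, n_terms: int = 8):
    """Plan-and-execute: one planning call, then all queries run in
    parallel -> high information density at low latency (~24 s)."""
    plan = llm.complete(f"List {n_terms} parallel search queries for: {question}")
    queries = [q.strip() for q in plan.splitlines() if q.strip()]
    with ThreadPoolExecutor(max_workers=n_terms) as pool:
        return list(pool.map(web_search, queries))
```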
Search Efficiency Comparison
Key differences:

| Strategy | Search mode | Information density | Latency |
| --- | --- | --- | --- |
| Normal (single‑turn) | single query | low | ~4 s |
| DeepSearch | multi‑round, sequential | high (depth‑oriented) | ~140 s |
| WideSearch | parallel | high | ~24 s |
| WideSearch‑SFT | parallel | high | ~6 s |
Sample Construction for Supervised Fine‑Tuning (SFT)
Because real labeled data are scarce, a multi‑agent pipeline generates ~10 k training instances (the full pipeline is sketched after this list):
Multi‑agent collaboration: a search agent performs web retrieval, a demand‑analysis agent extracts user intent, and a term‑generation agent produces candidate search terms.
Quality filtering: an LLM‑as‑a‑Judge evaluates each sample; low‑quality samples are discarded or regenerated, achieving a ~75 % pass rate.
Semantic deduplication: semantic hashing removes near‑duplicate entries, reducing redundancy by ~30 %.
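A hedged sketch of this construction pipeline; the agent prompts, the 1–10 judge scale and threshold, the embed function, and the sign‑pattern semantic hash are illustrative assumptions:

```python
# Sketch of the multi-agent sample-construction pipeline. `llm`,
# `web_search`, `judge`, and `embed` are hypothetical interfaces.

def build_sft_samples(seed_intents, llm, web_search, judge, embed):
    samples, seen_sigs = [], set()
    for intent in seed_intents:
        # Search agent: gather external context for the intent.
        evidence = web_search(intent)
        # Demand-analysis agent: distill the underlying shopping need.
        need = llm.complete(f"Summarize the shopping need in: {intent}\n{evidence}")
        # Term-generation agent: propose candidate Taobao search terms.
        terms = llm.complete(f"Generate Taobao search terms for: {need}")

        # LLM-as-a-Judge quality gate (~75 % of samples pass in the article;
        # the 1-10 scale and threshold of 7 are illustrative).
        score = float(judge.complete(f"Rate 1-10: terms {terms} for need {need}"))
        if score < 7:
            continue

        # Semantic deduplication: a sign-pattern hash of the embedding, a
        # simple stand-in for semantic hashing (~30 % redundancy removed).
        sig = tuple(x > 0 for x in embed(need + terms))
        if sig in seen_sigs:
            continue
        seen_sigs.add(sig)
        samples.append({"input": intent, "context": evidence, "target": terms})
    return samples
```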
Supervised Fine‑Tuning (SFT)
This filtered dataset is used to fine‑tune the base LLM (e.g., Tbstars‑42B‑A3.5B). After SFT the model generates more concise, Taobao‑compatible search terms, but a gap remains between LLM‑generated terms and actual user queries.
Reinforcement Learning for Search Term Optimization (SearchRL)
SearchRL replaces offline SFT with online reinforcement learning using the GRPO algorithm. The reward function combines three components:
Semantic score: a 7 B judge model evaluates the completeness and conciseness of the term.
Relevance score: cosine similarity between the BGE embedding of the term and the embeddings of retrieved product titles.
Quantity score: the number of retrieved items; a curriculum‑learning schedule gradually raises the reward ceiling from 1,000 to 10,000 items.
This approach does not require manually labeled terms; the reward is derived from actual Taobao search results.
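A sketch of what such a composite reward could look like; the equal weighting, the [0, 1] score ranges, and the linear curriculum schedule are assumptions, not the published configuration:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search_reward(term, items, judge, bge_embed, step, total_steps):
    # 1) Semantic score: judge model rates completeness/conciseness
    #    (assumed to return a value in [0, 1]).
    semantic = judge.score(term)

    # 2) Relevance score: mean cosine similarity between the term's BGE
    #    embedding and the BGE embeddings of retrieved product titles.
    t = bge_embed(term)
    relevance = (
        sum(cosine(t, bge_embed(it.title)) for it in items) / len(items)
        if items else 0.0
    )

    # 3) Quantity score: the curriculum lifts the recall ceiling from
    #    1,000 items early in training to 10,000 at the end.
    ceiling = 1_000 + (10_000 - 1_000) * (step / total_steps)
    quantity = min(len(items), ceiling) / ceiling

    # Equal weighting is an assumption; the actual mix is not specified.
    return (semantic + relevance + quantity) / 3.0
```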
Agentic RL for Whole‑Strategy Optimization
Beyond individual terms, Agentic RL optimizes an entire selection strategy consisting of positive and negative search terms. Workflow:
The LLM generates a JSON strategy, e.g.
{"positiveSearch": ["termA", "termB"], "negativeSearch": ["termX"]}.
A lightweight TinyPreview tool simulates product recall on a sampled subset of the 40‑billion‑item catalog.
The simulated recall is scored with the same reward components and fed back to the LLM.
Multiple rollouts form a trajectory; the RL algorithm updates the policy to maximize the trajectory reward. A minimal sketch of this loop follows.
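In this sketch, policy_llm, tiny_preview, and reward_fn are hypothetical interfaces; the GRPO‑style policy update itself happens outside this function, in the RL trainer:

```python
import json

def rollout(intent, policy_llm, tiny_preview, reward_fn, max_turns=4):
    """One Agentic RL trajectory: generate a strategy, simulate recall with
    TinyPreview, score it, feed the score back, and repeat."""
    trajectory, feedback = [], ""
    for _ in range(max_turns):
        raw = policy_llm.complete(
            f"Intent: {intent}\nPrevious feedback: {feedback}\n"
            'Output JSON: {"positiveSearch": [...], "negativeSearch": [...]}'
        )
        strategy = json.loads(raw)

        # Simulate product recall on a sampled slice of the catalog.
        items = tiny_preview(strategy["positiveSearch"],
                             strategy["negativeSearch"])

        # Same composite reward (semantic + relevance + quantity).
        reward = reward_fn(strategy, items)
        trajectory.append((strategy, reward))
        feedback = f"reward={reward:.3f}, recalled={len(items)} items"
    return trajectory  # the RL trainer maximizes total trajectory reward
```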
Experimental Results
Models were warm‑started with 5 k SFT samples, then trained with 17 k unlabeled examples using the ROLL framework. Representative metrics are shown below:
| Model | First‑turn reward | Final reward | Improvement | Avg. turns | Max‑reward turn | High‑reward % |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline (single‑turn) | 4.3 | – | – | 1.0 | 1.0 | 19% |
| tbstars‑42B‑A3.5B‑ds‑sft | 4.4 | 5.75 | 31% | 2.9 | 2.0 | 21% |
| Qwen‑2.5‑72B‑Instruct‑ds‑sft | 4.8 | 6.2 | 29% | 2.9 | 2.0 | 28% |
| tbstars‑42B‑A3.5B‑ds‑rl | 5.6 | 6.2 | 11% | 2.7 | 1.5 | 35% |

Findings:
Both DeepSearch SFT and RL improve the first‑turn reward.
RL yields the largest first‑turn boost (5.6 vs. 4.4 for SFT), bringing the 42 B model's final reward on par with the larger 72 B Qwen‑2.5 model.
Higher‑reward strategies increase the proportion of high‑quality selections, potentially raising manual adoption rates.
RL’s overall turn‑count reduction is modest compared with SFT, indicating diminishing returns after early iterations.
Conclusion
The LLM‑driven product‑selection pipeline substantially improves relevance, diversity, and efficiency over traditional rule‑based methods. Future work should shift from optimizing intermediate metrics (diversity, relevance) to directly optimizing commercial effectiveness (品效) of the selected set via end‑to‑end Agentic RL.