Deep Ranking Optimization for E-commerce Recommendation
The 2021 Taobao New‑Product team improved e‑commerce recommendation by redesigning the coarse‑ranking stage with a dual‑tower DSSM, low‑cost feature crossing, NOVA attention, and multi‑task distillation from a fine‑ranking teacher. Offline, the changes delivered up to +30‰ GAUC; online, they raised CTR by roughly 1.5‑2.2 % and clicks by roughly 3.5‑5.3 %.
This article describes the key optimization processes and results of the 2021 Taobao New‑Product algorithm team on deep coarse‑ranking (粗排) for e‑commerce recommendation.
Background : In large‑scale e‑commerce recommendation, the pipeline (recall → coarse‑ranking → fine‑ranking → re‑ranking) must finish within ~300 ms. Coarse‑ranking handles thousands to tens of thousands of candidates and therefore requires high‑performance, low‑latency models, while fine‑ranking works on a few hundred items and can afford more complex models.
Differences between coarse‑ranking and fine‑ranking :
• Candidate set size: coarse‑ranking scores up to 5k items, fine‑ranking only the top 600.
• Model structure: coarse‑ranking often uses dual‑tower (e.g., DSSM) with simple interactions; fine‑ranking uses richer architectures (e.g., SIM) with extensive feature crossing.
• Sample space: coarse‑ranking draws from the whole item pool, fine‑ranking from the truncated coarse‑ranking output.
Coarse‑ranking optimization – user‑attention interaction DSSM : The team first used static rule‑based truncation, then a simple GBDT model, and later a single‑tower WDL model. The most effective solution was a dual‑tower DSSM where the user tower is computed online and the item tower is pre‑computed offline, eliminating version‑consistency issues after the BE3.0 graph‑merge upgrade.
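The split between an online user tower and an offline item tower can be sketched as follows. This is a minimal illustration, not the team's implementation: the `ITEM_VECTORS` table, the `user_tower` stand-in (which only normalizes its input instead of running a network), and the item ids are all hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Offline step (assumption): item-tower outputs are computed in batch
# and stored in a lookup table keyed by item id.
ITEM_VECTORS = {
    "item_a": l2_normalize([0.9, 0.1, 0.0]),
    "item_b": l2_normalize([0.1, 0.8, 0.3]),
    "item_c": l2_normalize([-0.5, 0.2, 0.7]),
}

def user_tower(user_features):
    """Stand-in for the real user network; here it only normalizes."""
    return l2_normalize(user_features)

def coarse_rank(user_features, candidate_ids, top_k):
    """Run the user tower once per request, then score by dot product."""
    u = user_tower(user_features)
    scored = [(dot(u, ITEM_VECTORS[item_id]), item_id) for item_id in candidate_ids]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:top_k]]

top = coarse_rank([1.0, 0.2, 0.0], ["item_a", "item_b", "item_c"], top_k=2)
```

Because the user vector is computed once and each candidate then costs a single lookup plus a dot product, the per-request cost grows only mildly with the number of candidates, which is what makes this layout attractive for scoring thousands of items.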
Two‑stage distillation : To bridge the performance gap between coarse‑ranking and fine‑ranking, a teacher‑student framework was introduced. The teacher is the fine‑ranking model; the student is the coarse‑ranking model. In the first stage, soft labels from the teacher are used to train the student (distill_loss). The second stage adds multi‑task learning (soft and hard labels) and low‑cost feature crossing (FM) to further improve the student.
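The two loss formulations can be sketched as below. The source names only `distill_loss`; the cross-entropy form, the single shared student logit, and the `soft_weight` parameter are assumptions made for illustration, and the real model may instead use separate heads for the soft and hard labels.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(target, pred, eps=1e-7):
    """Cross-entropy between a target in [0, 1] and a predicted probability."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def distill_loss(teacher_prob, student_logit):
    """Stage 1: the student fits the teacher's soft label."""
    return binary_cross_entropy(teacher_prob, sigmoid(student_logit))

def multi_task_loss(click_label, teacher_prob, student_logit, soft_weight=0.5):
    """Stage 2 (assumed form): hard-label loss plus a weighted soft-label loss."""
    hard = binary_cross_entropy(click_label, sigmoid(student_logit))
    soft = distill_loss(teacher_prob, student_logit)
    return hard + soft_weight * soft

# An undecided student (logit 0) against a teacher that predicts 0.8:
stage1 = distill_loss(0.8, 0.0)
stage2 = multi_task_loss(1, 0.8, 0.0)
```

The soft-label term is smallest when the student's probability matches the teacher's, so it pulls the coarse-ranking scores toward the fine-ranking ordering while the hard-label term keeps the student anchored to real clicks.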
Model improvements :
1. Multi‑task label distillation.
2. Low‑cost FM feature crossing between user behavior sequence and target item.
3. NOVA attention for denoising user tower.
4. SENET + Bilinear interaction for item tower.
5. Replacing the final dot‑product with an MLP.
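One way an FM-style cross between the behavior sequence and the target item can stay cheap is to pool the sequence once per request. The sketch below is an assumption about how such a cross might be computed, not the team's published formulation; `pool_sequence` and `cross_score` are hypothetical names.

```python
def fm_cross_naive(seq_embs, target_emb):
    """Sum of inner products between every behavior embedding and the target."""
    return sum(
        sum(a * b for a, b in zip(emb, target_emb))
        for emb in seq_embs
    )

def pool_sequence(seq_embs):
    """Per-request step: sum the behavior embeddings dimension by dimension."""
    dim = len(seq_embs[0])
    return [sum(emb[d] for emb in seq_embs) for d in range(dim)]

def cross_score(pooled, target_emb):
    """Per-candidate step: one inner product of length dim."""
    return sum(p * t for p, t in zip(pooled, target_emb))

seq = [[1.0, 2.0], [3.0, 4.0]]   # two behavior embeddings (toy values)
target = [0.5, -1.0]             # one candidate item embedding (toy values)

naive = fm_cross_naive(seq, target)
pooled_score = cross_score(pool_sequence(seq), target)
```

Both paths give the same value because the inner product is linear, so the sum of products equals the product with the summed vector. With a sequence of length n and embedding size d, the pooled form costs O(d) per candidate instead of O(n·d).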
Offline experiments : Various model variants were evaluated on GAUC. Adding MLP, FM, and distillation yielded up to +30‰ GAUC improvement; combining NOVA, SENET, and FiBiNet‑style interactions added another +10‑13‰.
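For reference, GAUC is commonly computed as a per-user AUC averaged with per-user weights. The sketch below assumes impression-count weighting and skips users whose samples are all positive or all negative; the source does not state which weighting the team used.

```python
from collections import defaultdict

def auc(labels, scores):
    """Pairwise AUC; returns None when one class is missing."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def gauc(user_ids, labels, scores):
    """Impression-weighted average of per-user AUC (weighting is an assumption)."""
    groups = defaultdict(lambda: ([], []))
    for user, label, score in zip(user_ids, labels, scores):
        groups[user][0].append(label)
        groups[user][1].append(score)
    weighted_sum, total_weight = 0.0, 0
    for user_labels, user_scores in groups.values():
        user_auc = auc(user_labels, user_scores)
        if user_auc is None:
            continue  # user has only clicks or only non-clicks
        weight = len(user_labels)
        weighted_sum += weight * user_auc
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None

users  = ["u1", "u1", "u1", "u2", "u2", "u3", "u3"]
labels = [1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.2, 0.4, 0.3, 0.6, 0.5, 0.1]
result = gauc(users, labels, scores)
```

In this toy data, u1 is ranked perfectly, u2 is ranked in reverse, and u3 has no clicks and is skipped, so the weighted average comes to 0.6. Measuring ranking quality within each user's own candidate list is why GAUC is often preferred to global AUC for recommendation.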
Online experiments : The deployed two‑stage distillation + tower optimizations achieved:
• Overall uplift: uCTR +1.86 %, pCTR +1.68 %, depth +1.83 %, clicks +3.48 %.
• For “mindful” users: uCTR +1.52 %, pCTR +2.19 %, depth +3.11 %, clicks +5.33 %.
Further analysis : Synchronous distillation (teacher and student trained jointly) showed mixed results; disabling student updates to the shared embedding improved performance, while allowing the teacher to update user‑attention gave the best gains.
Conclusion and outlook : The study demonstrates that careful architecture design (dual‑tower DSSM), low‑cost feature crossing, and knowledge distillation can significantly improve coarse‑ranking efficiency and effectiveness. Future work includes defining clearer evaluation metrics for coarse‑ranking and exploring richer negative sampling strategies.
