
Real-time Controllable Multi-Objective Re-ranking Models for Taobao Feed Recommendation

The paper introduces a real‑time controllable, multi‑objective re‑ranking framework for Taobao’s feed recommendation. It combines actor‑critic reinforcement learning with hypernetworks to adjust objective weights instantly, handles diverse media and cold‑start constraints, and delivers higher click‑through rates, diversity, and cold‑start ratios with only 20‑25 ms latency.

Sohu Tech Products

This document presents a comprehensive study of re‑ranking models that support complex, multi‑objective optimization and real‑time weight adjustment in the Taobao (mobile) feed recommendation scenario.

Main parts:

1. Challenges of the feed scenario and the unique advantages of re‑ranking models.

2. Summary of re‑ranking modeling paradigms.

3. Integration of multiple objectives into re‑ranking.

4. Real‑time controllable re‑ranking based on hypernetworks.

5. Experimental results and online A/B test outcomes.

6. Q&A session.

1. Challenges in the feed scenario and advantages of re‑ranking

In Taobao’s feed pipeline, billions of candidates are first recalled to a pool of millions, then filtered by coarse‑ranking, fine‑ranking and finally a re‑ranking stage that scores dozens to hundreds of items. The earlier stages require extremely high efficiency because of massive scoring volume, while later stages can afford more computation for higher precision. Re‑ranking, first introduced by Alibaba in 2018, differs from previous stages by modeling context information —the mutual influence among items.
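The cascade described above can be sketched as a simple funnel: each stage scores its candidates with a progressively more expensive model and keeps only the top k, so heavy computation is reserved for small candidate sets. The scorers below are toy stand‑ins, not the production models.

```python
def cascade(candidates, stages):
    """stages: list of (scorer, keep_k) pairs, cheapest model first."""
    pool = candidates
    for scorer, keep_k in stages:
        pool = sorted(pool, key=scorer, reverse=True)[:keep_k]
    return pool

# Toy example: integer "items" scored by two stand-in models.
items = list(range(1000))
coarse = lambda x: x % 97   # stand-in for a cheap coarse-ranking model
fine = lambda x: x          # stand-in for a precise fine-ranking model
final = cascade(items, [(coarse, 100), (fine, 10)])
```

Each stage only pays its scoring cost on the pool the previous stage kept, which is why the later, more precise stages can afford more computation.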

Typical challenges include:

Scattered layout: avoid clustering of shops, categories, or content types.

Cold‑start control: guarantee a certain proportion of newly published items.

Mixed media: combine live streams, products, images, and videos, each with different feature spaces.

Multi‑supply fusion: handle items from diverse production pipelines (e.g., different live‑stream sources).

Multi‑objective optimization: balance user experience, efficiency, merchant ecosystem, and business‑specific goals.

Re‑ranking’s context perception and control capabilities directly address these challenges.
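As one concrete instance of the scattered‑layout constraint, a re‑ranked sequence can be checked so that no shop appears twice within any window of adjacent positions. The function and field names below are illustrative, not from the paper.

```python
def is_scattered(sequence, key, window=3):
    """True if no key value repeats inside any sliding window."""
    for i in range(len(sequence)):
        ids = [key(x) for x in sequence[i:i + window]]
        if len(ids) != len(set(ids)):
            return False
    return True

feed = [{"shop": "a"}, {"shop": "b"}, {"shop": "c"}, {"shop": "a"}]
ok = is_scattered(feed, key=lambda x: x["shop"], window=3)
```

The same check works for categories or content types by swapping the key function.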

2. Modeling paradigms (V1, V2, V3)

Three re‑ranking paradigms are described:

V1: Captures partial context via an RNN or Transformer, assigns a single score to each item, and sorts greedily. It lacks explicit control over contextual constraints (e.g., shop dispersion).
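A minimal sketch of the V1 paradigm: each candidate gets one score that also sees a pooled context representation, then items are sorted greedily by that score. The scoring function here is a stand‑in for the paper's RNN/Transformer.

```python
def v1_rerank(candidates, score_fn):
    """Score each item against a pooled context, then sort greedily."""
    context = sum(candidates) / len(candidates)  # pooled "context" feature
    scored = [(score_fn(x, context), x) for x in candidates]
    return [x for _, x in sorted(scored, reverse=True)]

# Toy scores; the penalty on context is an illustrative stand-in.
ranked = v1_rerank([0.2, 0.9, 0.5], score_fn=lambda x, ctx: x - 0.1 * ctx)
```

Because every item is scored once and sorted, nothing stops two items from the same shop landing adjacent, which is exactly the limitation V2 addresses.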

V2: Adds a sequential selection mechanism. An encoder produces a global context embedding; at each step a state vector attends to the remaining candidates, allowing the model to enforce context constraints such as shop dispersion. Training requires supervised labels for the optimal next item, which are difficult to obtain in complex scenarios.
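As a toy illustration of this step‑wise decoding, the sketch below greedily picks the highest‑scoring remaining candidate whose shop differs from the previous pick. The scoring and state update are simplified stand‑ins, not the paper's encoder/attention mechanism.

```python
def v2_select(candidates, length):
    """candidates: list of (score, shop) tuples; enforce shop dispersion."""
    remaining, out = list(candidates), []
    while remaining and len(out) < length:
        remaining.sort(reverse=True)  # best remaining candidate first
        prev_shop = out[-1][1] if out else None
        # Prefer a different shop than the previous pick; fall back if none.
        pick = next((c for c in remaining if c[1] != prev_shop), remaining[0])
        out.append(pick)
        remaining.remove(pick)
    return out

seq = v2_select([(0.9, "a"), (0.8, "a"), (0.7, "b")], length=3)
```

Note how the second‑best item (0.8, "a") is deferred one position to keep shops apart, a decision a one‑shot scorer cannot make.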

V3: Introduces reward‑driven training (reinforcement learning). An actor generates a sequence given the user and candidate items; an evaluator scores the sequence with a reward function that aggregates multiple utilities (click‑through rate, cold‑start ratio, diversity, etc.). Because training is driven by the reward rather than explicit labels, multi‑objective goals can be optimized end to end.
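The multi‑utility reward can be sketched as a weighted sum over per‑sequence utilities. The utility names and weight values below are illustrative; the paper aggregates metrics such as click‑through rate, cold‑start ratio, and diversity.

```python
def sequence_reward(utilities, weights):
    """Aggregate per-sequence utilities into one scalar reward."""
    return sum(weights[k] * utilities[k] for k in weights)

r = sequence_reward(
    utilities={"ctr": 0.12, "cold_start": 0.3, "diversity": 0.8},
    weights={"ctr": 1.0, "cold_start": 0.5, "diversity": 0.2},
)
```

The weight vector here is exactly the knob the hypernetwork design later makes adjustable at serving time.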

3. Actor‑critic framework

The actor generates a candidate sequence; the evaluator computes a reward and provides a gradient to the actor. The reward can incorporate any utility, even non‑differentiable ones such as shop dispersion count, because the learning signal comes from policy gradient methods rather than direct gradient of the utility.

Training loss consists of two parts: (1) a reward‑weighted term encouraging high‑reward sequences, and (2) a probability term that maximizes the likelihood of the generated sequence. Sampling during training introduces stochasticity, allowing the actor to explore diverse strategies.
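A minimal sketch of such a reward‑weighted training signal, in the REINFORCE style: the loss scales the negative log‑likelihood of each sampled sequence by its advantage, pushing probability mass toward high‑reward sequences. The mean‑reward baseline is a common variance‑reduction choice assumed here, not a detail from the talk.

```python
import math

def policy_gradient_loss(samples):
    """samples: list of (sequence_log_prob, reward) for sampled sequences."""
    baseline = sum(r for _, r in samples) / len(samples)
    # Reward-weighted negative log-likelihood: high-advantage sequences
    # get their log-probability pushed up, low-advantage ones pushed down.
    return -sum(lp * (r - baseline) for lp, r in samples) / len(samples)

loss = policy_gradient_loss([(math.log(0.6), 1.0), (math.log(0.4), 0.0)])
```

Because the reward enters only as a scalar multiplier, non‑differentiable utilities such as a shop‑dispersion count pose no problem.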

4. Real‑time controllable re‑ranking with hypernetworks

To enable on‑the‑fly adjustment of the objective weights w, a hypernetwork predicts the subset of model parameters θ_w that are sensitive to w; the remaining parameters stay static. At serving time, a user‑specific or traffic‑specific weight vector is supplied; the hypernetwork instantly generates the corresponding θ_w, and the re‑ranking model produces the optimal sequence for that weight configuration.
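A minimal sketch of the idea, assuming a simple linear hypernetwork and tiny illustrative dimensions: the hypernetwork maps the objective weight vector w to the weight‑sensitive parameters θ_w, while the rest of the model would stay fixed.

```python
def hypernet(w, H):
    """Generate theta_w = H @ w for an objective weight vector w."""
    return [sum(h_ij * w_j for h_ij, w_j in zip(row, w)) for row in H]

# Illustrative learned hypernetwork weights for a 2-objective setup,
# generating 3 weight-sensitive parameters.
H = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
theta_w = hypernet([0.8, 0.2], H)  # parameters for this weight vector
```

Changing w only re‑runs this cheap mapping, so no retraining is needed to serve a new objective trade‑off.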

This design eliminates the need to retrain the entire model when changing preferences, allowing rapid A/B testing and dynamic adaptation during high‑traffic events.

During offline training, each sample is paired with a randomly drawn weight vector w (uniformly from [0,1]) to ensure the model can handle any weight configuration at inference.
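The weight‑randomization scheme above amounts to drawing a fresh weight vector per training sample, uniformly from [0, 1] for each objective, so the model is exposed to every weight configuration it might see at inference. A minimal sketch:

```python
import random

def sample_weight_vector(n_objectives, rng=random):
    """Draw one objective weight uniformly from [0, 1] per objective."""
    return [rng.uniform(0.0, 1.0) for _ in range(n_objectives)]

random.seed(0)  # seeded only to make this toy example reproducible
w = sample_weight_vector(3)
```

Per the Q&A below, the same distribution used for these offline draws is used when supplying weights online.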

5. Experimental results

Online experiments compare the hypernetwork‑enabled controllable re‑ranking against the traditional pipeline baseline. Metrics such as click‑through rate, cold‑start ratio, shop diversity, and group ordering all show improvements when the weight vector is tuned in real time. The P99 latency of the re‑ranking service is around 20‑25 ms, demonstrating that the model remains lightweight despite its expressive power.

6. Q&A highlights

Weight vectors are typically set at the user‑group level (e.g., new vs. mature users).

During offline training, each batch samples a weight w ; the same distribution is used online.

Hard constraints (e.g., forced top‑position items) are enforced via attention masking.

Mixed‑media features are concatenated with zero‑padding; the model learns to handle heterogeneous inputs.

Offline evaluation uses the evaluator’s reward and a “better‑percentage” metric (the proportion of generated sequences that outperform the current online baseline, expected > 50%).
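The better‑percentage metric from the Q&A can be sketched as the share of generated sequences whose evaluator reward beats the reward of the corresponding online sequence, with above 50% as the bar. The pairing scheme below is an illustrative assumption.

```python
def better_percentage(pairs):
    """pairs: list of (generated_reward, online_reward) per request."""
    wins = sum(1 for generated, online in pairs if generated > online)
    return wins / len(pairs)

bp = better_percentage([(0.9, 0.7), (0.6, 0.8), (0.75, 0.7)])
```

A value above 0.5 indicates the generated sequences outperform the current online baseline more often than not.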

Conclusion

The presented controllable multi‑objective re‑ranking framework successfully replaces the rigid pipeline architecture, offering flexible, real‑time weight adjustment, superior multi‑objective performance, and low latency suitable for large‑scale e‑commerce feed recommendation.

Tags: Alibaba, recommendation systems, reinforcement learning, real-time control, multi-objective optimization, re-ranking, hypernetworks
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand offering media, video, search, and gaming services to over 700 million users, Sohu continuously drives technological innovation and practice. We'll share practical insights and tech news here.
