Generator-Evaluator Architecture for End-to-End Re-ranking in Information Flow
The paper introduces a Generator‑Evaluator (GE) architecture for end‑to‑end re‑ranking of information‑flow items. A pointer‑network seq2seq generator produces the reordered list, while a reward‑estimating evaluator jointly scores relevance and business utilities such as diversity, traffic control, inter‑group ordering, and fixed‑slot insertion. The approach achieves a better‑percentage above 70% and significant online gains on Taobao.
Complex information‑flow recommendation scenarios require not only relevance but also diverse business goals such as diversity, traffic control, multiple presentation formats, and fixed‑slot insertion. Traditional pipeline‑based recommender systems handle these objectives sequentially, leading to conflicts and sub‑optimal performance.
To address this, a novel Generator‑Evaluator (GE) architecture is proposed. The generator, built as a pointer‑network seq2seq model, produces an entire reordered list of items in an end‑to‑end fashion, while the evaluator estimates a reward that combines several utility functions reflecting both relevance and business constraints. The generator is trained with a reinforcement‑learning (REINFORCE) objective that maximizes the evaluator‑derived reward.
The encoder uses a DeepSet structure to obtain permutation‑invariant item embeddings, and the decoder employs a pointer network that selects items one by one, applying masking to prevent reselection and to enforce trigger‑item placement. Sampling strategies such as Thompson Sampling or Random Network Distillation are used to encourage exploration.
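The decoding loop described above can be sketched as follows. This is a minimal illustration, not the paper's model: the attention scoring is replaced by a simple dot product against a running context vector, and `pointer_decode` and its arguments are hypothetical names.

```python
import numpy as np

def pointer_decode(item_emb, n_steps, trigger_idx=None):
    """Greedy pointer-network decoding sketch: score the remaining items
    at each step, mask items already selected, and optionally force a
    trigger item into slot 0. The dot-product score is a stand-in for
    the paper's attention mechanism."""
    n, _ = item_emb.shape
    selected = np.zeros(n, dtype=bool)        # True = already in the output list
    context = item_emb.mean(axis=0)           # DeepSet-style permutation-invariant summary
    order = []
    for t in range(n_steps):
        if t == 0 and trigger_idx is not None:
            pick = trigger_idx                # enforce trigger-item placement
        else:
            scores = item_emb @ context
            scores[selected] = -np.inf        # masking prevents reselection
            pick = int(np.argmax(scores))
        selected[pick] = True
        order.append(pick)
        context = context + item_emb[pick]    # fold the chosen item into the context
    return order
```

Greedy argmax is used here for brevity; the paper's exploration strategies (Thompson Sampling, Random Network Distillation) would replace the argmax with a stochastic choice.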
Four key business objectives are formalized as utility functions: traffic control, diversity, inter‑group ordering, and fixed‑slot insertion. Each objective is expressed with mathematical formulas that are incorporated into the overall reward used for training.
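As a rough sketch of how such utilities could combine into one scalar reward, the toy functions below implement simplified stand-ins for three of the objectives (diversity as adjacent-category variety, traffic control as boosted-item exposure, fixed-slot insertion as a hard check). All function names, definitions, and weights are illustrative assumptions, not the paper's formulas.

```python
def diversity_utility(categories):
    """Fraction of adjacent item pairs with different categories (toy proxy)."""
    pairs = list(zip(categories, categories[1:]))
    return sum(a != b for a, b in pairs) / max(len(pairs), 1)

def traffic_utility(item_ids, boosted, top_k=3):
    """Share of traffic-boosted items landing in the top-k positions."""
    return len(set(item_ids[:top_k]) & boosted) / max(len(boosted), 1)

def fixed_slot_utility(item_ids, slot_map):
    """1.0 if every required slot -> item constraint is satisfied, else 0.0."""
    return float(all(item_ids[s] == i for s, i in slot_map.items()))

def total_reward(relevance, item_ids, categories, boosted, slot_map,
                 weights=(1.0, 0.5, 0.5, 0.5)):
    """Weighted sum of relevance and business utilities; the weights
    here are placeholders, not the paper's tuned values."""
    w_rel, w_div, w_tc, w_fix = weights
    return (w_rel * relevance
            + w_div * diversity_utility(categories)
            + w_tc * traffic_utility(item_ids, boosted)
            + w_fix * fixed_slot_utility(item_ids, slot_map))
```

The key design point is that every objective reduces to a scalar term, so the evaluator can expose one reward for the generator to maximize.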
Training proceeds in two stages: first, the evaluator is pretrained with supervised cross‑entropy loss to predict user feedback scores; second, the generator is optimized via policy gradient, using the difference between the generated sequence’s reward and the logged sequence’s reward as the advantage. Gradient clipping and reward‑shaping techniques are applied.
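A single policy-gradient step with the logged-sequence baseline can be sketched as below. This is a simplified numpy illustration under assumed shapes (`logits` holds one row of item scores per decoding step); the function name and learning-rate/clip values are hypothetical.

```python
import numpy as np

def reinforce_step(logits, chosen, gen_reward, logged_reward, lr=0.1, clip=1.0):
    """One REINFORCE update sketch: the advantage is the generated
    list's evaluator reward minus the logged list's reward, and the
    gradient of -advantage * log pi(chosen) is norm-clipped before
    the descent step."""
    adv = gen_reward - logged_reward
    # softmax over items at each decoding step
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # grad of -log pi wrt logits is (probs - one_hot(chosen)), scaled by advantage
    grad = probs.copy()
    grad[np.arange(len(chosen)), chosen] -= 1.0
    grad *= adv
    norm = np.linalg.norm(grad)
    if norm > clip:                       # gradient clipping
        grad *= clip / norm
    return logits - lr * grad             # descend the policy-gradient loss
```

With a positive advantage the update raises the log-probability of the chosen items; using the logged sequence's reward as the baseline keeps the advantage centered without training a separate value head.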
Experiments on Taobao’s information‑flow feed demonstrate that the GE model improves both relevance and business metrics. The generator achieves a “better‑percentage” of over 70%, meaning it produces sequences with higher evaluator scores than the original online rankings, and the model has been fully deployed with significant online gains.
References to related works (e.g., MiDNN, Seq2Slate, PRM, DLCM, GSF) are provided, and the development team is introduced as the Taobao Private‑Domain User Algorithm team.
DaTaobao Tech
Official account of DaTaobao Technology