How Reinforcement Learning Revolutionizes E‑commerce Product Ranking

This article details the evolution of AliExpress product ranking from simple DNN scoring to advanced reinforcement‑learning re‑ranking, comparing multiple models, exploring context effects, introducing pointer‑network generators, evaluating various RL algorithms, and reporting significant online gains in conversion and GMV.

Alibaba Cloud Developer

Product Ranking Re‑ranking

Product ranking aims to give high‑quality items better exposure by matching user demand. Classic pointwise scoring, however, rates each item in isolation and ignores the effect of the surrounding items, which leads to suboptimal ordering.

Is scoring without context enough?

Three models were tested: Simple DNN (pointwise), AE rerank (adds global statistical context), and an enhanced model that further incorporates CNN and RNN modules to capture sequence information. Experiments on 54 M daily training samples and 2.5 M test samples showed that models considering context outperform the simple DNN.

Does ordering by original scores remain reliable?

A DCG‑like metric, which multiplies each item's predicted purchase probability by a position discount, avoids the flaw of directly summing scores. Measured this way, ordering by the original scores improves the sequence in only about 66% of cases, so it is not the optimal strategy.
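As a rough illustration of such a metric, the sketch below weights each predicted purchase probability by a position discount. The logarithmic discount is borrowed from standard DCG and is an assumption, as are the function names and the sample probabilities; the article does not state the exact discount used.

```python
import math

def position_discount(rank):
    """Logarithmic discount for a 1-based rank, as in standard DCG (assumed here)."""
    return 1.0 / math.log2(rank + 1)

def dcg_like_score(purchase_probs):
    """Sum of predicted purchase probabilities, each weighted by its position discount."""
    return sum(p * position_discount(rank)
               for rank, p in enumerate(purchase_probs, start=1))

# Hypothetical predicted purchase probabilities for one displayed list.
original = [0.02, 0.10, 0.05]
by_score = sorted(original, reverse=True)

# A plain sum cannot tell the two orderings apart; the discounted sum can.
print(math.isclose(sum(original), sum(by_score)))            # True
print(dcg_like_score(by_score) > dcg_like_score(original))   # True
```

In this toy the probabilities are fixed, so sorting by score always wins. In the setting the article describes, the probabilities depend on the surrounding items, which is why sorting by the original scores does not always improve the metric.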

Sequence Generator

The generator takes a candidate set of items and outputs an ordered list using a pointer‑network‑style architecture with DNN and LSTM components. It can handle variable candidate sizes and runs twice as fast as a two‑LSTM pointer network.
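The decoding loop of such a generator can be sketched without the neural components. In the minimal example below, a hand-written `toy_score` function stands in for the DNN and LSTM scorer; it, the item fields, and the category penalty are all hypothetical and only show how a context-aware, pointer-style selection handles a candidate set of any size.

```python
def generate_sequence(candidates, score_fn):
    """Greedy pointer-style decoding: repeatedly point at one remaining candidate."""
    remaining = list(candidates)
    ordered = []
    while remaining:
        # Each remaining item is scored given the items already placed.
        scores = [score_fn(ordered, item) for item in remaining]
        best = max(range(len(remaining)), key=scores.__getitem__)
        ordered.append(remaining.pop(best))
    return ordered

def toy_score(ordered, item):
    """Hypothetical scorer: relevance minus a penalty for repeating the previous category."""
    penalty = 0.5 if ordered and ordered[-1]["cat"] == item["cat"] else 0.0
    return item["rel"] - penalty

items = [
    {"id": "a", "cat": "shoes", "rel": 0.9},
    {"id": "b", "cat": "shoes", "rel": 0.8},
    {"id": "c", "cat": "bags", "rel": 0.6},
]
print([it["id"] for it in generate_sequence(items, toy_score)])  # ['a', 'c', 'b']
```

A pointwise sort by `rel` would return a, b, c; the context-dependent scorer moves c ahead of b because b would repeat the category of a.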

Reinforcement Learning Path

The evaluator predicts the expected purchase count for a sequence and serves as the reward signal for the generator. The problem is framed as a policy‑based RL task in which the state is the items already ordered plus the remaining candidates, the action is the choice of the next item, and the reward is the predicted purchase probability.
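One way to picture this framing is a single sampled episode. The sketch below assumes a softmax policy over the remaining candidates and scores the finished sequence with an evaluator; `toy_logits` and `toy_evaluator` are hypothetical stand-ins for the generator network and the learned evaluator, not the models from the article.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_episode(candidates, policy_logits, evaluator, rng):
    """Roll out one ranking episode; return (sequence, log-probability, reward)."""
    ordered, remaining, log_prob = [], list(candidates), 0.0
    while remaining:
        # State = (ordered, remaining); action = index of the next item to place.
        probs = softmax(policy_logits(ordered, remaining))
        action = rng.choices(range(len(remaining)), weights=probs)[0]
        log_prob += math.log(probs[action])
        ordered.append(remaining.pop(action))
    # The reward arrives once, for the completed sequence.
    return ordered, log_prob, evaluator(ordered)

def toy_logits(ordered, remaining):
    """Hypothetical policy: logits equal to each item's relevance."""
    return [item["rel"] for item in remaining]

def toy_evaluator(sequence):
    """Hypothetical evaluator: position-discounted relevance of the sequence."""
    return sum(item["rel"] / math.log2(rank + 1)
               for rank, item in enumerate(sequence, start=1))

items = [{"id": "a", "rel": 0.9}, {"id": "b", "rel": 0.8}, {"id": "c", "rel": 0.6}]
sequence, log_prob, reward = sample_episode(items, toy_logits, toy_evaluator,
                                            random.Random(0))
print([it["id"] for it in sequence], log_prob, reward)
```

The accumulated `log_prob` is the quantity a policy-gradient update would differentiate, weighted by the reward.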

Algorithm Design

Starting from the simple REINFORCE loss, several improvements were explored: subtracting the mean reward, normalizing by reward variance (reinforce_2), clipping extreme values (reinforce_3), and finally using PPO with a critic to estimate the value function. Monte‑Carlo sampling (PPO_MC) was introduced to approximate the value by averaging rewards over multiple sampled permutations.
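The progression above can be sketched as a handful of small functions. This is a minimal illustration under several assumptions: normalization divides by the standard deviation rather than the variance, clipping is applied to the advantages, and the PPO loss is the standard clipped surrogate with a hypothetical `eps` of 0.2. How reinforce_2 and reinforce_3 were actually implemented is not stated in the article.

```python
import math
import statistics

def advantages(rewards, normalize=False, clip=None):
    """Mean-subtracted rewards with optional normalization and clipping."""
    mean = statistics.mean(rewards)
    adv = [r - mean for r in rewards]
    if normalize:
        # Assumption: scale by the standard deviation of the sampled rewards.
        std = statistics.pstdev(rewards)
        adv = [a / (std + 1e-8) for a in adv]
    if clip is not None:
        # Assumption: clip extreme advantages to [-clip, clip].
        adv = [max(-clip, min(clip, a)) for a in adv]
    return adv

def reinforce_loss(log_probs, advs):
    """REINFORCE surrogate: minimizing it raises log-probabilities of high-advantage samples."""
    return -sum(lp * a for lp, a in zip(log_probs, advs)) / len(advs)

def ppo_clip_loss(new_log_probs, old_log_probs, advs, eps=0.2):
    """Standard PPO clipped surrogate loss over a batch of sampled sequences."""
    total = 0.0
    for new_lp, old_lp, a in zip(new_log_probs, old_log_probs, advs):
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * a, clipped * a)
    return -total / len(advs)

# Hypothetical evaluator rewards for three permutations of one candidate set.
rewards = [1.0, 2.0, 3.0]
print(advantages(rewards))                            # [-1.0, 0.0, 1.0]
print(advantages(rewards, normalize=True, clip=1.0))  # [-1.0, 0.0, 1.0]
```

When the rewards come from several sampled permutations of the same candidate set, their mean acts as a Monte‑Carlo estimate of the state value, which is the idea behind PPO_MC; whether the original system computed the baseline exactly this way is an assumption.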

Online Deployment and Real‑world Impact

Models were retrained daily using two weeks of BTS experiment data and deployed after a short offline training cycle. The GMV‑focused RL model achieved notable price uplift during the National Day promotion, while a combined PAY_GMV model improved both conversion rate and overall GMV.

Summary and Future Outlook

After five months, the RL re‑ranking system proved flexible and effective, especially during major sales events. Future work includes improving the evaluator, exploring GAIL‑style adversarial training, and developing real‑time models to continuously adapt to user behavior.

GAIL Reinforcement Learning Re‑ranking

By adding a discriminator that scores generated sequences lower than original ones, the generator learns to produce high‑reward sequences that also stay close to the training data distribution.
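A minimal sketch of how the two signals might be combined follows. It assumes the discriminator outputs the probability that a sequence comes from the logged data and adds the log of that probability to the evaluator reward, as is common in GAIL-style setups; the `weight` parameter and both function names are hypothetical, and the article does not specify the actual combination.

```python
import math

def gail_reward(evaluator_score, disc_prob_original, weight=1.0, eps=1e-8):
    """Evaluator reward plus a bonus for sequences the discriminator takes for logged data."""
    return evaluator_score + weight * math.log(disc_prob_original + eps)

def discriminator_loss(probs_on_original, probs_on_generated, eps=1e-8):
    """Binary cross-entropy: original sequences labeled 1, generated sequences labeled 0."""
    loss = -sum(math.log(p + eps) for p in probs_on_original)
    loss -= sum(math.log(1.0 - p + eps) for p in probs_on_generated)
    return loss / (len(probs_on_original) + len(probs_on_generated))

# With equal evaluator scores, the sequence that looks more like logged data earns more reward.
print(gail_reward(1.0, 0.9) > gail_reward(1.0, 0.1))  # True
```

The penalty term discourages the generator from exploiting regions where the evaluator has seen little training data and its predictions are less trustworthy.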

The AUC Dilemma

Offline AUC correlates poorly with online conversion; models with high AUC can even hurt UV conversion. Position bias and the distribution shift that re‑ranking itself introduces make AUC an unreliable metric, which prompted the use of the RL evaluator as a more consistent offline proxy.

World Model: Sequence Evaluator

Following Joachims' counterfactual evaluation ideas, the sequence evaluator serves as a world model to estimate unbiased rewards, bridging offline training and online performance.

Real‑time Reinforcement Learning

Future directions involve making the evaluator online to provide immediate feedback, or eliminating it altogether for direct interaction with the live environment, though the latter remains challenging.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
