Tackling Pseudo-Exposure in Mobile E-Commerce: A Contextual Multiple-Play Bandit Approach

To address the pseudo-exposure problem that reduces click-through rates in mobile e-commerce recommendation, the authors model the task as a contextual multiple-play bandit, extend the linear reward model with sample weighting and item similarity, prove sublinear regret, and report significant CTR gains on real Taobao data.

Alibaba Cloud Developer

Dec 12, 2018

Introduction

Online recommendation services often suffer from the pseudo‑exposure problem: many items are shown to users on mobile devices but are never actually viewed, leading to biased negative samples and reduced click‑through rate (CTR). To solve this, we model the recommendation task as a contextual multiple‑play bandit and propose several improvements.

Key Contributions

We formulate online recommendation as a contextual multiple-play bandit problem and use a linear reward model so that the approach scales to a large set of candidate items.

We introduce a method that uses the probability of an item being seen as a weight for each sample, mitigating pseudo‑exposure bias.

We incorporate item similarity (cosine similarity) into a hybrid linear model to promote diverse recommendations and improve CTR.

We provide theoretical analysis showing sublinear regret for the proposed algorithms.

Experiments on a real Taobao dataset demonstrate CTR improvements of 15.1% and 12.9% over Learning‑to‑Rank (LTR) under two evaluation metrics, outperforming existing contextual bandit baselines.

Problem Description

In mobile e‑commerce, only a few items are visible on the screen at a time. Users often click on the first few items and may leave the scene before seeing the rest, creating pseudo‑exposure where the system cannot tell whether later items were viewed. Assuming all displayed items were examined leads to biased negative samples and degrades learning.
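The size of this bias can be illustrated with a small worked calculation. The sketch below assumes a position-based click model in which a click requires the item to be both examined and attractive; the function name and all numbers are hypothetical and are not taken from the paper.

```python
def naive_vs_weighted_ctr(attractiveness, exam_probs, impressions):
    """Expected CTR estimates for one item shown at several positions.

    attractiveness: assumed click probability once the item is examined
    exam_probs:     assumed probability that each position is actually seen
    impressions:    number of times the item was displayed at each position
    """
    # Expected clicks: only examined impressions can produce a click.
    expected_clicks = sum(attractiveness * w * n
                          for w, n in zip(exam_probs, impressions))
    # Naive estimate: every displayed impression counts as examined.
    naive = expected_clicks / sum(impressions)
    # Weighted estimate: each impression counts only with its examination probability.
    weighted = expected_clicks / sum(w * n for w, n in zip(exam_probs, impressions))
    return naive, weighted


naive, weighted = naive_vs_weighted_ctr(0.2, [1.0, 0.5, 0.25], [100, 100, 100])
print(naive, weighted)  # naive is about 0.117, weighted is about 0.2
```

Under these assumptions the naive estimate is roughly 42% below the true attractiveness, purely because unseen impressions were counted as negatives.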

Formulation

The task is cast as a contextual multiple-play bandit problem. At each round t, the algorithm observes a user context u(t) and a set A(t) of m candidate arms, each described by a d-dimensional feature vector x(t,a). It selects a subset S(t) of K items, receives a reward r(t,a) for each selected item, and updates its policy from the logged tuple (R, x, L) of rewards, feature vectors, and display positions. A linear reward model estimates the expected CTR of each arm from its features.
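As a rough illustration of the selection step, the sketch below scores m candidates with a linear model and keeps the K highest. It is purely greedy (no exploration term), and the function name `select_slate`, the parameter vector `theta`, and the toy numbers are assumptions made for this example.

```python
import numpy as np


def select_slate(X, theta, K):
    """Return the indices of the K candidates with the highest score x^T theta."""
    scores = X @ theta                           # estimated CTR per candidate, shape (m,)
    order = np.argsort(-scores, kind="stable")   # best candidate first
    return order[:K], scores


# Four hypothetical candidates with d = 2 features each.
X = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5],
              [0.1, 0.1]])
theta = np.array([1.0, 0.5])
slate, scores = select_slate(X, theta, K=2)
print(slate)  # [0 2]
```

In a bandit algorithm the scores would include a confidence bonus, so that rarely shown items still get a chance to be selected.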

Algorithm

We first combine the Independent Bandit Algorithm (IBA) with a linear reward model (LinUCB) to obtain IBALinUCB. To handle pseudo‑exposure, we estimate the exposure probability w(l) for each position using a Position‑Based Model (PBM) and use it as a sample weight in the linear update (IBALinUCB+SW).
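One natural way to use w(l) as a sample weight is to scale each observation's contribution to the ridge-regression statistics behind LinUCB. The sketch below does exactly that; it is a minimal illustration under that assumption, not the paper's exact update rule. The class name `WeightedLinUCB` and the regularizer `lam` are hypothetical, and a single shared model is used for brevity where IBA would keep one learner per position.

```python
import numpy as np


class WeightedLinUCB:
    """LinUCB with per-sample weights (illustrative sketch)."""

    def __init__(self, d, alpha=1.0, lam=1.0):
        self.alpha = alpha            # exploration parameter
        self.A = lam * np.eye(d)      # regularized, weighted Gram matrix
        self.b = np.zeros(d)          # weighted sum of reward * features

    def ucb(self, X):
        """Upper confidence bound for each row of X, shape (m, d)."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        mean = X @ theta
        # Confidence width sqrt(x^T A^{-1} x), computed row by row.
        width = np.sqrt(np.einsum("ij,jk,ik->i", X, A_inv, X))
        return mean + self.alpha * width

    def update(self, x, reward, weight):
        """Add one observation; weight plays the role of the exposure probability w(l)."""
        self.A += weight * np.outer(x, x)
        self.b += weight * reward * x


model = WeightedLinUCB(d=2)
model.update(np.array([1.0, 0.0]), reward=1.0, weight=0.5)
scores = model.ucb(np.array([[1.0, 0.0], [0.0, 1.0]]))
```

With this weighting, a non-click at a position that is rarely seen barely moves the estimate, while a non-click at the top position counts almost in full.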

We further integrate item similarity via a cosine similarity vector z(a) into a hybrid linear model, yielding IBAHybridUCB+SW, which selects items based on an upper confidence bound that accounts for both context and similarity.
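The summary does not say which items z(a) compares a candidate against. The sketch below assumes, for illustration only, that z(a) holds the cosine similarity between candidate a and each item already placed in the slate; the function name and the `eps` guard against zero-norm vectors are likewise hypothetical.

```python
import numpy as np


def cosine_similarity_vector(x_candidate, X_selected, eps=1e-12):
    """Cosine similarity between one candidate and each already-selected item."""
    num = X_selected @ x_candidate
    den = np.linalg.norm(X_selected, axis=1) * np.linalg.norm(x_candidate)
    return num / np.maximum(den, eps)   # eps avoids division by zero


x = np.array([1.0, 0.0])
X_selected = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [-2.0, 0.0]])
z = cosine_similarity_vector(x, X_selected)
print(z)  # [ 1.  0. -1.]
```

A hybrid linear model can then learn a coefficient on such similarity features alongside the context features, so that near-duplicates of items already in the slate are scored differently from novel ones.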

Theoretical Analysis

We prove sublinear regret for both IBALinUCB+SW and IBAHybridUCB+SW under the contextual multiple‑play bandit setting. The regret bounds are derived for the cumulative sum of rewards and for the set‑based reward, showing that the algorithms converge to the optimal K‑item selection with high probability.
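For reference, a standard way to write regret in this setting is given below. The notation is an assumption made here rather than the paper's own: S*(t) denotes the best K-item set at round t, and F is the per-round objective, either the number of clicks or an indicator that at least one item was clicked.

```latex
R(T) = \sum_{t=1}^{T} \Big( \mathbb{E}\big[F\big(S^{*}(t)\big)\big] - \mathbb{E}\big[F\big(S(t)\big)\big] \Big),
\qquad
\lim_{T \to \infty} \frac{R(T)}{T} = 0,

F_{\mathrm{sum}}\big(S(t)\big) = \sum_{a \in S(t)} r(t,a),
\qquad
F_{\mathrm{set}}\big(S(t)\big) = \mathbb{1}\Big[ \sum_{a \in S(t)} r(t,a) \ge 1 \Big].
```

Sublinear regret means the average per-round gap to the optimal selection vanishes as T grows.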

Experiments

We evaluate on a real Taobao dataset containing 180k user sessions, each with 12 displayed items out of >900 candidates and 56‑dimensional context vectors. Evaluation metrics include cumulative CTR (sum of clicks) and set CTR (whether at least one item is clicked). Baselines: PBM‑UCB, LinUCB‑k, RBA+LinUCB, Learning‑to‑Rank (LTR), and the original IBALinUCB.
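The two metrics can be computed from a binary click matrix with one row per session and one column per displayed item. The sketch below averages both over sessions; that normalization, the function name, and the toy data are assumptions for illustration.

```python
import numpy as np


def sum_and_set_ctr(clicks):
    """Return (mean clicks per session, fraction of sessions with at least one click)."""
    clicks = np.asarray(clicks)
    per_session = clicks.sum(axis=1)      # number of clicks in each session
    sum_ctr = per_session.mean()          # cumulative-click metric, averaged
    set_ctr = (per_session > 0).mean()    # set metric: at least one click
    return float(sum_ctr), float(set_ctr)


# Four hypothetical sessions with three displayed items each.
clicks = [[1, 0, 1],
          [0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]
sum_ctr, set_ctr = sum_and_set_ctr(clicks)
print(sum_ctr, set_ctr)  # 0.75 0.5
```

The two metrics can rank policies differently: a slate of near-duplicates may collect several clicks from one interested user, while a more diverse slate tends to raise the share of sessions with any click.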

Results show that IBAHybridUCB+SW consistently outperforms the baselines, achieving CTR improvements of up to 15.1% and 12.9% over LTR under the two metrics. Performance is robust across different numbers of recommended items K and different values of the exploration parameter α.

Conclusion

The paper presents a comprehensive solution to the pseudo‑exposure issue in large‑scale mobile e‑commerce recommendation by modeling it as a contextual multiple‑play bandit, introducing exposure‑aware sample weighting, and leveraging item similarity. Theoretical guarantees and extensive experiments on real data confirm significant CTR gains over existing methods.
