Reinforcement Learning for Pacing in Preloaded Ads (RLTP)
The paper introduces RLTP, a reinforcement‑learning‑based pacing system that models delayed‑impression preloaded ads as an MDP, uses a dueling DQN to select traffic probabilities, and simultaneously meets exposure targets, ensures smooth delivery, and maximizes CTR, outperforming rule‑based and PID baselines while removing complex multi‑stage pipelines.
This article presents the algorithmic practice of using reinforcement learning (RL) to address the delayed-impression problem in preloaded advertising; the work was published at KDD 2023.
Paper: RLTP: Reinforcement Learning to Pace for Delayed Impression Modeling in Preloaded Ads
1. Background
Preloaded ads are a common strategy in splash advertising: the ad shown for the current request was actually filled during a previous request. Because the media side decides the final exposure, the advertiser can only control whether to select a request, not whether the ad will ultimately be displayed. This creates a delayed-impression challenge: the observed exposure count at any moment is incomplete, so pacing based solely on observed data risks over-delivery or under-delivery.
2. Motivation
The delayed feedback and the black-box nature of the media's preloading policy make traditional PID-based pacing unreliable. RL, which maximizes long-term reward through interaction with an environment, is well suited to these conditions.
3. Problem Formalization
The goal is to meet two objectives by the end of a campaign: (1) achieve the target exposure volume without excessive overshoot and with smooth delivery, and (2) maximize click-through rate (CTR) as a measure of effectiveness. The delayed-impression phenomenon is modeled as a Markov Decision Process (MDP): each time window's state reflects the post-window delivery status, the action is the traffic-selection probability, and the reward combines four components (volume preservation, overshoot penalty, smoothness, and CTR maximization).
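To make the MDP concrete, one window of the process can be sketched as below. The names (`PacingState`, `step_window`) and the field choices are illustrative assumptions, not the paper's implementation; the key point is that only the *observed* exposures enter the next state, since true exposures arrive with delay.

```python
from dataclasses import dataclass

@dataclass
class PacingState:
    observed_exposures: int   # exposures observed so far (incomplete due to delay)
    target_exposures: int     # campaign exposure target
    window_index: int         # current time window within the campaign

def step_window(state: PacingState, action_prob: float, new_exposures: int) -> PacingState:
    """Advance one time window. The action is the traffic-selection
    probability; in a real simulator `new_exposures` would be drawn as a
    function of `action_prob` and the media's (black-box) preload policy.
    Only the observed count is folded into the next state."""
    return PacingState(
        observed_exposures=state.observed_exposures + new_exposures,
        target_exposures=state.target_exposures,
        window_index=state.window_index + 1,
    )
```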
4. RLTP Framework
4.1 State Representation
States consist of statistical features (cumulative observed exposures, click counts, derived ratios), context features (week, hour, minute), and user/ad embeddings (pre-trained and frozen).
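A minimal sketch of assembling such a state vector, under assumptions of my own: the specific ratios and the cyclical time encoding are illustrative choices, not the paper's exact feature set.

```python
import math

def build_state(exposures: int, clicks: int, target: int,
                weekday: int, hour: int, minute: int,
                ad_embedding: list[float]) -> list[float]:
    """Concatenate statistical, context, and embedding features."""
    stats = [
        exposures / max(target, 1),   # delivery-progress ratio
        clicks / max(exposures, 1),   # observed CTR so far
    ]
    # Cyclical encoding keeps time features continuous across boundaries
    # (e.g. 23:55 and 00:00 end up close in feature space).
    context = [
        math.sin(2 * math.pi * weekday / 7), math.cos(2 * math.pi * weekday / 7),
        math.sin(2 * math.pi * hour / 24),   math.cos(2 * math.pi * hour / 24),
        math.sin(2 * math.pi * minute / 60), math.cos(2 * math.pi * minute / 60),
    ]
    return stats + context + list(ad_embedding)  # embeddings stay frozen
```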
4.2 Action Space
Both discrete (step size 0.02) and continuous action spaces are explored; the discrete version is used in the main RLTP model.
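With step size 0.02 over probabilities in [0, 1], the discrete action space can be enumerated as below (a sketch mirroring that construction, not the paper's code).

```python
def discrete_actions(step: float = 0.02) -> list[float]:
    """Enumerate traffic-selection probabilities 0.0, step, 2*step, ..., 1.0."""
    n = round(1.0 / step)
    # Round to suppress floating-point drift in the grid values.
    return [round(i * step, 10) for i in range(n + 1)]
```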
4.3 Reward Estimator
Four weighted rewards are summed: (a) encourage meeting the exposure target, (b) penalize overshoot heavily, (c) reward smooth probability changes across windows, and (d) reward higher CTR than a baseline. The weights are learned as parameters.
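A hedged sketch of combining the four components. The paper learns the weights as parameters; the fixed tuple `w` below is a placeholder, and the individual term formulas are illustrative, not the paper's exact definitions.

```python
def pacing_reward(exposures: int, target: int,
                  prob: float, prev_prob: float,
                  ctr: float, baseline_ctr: float,
                  w: tuple = (1.0, 2.0, 0.5, 1.0)) -> float:
    """Weighted sum of the four reward components."""
    progress = min(exposures / max(target, 1), 1.0)          # (a) meet the target
    overshoot = max(exposures - target, 0) / max(target, 1)  # (b) penalize overshoot
    smoothness = -abs(prob - prev_prob)                      # (c) smooth probability changes
    ctr_gain = ctr - baseline_ctr                            # (d) beat the baseline CTR
    return w[0] * progress - w[1] * overshoot + w[2] * smoothness + w[3] * ctr_gain
```

Note the sign convention: overshoot and large probability jumps subtract from the reward, so a policy that exactly hits the target with stable probabilities and above-baseline CTR scores highest.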
4.4 Network Architecture
A dueling DQN approximates the value and advantage functions for the large state-action space. The network receives the state vector and outputs Q-values for each discrete action.
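The dueling decomposition itself is Q(s, a) = V(s) + A(s, a) − mean over a′ of A(s, a′). The snippet below shows just that aggregation step; the scalar value and advantage inputs stand in for the outputs of the two learned network streams.

```python
def dueling_q(value: float, advantages: list[float]) -> list[float]:
    """Combine the value stream and advantage stream into Q-values,
    centering advantages by their mean for identifiability."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]
```

Subtracting the mean advantage resolves the ambiguity between V and A (adding a constant to V and subtracting it from every A would otherwise leave Q unchanged).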
5. Experiments
Offline simulators built from a week of historical logs (5‑minute windows, 288 steps per day) were used for training and evaluation. Baselines include rule‑based truncation, prediction‑plus‑PID, and a continuous‑action RLTP variant (PPO). RLTP (discrete) and RLTP‑continuous achieve comparable performance, both surpassing baselines in delivery completion rate, CTR, and cumulative reward.
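As a quick sanity check on the simulator's time discretization, 5-minute windows give 24 × 60 / 5 = 288 steps per day, matching the setup above.

```python
def windows_per_day(window_minutes: int = 5) -> int:
    """Number of fixed-length pacing windows in one day."""
    return 24 * 60 // window_minutes
```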
Training converges after roughly 30k episodes, yielding stable cumulative reward and high delivery completion. Ablation studies show that removing the CTR‑maximization reward degrades both CTR and the alignment between selected traffic probability and observed CTR.
6. Conclusion
By directly learning a pacing policy via RL, the end-to-end RLTP framework eliminates the need for multi-stage pipelines (prediction + rules + PID) and meets both the exposure-preservation and performance goals in delayed-impression scenarios.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.