Actor‑Critic Reinforcement Learning for Real‑Time Bidding in Mobile Game Advertising
The paper proposes ACRL, an actor‑critic reinforcement‑learning model that combines PPO with a deep structured semantic model to optimize real‑time bidding strategies for mobile‑game ads under CPM and budget constraints. It addresses long user lifecycles and sparse conversion data, and demonstrably improves ROI in both offline simulations and online A/B tests.
Online display advertising now generates billions of dollars in revenue, and real‑time bidding (RTB) has become a dominant paradigm for matching ads to users. In the mobile‑game context, user lifecycles are longer and conversion events (downloads, in‑app purchases) are delayed and sparse, making it difficult for advertisers to estimate the value of each impression.
To maximize total revenue under CPM and budget limits, the authors introduce a novel reinforcement‑learning framework called ACRL. The model combines a Proximal Policy Optimization (PPO)‑based actor‑critic architecture with a Deep Structured Semantic Model (DSSM) that embeds audience and media features, allowing the actor to output a two‑dimensional Gaussian distribution that represents both win probability and bid quality.
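The DSSM component can be pictured as a two‑tower network: one tower embeds audience features, the other embeds media features, and their cosine similarity gives a semantic match score. A minimal sketch, assuming hypothetical feature and embedding dimensions (32 → 16 → 8) not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(x, w1, w2):
    """One DSSM tower: two dense layers with tanh, then L2-normalization."""
    h = np.tanh(x @ w1)
    z = np.tanh(h @ w2)
    return z / np.linalg.norm(z)

# Hypothetical dimensions: 32-d raw features -> 16 hidden -> 8-d embedding.
w_a1, w_a2 = rng.normal(size=(32, 16)), rng.normal(size=(16, 8))
w_m1, w_m2 = rng.normal(size=(32, 16)), rng.normal(size=(16, 8))

audience = rng.normal(size=32)   # user-side features
media = rng.normal(size=32)      # ad-slot / media-side features

# Cosine similarity of the two normalized embeddings is the match score.
score = float(tower(audience, w_a1, w_a2) @ tower(media, w_m1, w_m2))
```

Because both embeddings are unit‑normalized, the score is bounded in \([-1, 1]\) and can feed directly into the actor's state representation.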
Problem definition: given a set of impression opportunities \(N\), each with potential value \(v_i\) and cost \(c_i\), the objective is to maximize \(\sum_i v_i p_i\) subject to total budget \(C\) and minimum exposure \(K\). The authors extend the positive‑sample definition by incorporating in‑app purchase amount, shallow conversion value, and historical media purchase value, enabling richer reward signals.
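To make the offline objective concrete, a greedy value‑density heuristic illustrates the budget‑constrained selection problem; this is only a sketch of the optimization target, not the paper's RL solution, and the exposure floor \(K\) would be an additional feasibility check:

```python
def select_impressions(values, costs, budget):
    """Greedy sketch of max sum(v_i) s.t. sum(c_i) <= C: pick impressions
    by value density until the budget is exhausted. (The paper's minimum
    exposure K and probabilistic wins are omitted; ACRL solves this with
    RL rather than greedily.)"""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / costs[i], reverse=True)
    chosen, spent = [], 0.0
    for i in order:
        if spent + costs[i] <= budget:
            chosen.append(i)
            spent += costs[i]
    return chosen, spent

# Toy instance: three opportunities, budget C = 8.
chosen, spent = select_impressions([10.0, 6.0, 4.0], [5.0, 4.0, 3.0], budget=8.0)
```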
The state space \(\mathcal{S}\) includes features such as predicted click‑through rate (pCTR), predicted conversion rate (pCVR), ARPU, and media slot identifiers. The action space \(\mathcal{A}\) is a continuous vector governing the bidding decision. Reward at time \(t\) is defined as \(r_t = \sum_{i\in\mathcal{I}_t} v_i p_i\), where \(\mathcal{I}_t\) is the set of impressions handled at step \(t\), with a discount factor \(\gamma = 1\) (undiscounted). The policy \(\pi\) maps states to a Gaussian distribution, from which actions are sampled.
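The Gaussian policy and undiscounted reward can be sketched as follows; the linear policy head, feature values, and fixed \(\sigma\) are illustrative assumptions, since the paper learns these parameters:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_action(mu, sigma):
    """Sample a continuous action from the Gaussian policy pi(a | s)."""
    return rng.normal(mu, sigma)

# Hypothetical 4-d state: pCTR, pCVR, ARPU, a slot-id embedding value.
state = np.array([0.03, 0.004, 1.2, 0.7])

# Toy linear head producing the Gaussian mean; sigma is learned in
# practice but fixed here for the sketch.
W = rng.normal(size=(4, 2)) * 0.1
mu = state @ W
action = sample_action(mu, sigma=0.1)   # 2-d continuous action

# Reward at step t: sum of value * win-probability over the step's
# impressions, with gamma = 1 (undiscounted).
values = np.array([0.5, 1.2])
probs = np.array([0.3, 0.1])
r_t = float(values @ probs)   # 0.5*0.3 + 1.2*0.1 = 0.27
```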
The training algorithm follows PPO with clipped surrogate objectives and importance‑sampling weights based on expected CPM (ECPM). Ratio clipping (\(\epsilon = 0.2\)) stabilizes updates, and training proceeds until convergence. Pseudocode and the clipping function are illustrated in the accompanying figure.
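The clipped surrogate objective is standard PPO; a minimal numpy version (with the paper's \(\epsilon = 0.2\)) shows how the probability ratio is bounded before being multiplied by the advantage:

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: loss is the negated objective, so gradient
    descent on it performs the PPO policy update."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()

# Toy batch: pi_new / pi_old ratios and advantage estimates.
ratios = np.array([0.8, 1.0, 1.5])
advs = np.array([1.0, -0.5, 2.0])
loss = ppo_clip_loss(ratios, advs)   # -mean([0.8, -0.5, 2.4]) = -0.9
```

The third sample's ratio 1.5 is clipped to 1.2, so its contribution is capped at 2.4 rather than 3.0, which is exactly the update‑stabilizing effect the paper relies on.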
For online deployment, the actor outputs two values: \(R_1\) (probability of winning the impression) and \(R_2\) (score of the impression). Bidding adjustments are computed as \((1 + r(u,a)) \times \text{ECPM}\), where \(r(u,a)\) is derived from the ROI estimate \(f(u,a)\) and the score \(R_2\). This mechanism guides the real‑time bidding system to raise bids on high‑quality traffic and lower them on low‑quality traffic.
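A sketch of the bid adjustment: the exact way \(r(u,a)\) combines the ROI estimate \(f(u,a)\) and the actor score \(R_2\) is not spelled out here, so a simple linear blend with a hypothetical weight `alpha` is assumed:

```python
def adjusted_bid(ecpm, roi_estimate, score, alpha=0.5):
    """Sketch of the (1 + r(u, a)) * ECPM adjustment. The blend of the
    ROI estimate f(u, a) and actor score R2 into r(u, a) is assumed
    linear with hypothetical weight alpha (not from the paper)."""
    r = alpha * roi_estimate + (1.0 - alpha) * score
    return (1.0 + r) * ecpm

# High-quality traffic gets a raised bid; low-quality gets a lowered one.
high = adjusted_bid(ecpm=10.0, roi_estimate=0.4, score=0.2)    # r = 0.3  -> 13.0
low = adjusted_bid(ecpm=10.0, roi_estimate=-0.2, score=-0.4)   # r = -0.3 -> 7.0
```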
Experiments use real bidding logs from Tencent's ad platform. The actor network has two hidden layers (400 and 100 units); the critic has two hidden layers (200 and 100 units). Hyper‑parameters (learning rates, batch size, replay memory, etc.) are listed in the paper. Offline evaluation measures the ratio \(R/R^*\) against an oracle optimum, and ACRL consistently achieves the highest ratio and fastest convergence compared with DQN, TRPO, DDQN, TD3, and A3C. Additional ablation studies confirm the benefits of the clipping function, DSSM embeddings, and Gaussian‑based importance sampling.
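The reported layer sizes can be assembled into a minimal forward pass. The state dimension and the actor's four‑value output head (mean and standard deviation for the two‑dimensional Gaussian) are assumptions for illustration; only the hidden widths come from the paper:

```python
import numpy as np

rng = np.random.default_rng(7)

def mlp(sizes):
    """Weight matrices for a plain fully connected network."""
    return [rng.normal(scale=0.05, size=(m, n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(weights, x):
    for w in weights[:-1]:
        x = np.maximum(x @ w, 0.0)   # ReLU hidden activations
    return x @ weights[-1]           # linear output head

STATE_DIM = 16                           # hypothetical input width
actor = mlp([STATE_DIM, 400, 100, 4])    # 400/100 hidden units per the paper;
                                         # 4 outputs = mu, sigma of 2-d Gaussian (assumed)
critic = mlp([STATE_DIM, 200, 100, 1])   # 200/100 hidden units; scalar state value

s = rng.normal(size=STATE_DIM)
dist_params = forward(actor, s)
value = forward(critic, s)
```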
Online A/B tests on the production platform show an 83.89% increase in ROI and an 89.57% rise in impressions without sacrificing efficiency. The ACRL system has been fully deployed since late 2021, influencing billions of daily ad requests for the mobile‑game market.
In conclusion, the study demonstrates that a PPO‑based actor‑critic model with Gaussian action sampling and context‑aware embeddings can effectively optimize RTB bidding strategies under realistic KPI constraints, delivering substantial gains in both offline simulations and live traffic.
IEG Growth Platform Technology Team
Official account of Tencent IEG Growth Platform Technology Team, showcasing cutting‑edge achievements across front‑end, back‑end, client, algorithm, testing and other domains.