How Tencent Boosted Ad Experience by Up to 20% Using Reinforcement‑Learning‑Based Ranking
Tencent's ad tech team redesigned its ad ranking system by adding a parallel user-experience-optimized pipeline and evolving parameter tuning from manual adjustment through automated CEM search to DDPG-based reinforcement learning, achieving 10-20% improvements in CTR, repeat-view rates, and other experience metrics while maintaining overall spend.
Background and Problem
Ad ranking on Tencent's platform traditionally prioritized short-term eCPM (effective cost per mille) maximization, often sacrificing user experience: ads felt pushed rather than recommended. High-value but low-experience ads could dominate impressions, harming metrics such as click-through rate (CTR), repeat-view rate, and fast-scroll rate.
New Parallel Ranking Mechanism
The team introduced a parallel ranking track explicitly optimized for user experience. This redesign incorporated multi‑objective ranking formulas with controllable parameters for revenue and experience, atomic target factors, and fine‑grained control at both user‑segment and request levels.
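As a rough illustration, such a controllable formula might combine revenue and atomic experience factors multiplicatively, with parameters that can differ per user segment or per request. The functional form, names, and values below are assumptions for illustration, not Tencent's production formula:

```python
def rank_score(ecpm, experience_factors, alpha, betas):
    """Hypothetical multi-objective score: eCPM^alpha * prod(factor_i^beta_i).

    alpha and the per-factor betas are the controllable parameters that
    the tuning stages (manual -> CEM -> RL) adjust, potentially at
    user-segment or request granularity.
    """
    score = ecpm ** alpha
    for name, value in experience_factors.items():
        score *= value ** betas[name]
    return score

# Example: an experience-leaning segment weights relevance more heavily.
score = rank_score(
    ecpm=12.0,
    experience_factors={"pctr": 0.03, "relevance": 0.8},
    alpha=1.0,
    betas={"pctr": 0.3, "relevance": 0.5},
)
```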
Evolution of Parameter Tuning
Initially, parameters were tuned manually.
Later, an automated CEM (Cross‑Entropy Method) search replaced manual tuning.
Most recently, reinforcement learning (RL) has been piloted on the Official Accounts placement, improving contextual relevance by over 15% while keeping overall ad spend stable.
First‑Stage Control Algorithm: CEM Search
The CEM approach was enhanced to run at hour‑level granularity using a simulation system, enabling rapid parameter discovery. Reward functions initially treated all metrics equally, but later distinguished core metrics from guard‑rail metrics.
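For readers unfamiliar with CEM, a minimal version of the search loop looks like the following; the population size, elite fraction, and the `evaluate` callback (which here stands in for the hour-level simulation system) are illustrative assumptions:

```python
import numpy as np

def cem_search(evaluate, dim, iters=24, pop=64, elite_frac=0.125, seed=0):
    """Minimal Cross-Entropy Method over a vector of ranking parameters.

    evaluate(theta) -> scalar reward; in this setting the reward would
    come from the simulation system rather than live traffic.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))   # sample candidates
        rewards = np.array([evaluate(s) for s in samples])
        elites = samples[np.argsort(rewards)[-n_elite:]]   # keep top performers
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu  # converged parameter vector
```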
Second‑Stage Control Algorithm: Reinforcement Learning
To overcome CEM’s limitations—insufficient personalization granularity and short‑term focus—the team adopted a multi‑objective RL framework based on DDPG (Deep Deterministic Policy Gradient). The RL pipeline treats a user’s N‑day ad exposure as a trajectory and maximizes cumulative reward.
Trajectory
Each user’s ad exposure sequence is modeled as a decision trajectory; the RL agent aims to maximize the total reward over the trajectory.
State
State features include user profile vectors, contextual signals, and ad-queue characteristics, together giving a comprehensive picture of the user's current context and preferences.
Action
The RL agent outputs adjustments to the parameters previously produced by CEM, aligning the dimensionality with CEM’s output while allowing finer control.
Actor/Critic Architecture
Actor (dual‑head): shares a bottom‑level embedding, then splits into two branches that are trained separately and merged to produce the final action.
Critic: Takes (State, Action) as input and consists of multiple MLP layers.
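A minimal Keras sketch of this shape follows; the layer sizes and the concatenation merge are assumptions, since the real architecture details are not published:

```python
import tensorflow as tf

def build_actor(state_dim, action_dim, emb_dim=128):
    """Dual-head actor: shared bottom embedding, two branches, merged action."""
    state = tf.keras.Input(shape=(state_dim,))
    shared = tf.keras.layers.Dense(emb_dim, activation="relu")(state)
    head_a = tf.keras.layers.Dense(64, activation="relu")(shared)  # branch 1
    head_b = tf.keras.layers.Dense(64, activation="relu")(shared)  # branch 2
    merged = tf.keras.layers.Concatenate()([head_a, head_b])
    action = tf.keras.layers.Dense(action_dim, activation="tanh")(merged)
    return tf.keras.Model(state, action)

def build_critic(state_dim, action_dim):
    """Critic: Q(state, action) via stacked MLP layers."""
    state = tf.keras.Input(shape=(state_dim,))
    action = tf.keras.Input(shape=(action_dim,))
    x = tf.keras.layers.Concatenate()([state, action])
    for units in (256, 128, 64):
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    q_value = tf.keras.layers.Dense(1)(x)
    return tf.keras.Model([state, action], q_value)
```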
Reward Design
Rewards are split into core metrics (e.g., contextual relevance, CTR) and guard‑rail metrics (e.g., spend stability, repeat‑view rate). The objective is to maximize lift in the core metrics while ensuring the guard‑rail metrics do not degrade.
Two reward formulations were explored, sketched in code after the list:
MinLift: Optimizes the worst‑performing guard‑rail metric, lifting it whenever it becomes the bottleneck.
LogSumExp: A smooth, non‑linear function that penalizes any decrease in guard‑rail metrics exponentially, providing a more balanced trade‑off.
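As a minimal sketch, treating each metric as a relative lift over a baseline, the two formulations might look like this; the soft-min form of LogSumExp and the scaling constants are assumptions:

```python
import numpy as np

def reward_minlift(core_lift, guard_lifts, lam=1.0):
    # MinLift: reward the core lift plus the worst guard-rail lift, so
    # whichever guard-rail metric is the bottleneck gets lifted first.
    return core_lift + lam * np.min(guard_lifts)

def reward_logsumexp(core_lift, guard_lifts, tau=0.1, lam=1.0):
    # LogSumExp: a smooth soft-min over guard-rail lifts; any drop in a
    # guard-rail metric is penalized (near-)exponentially, giving a more
    # balanced trade-off than the hard min.
    guard_lifts = np.asarray(guard_lifts)
    soft_min = -tau * np.log(np.sum(np.exp(-guard_lifts / tau)))
    return core_lift + lam * soft_min
```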
Engineering Practice
To ensure stable deployment, the team upgraded data pipelines, training frameworks, and inference services:
Sample side: Built a Mixer‑track based data path, added aligned feature streams, and packaged samples as SequenceExample (see the packing sketch after this list).
Training side: Wrapped the tf‑agents platform, supporting model checkpointing, cold‑backup, graph consistency, and feature importance analysis.
Inference side: Supported SequenceExample inputs and RL inference with feature consistency checks.
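Packing a trajectory into `tf.train.SequenceExample` might look as follows; the feature names (`user_id`, `state`, `action`, `reward`) are illustrative, not the production schema:

```python
import tensorflow as tf

def pack_trajectory(user_id, states, actions, rewards):
    """Serialize one user's N-day trajectory as a tf.train.SequenceExample."""
    def float_feature(values):
        return tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))
    # Static, per-user fields go in the context.
    context = tf.train.Features(feature={
        "user_id": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[user_id.encode()]))
    })
    # Per-step fields go in aligned feature lists, one entry per timestep.
    feature_lists = tf.train.FeatureLists(feature_list={
        "state": tf.train.FeatureList(feature=[float_feature(s) for s in states]),
        "action": tf.train.FeatureList(feature=[float_feature(a) for a in actions]),
        "reward": tf.train.FeatureList(feature=[float_feature([r]) for r in rewards]),
    })
    return tf.train.SequenceExample(
        context=context, feature_lists=feature_lists).SerializeToString()
```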
Algorithmic Challenges and Solutions
Challenge 1: Large Action Space
Parameter ranges spanned five orders of magnitude, causing convergence issues. The solution was to let the RL agent learn the *change* in parameters rather than absolute values, and to apply normalization, gradient clipping, and adaptive task weighting.
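One way to realize "learn the change rather than the absolute value" is a bounded multiplicative update in log-space, which treats parameters spanning several orders of magnitude uniformly; the log-space choice and the bound below are assumptions:

```python
import numpy as np

def apply_action(cem_params, delta, max_log_change=0.2):
    # The agent outputs a bounded *change* applied on top of the CEM
    # baseline. Working in log-space means a given delta scales a 1e-3
    # parameter and a 1e2 parameter by the same relative amount.
    delta = np.clip(delta, -1.0, 1.0)            # actor's tanh output range
    return cem_params * np.exp(delta * max_log_change)
```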
Challenge 2: Saturated tanh Activation
During training, the actor's tanh outputs became stuck at -1 or 1, where gradients vanish. Inserting a BatchNorm layer before the tanh kept its inputs near zero, preventing saturation.
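Concretely, the fix amounts to inserting normalization between the final linear layer and the activation, as in this Keras sketch:

```python
import tensorflow as tf

def bounded_head(x, action_dim):
    """Final actor head with BatchNorm inserted before tanh, keeping
    pre-activations near zero and away from the saturated regions."""
    x = tf.keras.layers.Dense(action_dim)(x)      # linear, no activation yet
    x = tf.keras.layers.BatchNormalization()(x)   # normalize pre-activations
    return tf.keras.layers.Activation("tanh")(x)
```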
Challenge 3: Multi‑Objective Reward Balancing
Initial reward functions required extensive manual weighting. The MinLift approach lifted the worst guard‑rail metric, while LogSumExp provided a smoother penalty for any guard‑rail degradation, leading to better overall performance.
Results and Deployment
Two pilot deployments demonstrated significant gains:
Official Accounts pilot: contextual relevance increased by 14%.
Tencent News pilot: CTR improved by 0.65%.
Future Directions
Planned work includes:
Adopting a Multi‑Critic architecture to achieve Pareto‑optimal solutions without manual tuning.
Exploring end‑to‑end learning that directly outputs ranking scores, removing linear formula constraints.
