How Tencent Boosted Ad Experience by Up to 20% Using Reinforcement‑Learning‑Based Ranking

Tencent's ad tech team redesigned its ad ranking system by adding a parallel user‑experience‑optimized pipeline and evolving from manual CEM tuning to DDPG‑based reinforcement learning, achieving 10‑20% improvements in CTR, repeat‑view rates, and other experience metrics while maintaining overall spend.

Tencent Advertising Technology

Background and Problem

Traditional ad ranking in Tencent’s platform prioritized short‑term eCPM maximization, often sacrificing user experience and producing a "non‑recommendation" feel. High‑value but low‑experience ads could dominate impressions, harming metrics such as click‑through rate (CTR), repeat views, and fast‑scroll rates.

New Parallel Ranking Mechanism

The team introduced a parallel ranking track explicitly optimized for user experience. This redesign incorporated multi‑objective ranking formulas with controllable parameters for revenue and experience, atomic target factors, and fine‑grained control at both user‑segment and request levels.

Evolution of Parameter Tuning

Initially, parameters were tuned manually.

Later, an automated CEM (Cross‑Entropy Method) search replaced manual tuning.

Most recently, reinforcement learning (RL) has been piloted in public accounts, improving contextual relevance by over 15% while keeping overall consumption stable.

First‑Stage Control Algorithm: CEM Search

The CEM approach was enhanced to run at hour‑level granularity using a simulation system, enabling rapid parameter discovery. Reward functions initially treated all metrics equally, but later distinguished core metrics from guard‑rail metrics.
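The core loop of CEM is simple: sample candidate parameter vectors from a Gaussian, score each one, and refit the distribution to the top performers. A minimal numpy sketch is below; the toy quadratic reward stands in for the team's hour-level simulation system, and the dimensions and hyperparameters are illustrative, not taken from the article.

```python
import numpy as np

def cem_search(reward_fn, dim, iters=20, pop=64, elite_frac=0.125, seed=0):
    """Cross-Entropy Method: iteratively refit a Gaussian to the elite samples."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        candidates = rng.normal(mu, sigma, size=(pop, dim))   # sample a population
        rewards = np.array([reward_fn(c) for c in candidates])  # score via simulator
        elites = candidates[np.argsort(rewards)[-n_elite:]]     # keep the best
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 0.05  # noise floor avoids premature collapse
    return mu

# Toy "simulator": reward peaks at a known parameter setting.
target = np.array([0.5, -0.3, 0.2])
best = cem_search(lambda p: -np.sum((p - target) ** 2), dim=3)
```

With an hour-level simulator in the loop, each iteration's population can be evaluated quickly enough to discover workable parameters far faster than manual tuning.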

Second‑Stage Control Algorithm: Reinforcement Learning

To overcome CEM’s limitations—insufficient personalization granularity and short‑term focus—the team adopted a multi‑objective RL framework based on DDPG (Deep Deterministic Policy Gradient). The RL pipeline treats a user’s N‑day ad exposure as a trajectory and maximizes cumulative reward.

Trajectory

Each user’s ad exposure sequence is modeled as a decision trajectory; the RL agent aims to maximize the total reward over the trajectory.

Trajectory modeling diagram
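Under this formulation, the quantity being maximized is the cumulative reward summed along one user's exposure sequence. A small sketch, assuming a standard discount factor (the article does not specify whether or how future rewards are discounted):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Cumulative (discounted) reward over one user's ad-exposure trajectory."""
    g = 0.0
    for r in reversed(rewards):  # fold backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

# Hypothetical per-impression experience rewards over an N-day window.
daily_rewards = [0.2, 0.0, 0.5, 0.1]
ret = discounted_return(daily_rewards, gamma=0.9)
```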

State

State features include user profile vectors, contextual signals, and ad‑queue characteristics, giving the agent a comprehensive picture of the user's preferences and of the candidate ads whose ranking it adjusts.

Action

The RL agent outputs adjustments to the parameters previously produced by CEM, aligning the dimensionality with CEM’s output while allowing finer control.

Actor/Critic Architecture

Actor (dual‑head): shares a bottom‑level embedding, then splits into two branches that are trained separately before their outputs are merged.

Critic: Takes (State, Action) as input and consists of multiple MLP layers.

Actor‑Critic network diagram
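The structure described above can be sketched as plain numpy forward passes. Layer widths, the action dimension, and the random weights are all placeholders for illustration; the team's actual models were built on the tf‑agents stack.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, n_out, activation=np.tanh):
    """One randomly initialized dense layer (weights are placeholders)."""
    W = rng.normal(0.0, 0.1, (x.shape[-1], n_out))
    return activation(x @ W) if activation else x @ W

def actor(state, action_dim=8):
    shared = dense(state, 32)                                   # shared bottom embedding
    head_a = dense(dense(shared, 16), action_dim // 2, activation=None)  # branch 1
    head_b = dense(dense(shared, 16), action_dim // 2, activation=None)  # branch 2
    return np.tanh(np.concatenate([head_a, head_b], axis=-1))   # merged, bounded action

def critic(state, action):
    x = np.concatenate([state, action], axis=-1)                # (State, Action) input
    return dense(dense(dense(x, 32), 16), 1, activation=None)   # scalar Q-value

state = rng.normal(size=(4, 20))   # batch of user/context/ad-queue features
action = actor(state)
q_value = critic(state, action)
```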

Reward Design

Rewards are split into core metrics (e.g., contextual relevance, CTR) and guard‑rail metrics (e.g., consumption stability, repeat‑view rate). The objective is to maximize core metric lift while ensuring guard‑rail metrics do not degrade.

Two reward formulations were explored:

MinLift: Optimizes the worst‑performing guard‑rail metric, lifting it whenever it becomes the bottleneck.

LogSumExp: A smooth, non‑linear function that penalizes any decrease in guard‑rail metrics exponentially, providing a more balanced trade‑off.

Reward function comparison
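The two formulations might be sketched as follows. The article does not give the exact formulas, so this treats MinLift as "core lift plus the worst guard‑rail lift" and LogSumExp as a soft minimum over guard‑rail lifts; both interpretations are assumptions.

```python
import numpy as np

def min_lift(core_lift, guard_lifts):
    """Reward = core lift plus the worst guard-rail lift (the bottleneck metric)."""
    return core_lift + np.min(guard_lifts)

def logsumexp_reward(core_lift, guard_lifts, k=10.0):
    """Smooth variant: LogSumExp acts as a soft minimum over guard-rail lifts,
    penalizing any drop exponentially rather than only the single worst one."""
    soft_min = -np.log(np.sum(np.exp(-k * np.asarray(guard_lifts)))) / k
    return core_lift + soft_min

guards = [0.02, -0.01, 0.00]   # hypothetical relative lifts of guard-rail metrics
r_min = min_lift(0.15, guards)
r_lse = logsumexp_reward(0.15, guards)
```

Because the soft minimum aggregates every guard-rail metric instead of reacting only to the current worst one, its gradient signal is smoother, which matches the "more balanced trade-off" the team reported.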

Engineering Practice

To ensure stable deployment, the team upgraded data pipelines, training frameworks, and inference services:

Sample side: Built a Mixer‑track based data path, added aligned feature streams, and packaged samples as SequenceExample.

Training side: Wrapped the tf‑agents platform, supporting model checkpointing, cold‑backup, graph consistency, and feature importance analysis.

Inference side: Supported SequenceExample inputs and RL inference with feature consistency checks.

RL training and AMS integration diagram

Algorithmic Challenges and Solutions

Challenge 1: Large Action Space

Parameter ranges spanned five orders of magnitude, causing convergence issues. The solution was to let the RL agent learn the *change* in parameters rather than absolute values, and to apply normalization, gradient clipping, and adaptive task weighting.

Action space constraint diagram
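One way to realize "learn the change rather than the absolute value" when parameters span several orders of magnitude is a bounded adjustment applied in log‑space around the CEM baseline. The parameterization below is illustrative, not the team's published formula:

```python
import numpy as np

def apply_delta(cem_params, delta, max_log10_step=0.5):
    """Interpret the bounded actor output (in [-1, 1]) as a multiplicative
    adjustment in log-space, so a single step moves each parameter by at most
    half an order of magnitude around its CEM baseline."""
    delta = np.clip(delta, -1.0, 1.0)          # clip to the actor's output range
    return cem_params * 10.0 ** (delta * max_log10_step)

baseline = np.array([1e-3, 1e-1, 1e2])         # CEM output spanning orders of magnitude
adjusted = apply_delta(baseline, np.array([0.0, 1.0, -1.0]))
```

Learning a bounded relative change keeps every action on a comparable scale regardless of the underlying parameter's magnitude, which is exactly what makes the optimization tractable when raw values differ by factors of 10^5.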

Challenge 2: Saturated tanh Activation

tanh outputs were saturating at −1 or 1 because the layer's pre‑activations were too large. Adding a BatchNorm layer immediately before the tanh standardized its inputs, preventing saturation.

tanh activation distribution before/after BN
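The effect is easy to reproduce: large pre‑activations pin tanh at ±1 and kill its gradients, while standardizing the batch first restores a usable output distribution. A numpy illustration of training‑time batch normalization (the learned scale/shift parameters are omitted for brevity):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Standardize each feature over the batch (scale/shift omitted for brevity)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
pre_act = rng.normal(loc=50.0, scale=5.0, size=(256, 8))  # large pre-activations

saturated = np.tanh(pre_act)             # virtually every output pinned at +1
healthy = np.tanh(batch_norm(pre_act))   # outputs spread across (-1, 1)
```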

Challenge 3: Multi‑Objective Reward Balancing

Initial reward functions required extensive manual weighting. The MinLift approach lifted the worst guard‑rail metric, while LogSumExp provided a smoother penalty for any guard‑rail degradation, leading to better overall performance.

MinLift vs LogSumExp reward curves

Results and Deployment

Two pilot deployments demonstrated significant gains:

Public‑account pilot: contextual relevance increased by 14%.

Tencent News pilot: CTR improved by 0.65%.

Future Directions

Planned work includes:

Adopting a Multi‑Critic architecture to achieve Pareto‑optimal solutions without manual tuning.

Exploring end‑to‑end learning that directly outputs ranking scores, removing linear formula constraints.

