How Tencent Boosted Ad Experience by Up to 20% Using Reinforcement‑Learning‑Based Ranking
Tencent's ad tech team redesigned its ad ranking system by adding a parallel user-experience-optimized pipeline and evolving parameter tuning from manual adjustment through automated CEM search to DDPG-based reinforcement learning, achieving 10-20% improvements in CTR, repeat-view rates, and other experience metrics while maintaining overall spend.
Background and Problem
Ad ranking on Tencent's platform traditionally prioritized short-term eCPM (effective cost per mille) maximization, often sacrificing user experience: ads felt pushed rather than recommended. High-value but low-experience ads could dominate impressions, harming metrics such as click-through rate (CTR), repeat-view rate, and fast-scroll rate.
New Parallel Ranking Mechanism
The team introduced a parallel ranking track explicitly optimized for user experience. This redesign incorporated multi‑objective ranking formulas with controllable parameters for revenue and experience, atomic target factors, and fine‑grained control at both user‑segment and request levels.
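As a rough illustration, such a controllable formula might combine revenue and atomic experience factors multiplicatively, with parameters that can differ per user segment or per request. The functional form, names, and values below are assumptions for illustration, not Tencent's production formula:

```python
def rank_score(ecpm, experience_factors, alpha, betas):
    """Hypothetical multi-objective score: eCPM^alpha * prod(factor_i^beta_i).

    alpha and the per-factor betas are the controllable parameters that
    the tuning stages (manual -> CEM -> RL) adjust, potentially at
    user-segment or request granularity.
    """
    score = ecpm ** alpha
    for name, value in experience_factors.items():
        score *= value ** betas[name]
    return score

# Example: an experience-leaning segment weights relevance more heavily.
score = rank_score(
    ecpm=12.0,
    experience_factors={"pctr": 0.03, "relevance": 0.8},
    alpha=1.0,
    betas={"pctr": 0.3, "relevance": 0.5},
)
```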
Evolution of Parameter Tuning
Initially, parameters were tuned manually.
Later, an automated CEM (Cross‑Entropy Method) search replaced manual tuning.
Most recently, reinforcement learning (RL) has been piloted on the Official Accounts placement, improving contextual relevance by over 15% while keeping overall ad spend stable.
First‑Stage Control Algorithm: CEM Search
The CEM approach was enhanced to run at hour‑level granularity using a simulation system, enabling rapid parameter discovery. Reward functions initially treated all metrics equally, but later distinguished core metrics from guard‑rail metrics.
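For readers unfamiliar with CEM, a minimal version of the search loop looks like the following; the population size, elite fraction, and the `evaluate` callback (which here stands in for the hour-level simulation system) are illustrative assumptions:

```python
import numpy as np

def cem_search(evaluate, dim, iters=24, pop=64, elite_frac=0.125, seed=0):
    """Minimal Cross-Entropy Method over a vector of ranking parameters.

    evaluate(theta) -> scalar reward; in this setting the reward would
    come from the simulation system rather than live traffic.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))   # sample candidates
        rewards = np.array([evaluate(s) for s in samples])
        elites = samples[np.argsort(rewards)[-n_elite:]]   # keep top performers
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu  # converged parameter vector
```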
Second‑Stage Control Algorithm: Reinforcement Learning
To overcome CEM’s limitations—insufficient personalization granularity and short‑term focus—the team adopted a multi‑objective RL framework based on DDPG (Deep Deterministic Policy Gradient). The RL pipeline treats a user’s N‑day ad exposure as a trajectory and maximizes cumulative reward.
Trajectory
Each user’s ad exposure sequence is modeled as a decision trajectory; the RL agent aims to maximize the total reward over the trajectory.
State
State features include user profile vectors, contextual signals, and ad-queue characteristics, together giving a comprehensive picture of the user's current context and preferences.
Action
The RL agent outputs adjustments to the parameters previously produced by CEM, aligning the dimensionality with CEM’s output while allowing finer control.
Actor/Critic Architecture
Actor (dual‑head): shares a bottom‑level embedding, then splits into two branches that are trained separately and merged to produce the final action.
Critic: Takes (State, Action) as input and consists of multiple MLP layers.
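A minimal Keras sketch of this shape follows; the layer sizes and the concatenation merge are assumptions, since the real architecture details are not published:

```python
import tensorflow as tf

def build_actor(state_dim, action_dim, emb_dim=128):
    """Dual-head actor: shared bottom embedding, two branches, merged action."""
    state = tf.keras.Input(shape=(state_dim,))
    shared = tf.keras.layers.Dense(emb_dim, activation="relu")(state)
    head_a = tf.keras.layers.Dense(64, activation="relu")(shared)  # branch 1
    head_b = tf.keras.layers.Dense(64, activation="relu")(shared)  # branch 2
    merged = tf.keras.layers.Concatenate()([head_a, head_b])
    action = tf.keras.layers.Dense(action_dim, activation="tanh")(merged)
    return tf.keras.Model(state, action)

def build_critic(state_dim, action_dim):
    """Critic: Q(state, action) via stacked MLP layers."""
    state = tf.keras.Input(shape=(state_dim,))
    action = tf.keras.Input(shape=(action_dim,))
    x = tf.keras.layers.Concatenate()([state, action])
    for units in (256, 128, 64):
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    q_value = tf.keras.layers.Dense(1)(x)
    return tf.keras.Model([state, action], q_value)
```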
Reward Design
Rewards are split into core metrics (e.g., contextual relevance, CTR) and guard‑rail metrics (e.g., spend stability, repeat‑view rate). The objective is to maximize lift in the core metrics while ensuring the guard‑rail metrics do not degrade.
Two reward formulations were explored, sketched in code after the list:
MinLift: Optimizes the worst‑performing guard‑rail metric, lifting it whenever it becomes the bottleneck.
LogSumExp: A smooth, non‑linear function that penalizes any decrease in guard‑rail metrics exponentially, providing a more balanced trade‑off.
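As a minimal sketch, treating each metric as a relative lift over a baseline, the two formulations might look like this; the soft-min form of LogSumExp and the scaling constants are assumptions:

```python
import numpy as np

def reward_minlift(core_lift, guard_lifts, lam=1.0):
    # MinLift: reward the core lift plus the worst guard-rail lift, so
    # whichever guard-rail metric is the bottleneck gets lifted first.
    return core_lift + lam * np.min(guard_lifts)

def reward_logsumexp(core_lift, guard_lifts, tau=0.1, lam=1.0):
    # LogSumExp: a smooth soft-min over guard-rail lifts; any drop in a
    # guard-rail metric is penalized (near-)exponentially, giving a more
    # balanced trade-off than the hard min.
    guard_lifts = np.asarray(guard_lifts)
    soft_min = -tau * np.log(np.sum(np.exp(-guard_lifts / tau)))
    return core_lift + lam * soft_min
```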
Engineering Practice
To ensure stable deployment, the team upgraded data pipelines, training frameworks, and inference services:
Sample side: Built a Mixer‑track based data path, added aligned feature streams, and packaged samples as SequenceExample (see the packing sketch after this list).
Training side: Wrapped the tf‑agents platform, supporting model checkpointing, cold‑backup, graph consistency, and feature importance analysis.
Inference side: Supported SequenceExample inputs and RL inference with feature consistency checks.
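Packing a trajectory into `tf.train.SequenceExample` might look as follows; the feature names (`user_id`, `state`, `action`, `reward`) are illustrative, not the production schema:

```python
import tensorflow as tf

def pack_trajectory(user_id, states, actions, rewards):
    """Serialize one user's N-day trajectory as a tf.train.SequenceExample."""
    def float_feature(values):
        return tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))
    # Static, per-user fields go in the context.
    context = tf.train.Features(feature={
        "user_id": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[user_id.encode()]))
    })
    # Per-step fields go in aligned feature lists, one entry per timestep.
    feature_lists = tf.train.FeatureLists(feature_list={
        "state": tf.train.FeatureList(feature=[float_feature(s) for s in states]),
        "action": tf.train.FeatureList(feature=[float_feature(a) for a in actions]),
        "reward": tf.train.FeatureList(feature=[float_feature([r]) for r in rewards]),
    })
    return tf.train.SequenceExample(
        context=context, feature_lists=feature_lists).SerializeToString()
```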
Algorithmic Challenges and Solutions
Challenge 1: Large Action Space
Parameter ranges spanned five orders of magnitude, causing convergence issues. The solution was to let the RL agent learn the *change* in parameters rather than absolute values, and to apply normalization, gradient clipping, and adaptive task weighting.
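One way to realize "learn the change rather than the absolute value" is a bounded multiplicative update in log-space, which treats parameters spanning several orders of magnitude uniformly; the log-space choice and the bound below are assumptions:

```python
import numpy as np

def apply_action(cem_params, delta, max_log_change=0.2):
    # The agent outputs a bounded *change* applied on top of the CEM
    # baseline. Working in log-space means a given delta scales a 1e-3
    # parameter and a 1e2 parameter by the same relative amount.
    delta = np.clip(delta, -1.0, 1.0)            # actor's tanh output range
    return cem_params * np.exp(delta * max_log_change)
```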
Challenge 2: Saturated tanh Activation
During training, the actor's tanh outputs became stuck at -1 or 1, where gradients vanish. Inserting a BatchNorm layer before the tanh kept its inputs near zero, preventing saturation.
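Concretely, the fix amounts to inserting normalization between the final linear layer and the activation, as in this Keras sketch:

```python
import tensorflow as tf

def bounded_head(x, action_dim):
    """Final actor head with BatchNorm inserted before tanh, keeping
    pre-activations near zero and away from the saturated regions."""
    x = tf.keras.layers.Dense(action_dim)(x)      # linear, no activation yet
    x = tf.keras.layers.BatchNormalization()(x)   # normalize pre-activations
    return tf.keras.layers.Activation("tanh")(x)
```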
Challenge 3: Multi‑Objective Reward Balancing
Initial reward functions required extensive manual weighting. The MinLift approach lifted the worst guard‑rail metric, while LogSumExp provided a smoother penalty for any guard‑rail degradation, leading to better overall performance.
Results and Deployment
Two pilot deployments demonstrated significant gains:
Official Accounts pilot: contextual relevance increased by 14%.
Tencent News pilot: CTR improved by 0.65%.
Future Directions
Planned work includes:
Adopting a Multi‑Critic architecture to achieve Pareto‑optimal solutions without manual tuning.
Exploring end‑to‑end learning that directly outputs ranking scores, removing linear formula constraints.
