
Model-Free Reinforcement Learning for ROI Optimization: Methods, Advertising Applications, and Tencent Game Advertising Practice

This article introduces the fundamentals of model‑free reinforcement learning, reviews mainstream solution methods such as Monte‑Carlo, temporal‑difference, n‑step TD, and eligibility‑trace methods, discusses their application in online advertising, and presents Tencent's game advertising practice, including algorithm choices, reward design, and experimental results.

IEG Growth Platform Technology Team

1. Model-Free Algorithms

1.1 Introduction

Reinforcement learning (RL) aims to learn a mapping from the current state to actions that maximizes cumulative reward. The agent discovers optimal behavior through trial and error without being told how to act, and each chosen action influences both immediate and future rewards. Formalizing RL as an optimal control problem over a Markov decision process (MDP), we distinguish model‑based RL (which builds a dynamics model) from model‑free RL (which learns directly from interaction).

Model‑based RL creates separate models for state transition and reward prediction, enabling offline planning but suffering from model bias in complex state spaces. Model‑free RL interacts with the environment in real time, avoiding model errors but requiring large amounts of data and training time.

1.2 Model-Free RL Methods

Unlike dynamic programming, model‑free prediction learns value functions directly from past agent‑environment interactions without knowing transition probabilities or reward functions.

1.2.1 Monte‑Carlo Methods

Monte‑Carlo (MC) methods estimate state values by averaging returns from complete episodes. The return Gₜ is defined as:

Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + … + γ^{T−t−1}R_T

Two variants exist: first‑visit MC and every‑visit MC.
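
As a sketch, first‑visit MC prediction can be implemented in a few lines. The episode format (a list of (state, reward) pairs, where the reward is the one received on leaving that state) and the function name are illustrative assumptions, not from the original article:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Estimate V(s) by averaging returns from the FIRST visit to each state."""
    returns = defaultdict(list)
    for episode in episodes:
        # Earliest index of each state in this episode.
        first_visit = {}
        for i, (s, _) in enumerate(episode):
            first_visit.setdefault(s, i)
        # Walk backwards so G accumulates the discounted return from each step.
        G_at = [0.0] * (len(episode) + 1)
        for i in range(len(episode) - 1, -1, -1):
            s, r = episode[i]
            G_at[i] = r + gamma * G_at[i + 1]
        for s, i in first_visit.items():
            returns[s].append(G_at[i])
    return {s: sum(g) / len(g) for s, g in returns.items()}
```

Every‑visit MC differs only in appending the return at every occurrence of a state, not just the first.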

1.2.2 Temporal‑Difference (TD) Learning

TD learning combines ideas from MC and dynamic programming. It updates value estimates online after each step using the TD target, which incorporates the immediate reward and the estimated value of the next state, thus exploiting the Markov property.

Key differences from MC:

Updates are online (no need to wait for episode termination).

TD leverages the Bellman expectation equation, resulting in lower variance but higher bias compared to MC.

TD is generally more sample‑efficient.
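
A single TD(0) update can be sketched as follows; the dictionary‑based value table and names are illustrative assumptions:

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """One TD(0) step: move V(s) toward the TD target r + gamma * V(s')."""
    v_next = 0.0 if terminal else V.get(s_next, 0.0)
    td_error = r + gamma * v_next - V.get(s, 0.0)  # delta_t
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```

Because the target bootstraps from V(s') instead of the full sampled return, each update can happen immediately after one transition.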

1.2.3 n‑step TD Methods

n‑step TD bridges MC (n = episode length) and one‑step TD (n = 1). The update rule is:

V_{t+n}(S_t) = V_{t+n‑1}(S_t) + α[ G_{t:t+n} – V_{t+n‑1}(S_t) ]

where G_{t:t+n} = R_{t+1} + γR_{t+2} + … + γ^{n‑1}R_{t+n} + γ^{n}V_{t+n‑1}(S_{t+n}) .
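
The n‑step return above translates directly into code; the helper below is a sketch assuming the rewards R_{t+1}..R_{t+n} have already been collected and V(S_{t+n}) is available for bootstrapping:

```python
def n_step_return(rewards, v_bootstrap, gamma, n):
    """G_{t:t+n} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})."""
    assert len(rewards) == n
    G = sum(gamma ** i * r for i, r in enumerate(rewards))
    return G + gamma ** n * v_bootstrap
```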

1.2.4 Eligibility‑Trace TD(λ)

TD(λ) combines all n‑step returns by weighting them with a decay parameter λ∈[0,1]. λ = 1 yields MC, λ = 0 yields one‑step TD. The TD target becomes a weighted sum of multi‑step returns, improving bias‑variance trade‑off.
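
In the backward view, each state carries an eligibility trace that decays by γλ per step, and every state's value is nudged by the current TD error in proportion to its trace. A minimal tabular sketch with accumulating traces (all names illustrative):

```python
def td_lambda_episode(episode, V, alpha, gamma, lam):
    """Backward-view TD(lambda) over one episode of (s, r, s_next, terminal) steps."""
    e = {}  # accumulating eligibility traces, state -> trace
    for s, r, s_next, terminal in episode:
        v_next = 0.0 if terminal else V.get(s_next, 0.0)
        delta = r + gamma * v_next - V.get(s, 0.0)  # TD error for this step
        e[s] = e.get(s, 0.0) + 1.0                  # bump the visited state's trace
        for state in list(e):
            V[state] = V.get(state, 0.0) + alpha * delta * e[state]
            e[state] *= gamma * lam                 # decay all traces
    return V
```

With lam=0 only the current state is updated (TD(0)); with lam=1 and γ=1 the total update to each state matches the Monte‑Carlo return.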

1.3 Model-Free Control

Control aims to find a policy that maximizes the value function. Two learning paradigms are used:

On‑policy: learning from data generated by the current policy.

Off‑policy: learning from data generated by a different (e.g., expert) policy.

Both rely on the Generalized Policy Iteration (GPI) framework.
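
The two paradigms differ only in the bootstrap target. A tabular sketch (dictionary‑keyed Q‑table, names illustrative) contrasting on‑policy SARSA with off‑policy Q‑learning:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the action the behavior policy actually took."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy: bootstrap from the greedy action, regardless of what was taken."""
    best = max((Q.get((s_next, b), 0.0) for b in actions), default=0.0)
    target = r + gamma * best
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```

GPI then alternates between these evaluation updates and (e.g., ε‑greedy) policy improvement.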

2. Applications of Reinforcement Learning in Advertising

Representative papers include:

Real‑time Bidding for Online Advertising: Measurement and Analysis

Optimal Real‑Time Bidding for Display Advertising

Real‑Time Bidding by Reinforcement Learning in Display Advertising

Deep Reinforcement Learning for Sponsored Search Real‑time Bidding

Budget‑Constrained Bidding by Model‑free Reinforcement Learning in Display Advertising

Deep Reinforcement Learning for Online Advertising in Recommender Systems

Optimized Cost per Mille in Feeds Advertising

Dynamic Pricing on E‑Commerce Platform with Deep Reinforcement Learning

3. Tencent Game Advertising Practice

3.1 Technical Choices

3.1.1 A3C

A3C is an asynchronous actor‑critic method in which multiple parallel actor‑learners each interact with their own copy of the environment and apply updates to a shared global network. The decorrelation provided by parallel exploration removes the need for DQN‑style experience replay and enables efficient training on commodity CPUs.
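
The asynchronous update pattern can be illustrated without neural networks. The sketch below uses plain Python threads and a toy gradient function, so every name here is illustrative rather than the production setup (a lock keeps the toy parameter update atomic; real A3C applies lock‑free Hogwild‑style updates):

```python
import threading

def a3c_style_workers(shared_params, grads_fn, n_workers=4, steps=100, lr=0.01):
    """Several workers compute gradients from their own rollouts and apply
    them to shared ("global") parameters without synchronization barriers."""
    lock = threading.Lock()

    def worker():
        for _ in range(steps):
            g = grads_fn(shared_params)  # gradient from this worker's own rollout
            with lock:                   # apply to the shared parameters
                for i, gi in enumerate(g):
                    shared_params[i] -= lr * gi

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_params
```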

3.1.2 PPO

PPO is a policy‑gradient method that stabilizes training by limiting policy updates with a clipped objective, offering easier implementation than TRPO while maintaining performance.
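
The clipped surrogate objective for a single sampled action can be written directly from its definition; this is a sketch of the objective term, not Tencent's implementation:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO surrogate for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A),
    where ratio = pi_new(a|s) / pi_old(a|s)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

Taking the minimum means the objective stops rewarding the policy for pushing the probability ratio beyond the clip range, which is what keeps updates conservative.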

3.1.3 Experimental Results

Due to sparse training data, A3C showed sensitivity to initialization. Switching to PPO yielded more stable learning and faster convergence, as illustrated by the cumulative reward curve below.

3.2 Reward Design Aligned with Business Goals

In advertising, the reward must reflect ROI while respecting budget constraints. A generalized ROI metric was introduced to address sparsity, and action‑matching rewards were applied to encourage alignment between predicted actions and historical outcomes.

Offline tests showed a 20% increase in cumulative reward using generalized ROI and a 15% increase with action‑matching rewards, along with more stable training.

3.2.1 Evaluating Single‑Step Reward

Direct ROI as reward is too sparse; a generalized ROI formulation provides denser feedback.
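
The article does not give the exact generalized‑ROI formula, so the sketch below only illustrates the idea of replacing sparse realized revenue with a weighted blend of denser proxy signals divided by spend; every name, signal, and weight here is a hypothetical assumption:

```python
def generalized_roi_reward(proxy_values, weights, cost):
    """Hypothetical denser reward: a weighted blend of early proxy signals
    (e.g. activations, retention) stands in for sparse realized revenue."""
    proxy_revenue = sum(w * v for w, v in zip(weights, proxy_values))
    return proxy_revenue / cost if cost > 0 else 0.0
```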

3.2.2 Incorporating Constraints Over Time

Cumulative positive actions are kept within budget limits; exceeding the constraint triggers resampling during training.
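
The resampling rule might be sketched as follows; the action costs, retry limit, and cheapest‑action fallback are illustrative assumptions, not the production logic:

```python
import random

def sample_action_with_budget(policy_probs, cost_of_action, spent, budget, rng=random):
    """Sample an action from the policy, resampling if the cumulative cost
    would exceed the budget; fall back to the cheapest action if retries fail."""
    for _ in range(10):  # bounded number of resampling attempts
        a = rng.choices(range(len(policy_probs)), weights=policy_probs)[0]
        if spent + cost_of_action[a] <= budget:
            return a
    return min(range(len(cost_of_action)), key=lambda i: cost_of_action[i])
```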

Other References

Qin Zhihui, Li Ning. "A Survey of Model‑Free Reinforcement Learning." Computer Science, 2007.

Volodymyr Mnih, Adrià Puigdomènech Badia, et al. "Asynchronous Methods for Deep Reinforcement Learning." arXiv:1602.01783.

John Schulman, Filip Wolski, et al. "Proximal Policy Optimization Algorithms." arXiv:1707.06347.
